Isolating Compute at Scale

Posted on Jun 2, 2023

I wanted to build a blazingly fast on-demand isolated environment for running untrusted code at scale for my startup. In this post, I will describe my journey of building such a system.

The Problem

The problem statement is quite simple. I want to let users run untrusted code in an isolated environment. The code could be anything, from a simple hello world program to a vulnerable web application. The requirements for such a system are as follows.

  • Isolated environment
  • Fast boot times
  • Scalable system
  • Cost effective
  • Easy to maintain
  • Less engineering effort

There are a lot of startups, like Zapier, Replit and StackBlitz, that require on-demand isolated environments for running user workloads. They all have their own custom solutions for this problem. In my case, I wanted to run vulnerable code as a service so people can learn security by breaking it safely.

Attempt #1 - Virtual Machines

A naive architectural approach to isolated environments is to spin up a Virtual Machine (VM) for each user. We can achieve this using IaC tools like AWS CDK. VMs provide great isolation, but they are often slow to boot. A few optimizations for speeding up boot times are as follows.

  • Custom VM images.
  • Set up a pool of VMs in an auto-scaling group. Based on the traffic, we can scale the number of idle VMs up or down (with a cap).

Although this approach works, the scaling is not cost effective. VMs are expensive, and we are paying for the idle VMs. Additionally, compute resources are not utilized efficiently.

The following diagram shows the architecture of this approach.

Figure: VM based Isolation at Scale

Users provision VMs by hitting an AWS Lambda function. The function checks whether a VM is available, schedules instance creation if necessary, updates the database, updates the nginx configuration and returns a subdomain pointing to that instance. Also, notice there isn't a master node. The Lambda acts as a serverless master overseeing the orchestration.
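
The provisioning flow above can be sketched roughly as follows. This is a minimal illustration, not the actual Lambda handler: `VmPool`, `provision` and the `labs.example.com` domain are hypothetical names, and an in-memory object stands in for the database and the nginx update.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class VmPool:
    """In-memory stand-in for the database of pre-warmed VMs (hypothetical)."""
    idle: List[str] = field(default_factory=list)
    assigned: Dict[str, str] = field(default_factory=dict)  # subdomain -> instance id
    pending_launches: int = 0


def provision(pool: VmPool, base_domain: str = "labs.example.com") -> dict:
    """Hand out an idle VM if one exists, otherwise schedule a new instance."""
    if not pool.idle:
        # No warm instance is available: record that one must be launched
        # and let the caller retry later.
        pool.pending_launches += 1
        return {"status": "pending"}

    instance_id = pool.idle.pop()
    subdomain = f"{uuid.uuid4().hex[:8]}.{base_domain}"
    # "Update the database": remember which instance backs this subdomain.
    pool.assigned[subdomain] = instance_id
    # A real handler would also regenerate the nginx configuration here so
    # that the subdomain proxies to the instance.
    return {"status": "ready", "subdomain": subdomain}
```

Calling `provision` on a pool with one idle instance returns a `ready` response with a fresh subdomain; calling it again on the now-empty pool returns `pending` and bumps the launch counter.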

The takeaways from this approach are as follows.

I built a prototype of this on AWS, but soon realised it would make me go broke, so I had to find a better solution.

Attempt #2 - Containers and Custom Orchestration

A better approach might be to use containers, for example with Docker. Containers are known for their fast boot times, and they are cost effective. They don't isolate as well as VMs because there's no separation at the hypervisor level. In fact, containers share the host machine's kernel. But we can achieve better isolation in containers using gVisor, Kata Containers or nsjail.

gVisor is an application kernel for containers. It provides an Open Container Initiative (OCI) compliant runtime called runsc that provides an isolation boundary between the application and the host kernel. It sits in-between them, intercepting application system calls and emulating them using host kernel primitives. gVisor is used by Google Cloud for their serverless platform.

Kata Containers provides kata-runtime, which is also an OCI compliant runtime. It executes containers as lightweight virtual machines. Each container gets its own kernel, so there's no shared host kernel anymore. It uses KVM behind the scenes for virtualization.

The above two isolation solutions are OCI compliant runtimes, but nsjail isn't. Nsjail is a process isolation tool for Linux. It utilizes namespaces (like Docker), seccomp-bpf and capabilities to sandbox processes in a lightweight fashion. We can write rules to restrict the process from accessing the host system. Google's CTF infrastructure kCTF uses nsjail for sandboxing.

Now for the orchestration part, I wanted to roll out my own. I named it "Arc" and the architecture is as follows.

Figure: Arc Container Orchestrator

In the above orchestration architecture, there are a bunch of EC2 instances in a capped auto-scaling group. All the nodes have Arc agents installed on them. On startup, each node runs a Docker container with the Arc agent. Master nodes are manually configured as Arc masters, and the orchestration data is stored in etcd.

Worker nodes send gRPC heartbeats to the master node every second. The master node maintains a list of active nodes and their resources.
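
As a rough sketch of the bookkeeping on the master side, the registry below records the last heartbeat per node and treats a node as active only while its heartbeat is recent. The class name, the resource fields and the 3 second timeout are assumptions made for illustration, and the gRPC transport is left out entirely.

```python
import time
from typing import Dict, Optional, Tuple


class NodeRegistry:
    """Tracks the last heartbeat and reported free resources per worker node."""

    def __init__(self, timeout_s: float = 3.0) -> None:
        # Assumed policy: a node is dropped after missing ~3 one-second beats.
        self.timeout_s = timeout_s
        self._last_seen: Dict[str, Tuple[float, dict]] = {}

    def heartbeat(self, node_id: str, free_cpu: float, free_mem_mb: int,
                  now: Optional[float] = None) -> None:
        """Record a heartbeat; `now` can be injected to make tests deterministic."""
        ts = time.monotonic() if now is None else now
        self._last_seen[node_id] = (ts, {"cpu": free_cpu, "mem_mb": free_mem_mb})

    def active_nodes(self, now: Optional[float] = None) -> Dict[str, dict]:
        """Return the resources of every node whose heartbeat is still fresh."""
        ts = time.monotonic() if now is None else now
        return {
            node: resources
            for node, (seen, resources) in self._last_seen.items()
            if ts - seen <= self.timeout_s
        }
```

A scheduler would then pick placement targets only from `active_nodes()`, so a worker that stops sending heartbeats silently falls out of the candidate set.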

When a user requests an isolated environment, an external API authorizes the request and pushes the task onto the Redis queue. The master consumes the task from the queue and creates a container on the right worker node. The master node uses the Arc agent to start/stop containers on the worker nodes.

Then the Arc agent uses the Docker daemon to start a container with either gVisor or Kata Containers as the container runtime.

A common approach is to distribute the workload in a balanced fashion, essentially a round-robin. But I chose to distribute the workloads in a concentrated fashion, for effective resource utilization.
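
To make the difference concrete, here is a small sketch that contrasts the two placement strategies. It assumes each node simply advertises a number of free container slots; the function names are hypothetical and this is not Arc's actual scheduler.

```python
from typing import Callable, Dict, List, Optional


def pick_concentrated(free_slots: Dict[str, int]) -> Optional[str]:
    """Pick the fullest node that still has room (bin-packing style)."""
    candidates = [n for n, free in free_slots.items() if free > 0]
    if not candidates:
        return None
    return min(candidates, key=lambda n: (free_slots[n], n))


def pick_spread(free_slots: Dict[str, int]) -> Optional[str]:
    """Pick the emptiest node, which balances load across the cluster."""
    candidates = [n for n, free in free_slots.items() if free > 0]
    if not candidates:
        return None
    return max(candidates, key=lambda n: (free_slots[n], n))


def place(free_slots: Dict[str, int], count: int,
          picker: Callable[[Dict[str, int]], Optional[str]]) -> List[str]:
    """Place `count` containers one by one and return the chosen nodes."""
    slots = dict(free_slots)  # work on a copy so the input is not mutated
    placements: List[str] = []
    for _ in range(count):
        node = picker(slots)
        if node is None:
            break  # cluster is full
        slots[node] -= 1
        placements.append(node)
    return placements
```

With three empty nodes of three slots each, placing three containers with `pick_concentrated` fills a single node, while `pick_spread` touches all three. The concentrated variant leaves whole nodes idle, which is what lets a capped auto-scaling group scale them in.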

Additionally, I planned to group uncorrelated workloads together on the same node, since correlated workloads tend to spike at the same time, statistically speaking. I stole this idea from AWS Lambda.

The key takeaways from this architecture are as follows.

I never implemented the entire architecture. I gave up halfway through implementing Raft consensus. I realised that I was spending more time building this distributed system than my actual product. The architecture was also finicky, so for the next attempt I wanted to use an existing orchestration solution. And I'm sure your reaction to this would be basically "BRUHHH".

Attempt #3 - Docker Swarm

This time I wanted to use Kubernetes, but I liked the simplicity of setting up Docker Swarm, so I went with it. I achieved isolation and speed as I described previously, but this time I used nsjail for isolation.

Here's the architecture.

Figure: Docker Swarm Isolation at Scale

The architecture works similarly to my Arc implementation. The one extra component is the Traefik edge proxy. It has built-in service discovery, so it can route traffic to the right container based on the subdomain. It also handles SSL.
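
Subdomain routing with Traefik's Docker provider is typically driven by labels attached to each service. The helper below sketches how such labels could be generated per user environment. It assumes Traefik v2-style label names; the post does not say which version or which exact labels were used, and the function name and example values are hypothetical.

```python
from typing import Dict


def traefik_labels(name: str, subdomain: str, port: int) -> Dict[str, str]:
    """Build Traefik v2-style labels that route `subdomain` to a service port."""
    return {
        # Opt this service in to Traefik's service discovery.
        "traefik.enable": "true",
        # Match incoming requests on the Host header.
        f"traefik.http.routers.{name}.rule": f"Host(`{subdomain}`)",
        # Terminate TLS at the edge proxy.
        f"traefik.http.routers.{name}.tls": "true",
        # Tell Traefik which container port to forward to.
        f"traefik.http.services.{name}.loadbalancer.server.port": str(port),
    }


labels = traefik_labels("user42-lab", "user42.labs.example.com", 8080)
```

In Swarm mode these labels would go on the service rather than on individual containers, so the orchestrator only needs to set them once when it creates the environment.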

Also notice the use of DynamoDB. It's mainly used for storing the orchestration data.

So here are the takeaways from this architecture.

I implemented this architecture and had a working prototype. It was good enough for the MVP. But I had a change of mind thinking I could make it better. So here's my last attempt.

Final Attempt - Nomad

I'll keep this one short. Instead of Docker Swarm, I chose Nomad, simply because it's easier to set up and maintain. A single binary is all it takes. It's also more flexible in the types of workload it can run. It supports Docker, QEMU, raw processes and heck, there's even a Java task driver.

So I could use a microVM like Firecracker for isolation. It's a project by Amazon, and it's used in their AWS serverless platform.

Additionally, I chose to use Consul for service discovery. As a bonus it also provides health checks and a key-value store.

Here's the final architecture.

Figure: Nomad Isolation at Scale

It's similar to the Docker Swarm architecture, but this time Traefik uses the Consul catalog for service discovery, essentially offloading service discovery to Consul.

The takeaways from the final architecture are as follows.

I never implemented this architecture, but I'm pretty sure it would work.

Devils are in the Details

I quickly went over all the architectures, but there are a lot of details that I skipped. I'll try to cover them here.

Scaling

Although there aren't many performance benchmarks on scaling Docker Swarm, I could run ~330 containers per node with 2 vCPUs and 4GB of RAM. At one point I had about 8 worker nodes and 3 manager nodes. Since user workloads only run on worker nodes, I could run ~2200 containers comfortably.

Remember these are very small whoami HTTP services, not heavy workloads. Also, these were idle containers, so I do not have benchmarks under traffic load.
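
For a quick sanity check on those numbers, the arithmetic can be written out as below. Reading the gap between the per-node ceiling and the comfortable total as deliberate headroom is my assumption, not something measured here.

```python
containers_per_node = 330     # observed per-node figure (2 vCPUs, 4GB RAM)
worker_nodes = 8              # manager nodes do not run user workloads
comfortable_total = 2200      # cluster-wide figure quoted above

# Upper bound if every worker ran at the per-node figure.
theoretical_max = containers_per_node * worker_nodes

# Fraction of that bound used by the "comfortable" total.
utilisation = comfortable_total / theoretical_max

print(theoretical_max)          # 2640
print(round(utilisation, 2))    # 0.83
```

In other words, the comfortable figure corresponds to running the workers at roughly five sixths of what they could hold.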

Apparently Nomad can scale to 2 million containers with 6100 hosts, so maybe it's a better choice? I'll leave that to you.

Also, I'm not sure how well Traefik can handle the sheer number of containers, because I had some service discovery issues during development.

Routing mesh and Edge proxy

The load balancer is the entry point to the system. In the case of Docker Swarm, I used the Elastic Load Balancer (ELB), which pointed to all nodes of the swarm. When traffic from the ELB hits the swarm's routing mesh, it is automatically routed to my Traefik instances. Traefik then routes the traffic to the right container based on the subdomain.

Network Security

The above mentioned OCI runtime solutions don't necessarily provide isolation at the network level. An attacker could still talk to cloud APIs like 169.254.169.254 to retrieve the instance metadata, or talk to the DNS resolver for the VPC at 169.254.169.253, and so on. So to isolate at the network level, we could use iptables to restrict network access.
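
As an illustration, the snippet below builds the iptables commands that would drop traffic to those two link-local addresses. It only constructs the argument lists and does not execute anything. The `OUTPUT` chain is an assumption that fits rules applied inside the sandbox itself; rules applied on the host for container traffic would likely belong in a different chain.

```python
from typing import Iterable, List

# Link-local endpoints mentioned above: instance metadata and the VPC DNS resolver.
BLOCKED_ADDRESSES = ["169.254.169.254", "169.254.169.253"]


def egress_block_rules(addresses: Iterable[str],
                       chain: str = "OUTPUT") -> List[List[str]]:
    """Return one `iptables -A <chain> -d <addr> -j DROP` command per address."""
    return [
        ["iptables", "-A", chain, "-d", address, "-j", "DROP"]
        for address in addresses
    ]


for command in egress_block_rules(BLOCKED_ADDRESSES):
    print(" ".join(command))
```

Each list can be passed to `subprocess.run` as-is by whatever entrypoint sets up the sandbox. Note that blocking the VPC resolver means workloads need another way to resolve names, if they need DNS at all.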

For iptables, the CAP_NET_RAW capability is required. With gVisor, you can add it to your runtime via runtimeArgs in /etc/docker/daemon.json.

{
  "default-runtime": "runsc",
  "runtimes": {
    "runsc": {
      "path": "/usr/local/bin/runsc",
      "runtimeArgs": ["--net-raw", "--overlay"]
    }
  }
}

Streaming service logs

I wanted to stream the terminal logs to the user in real-time. To achieve this, I streamed logs from the container to the user via websockets and used xterm.js as the terminal emulator.

Queueing tasks

Redis is probably not the best queueing solution out there, but I chose it since I was familiar with it. An alternative to queueing and consuming tasks could be Temporal workflows, but I didn't get a chance to try them out.

Caching

The container images sit on a private container registry like AWS ECR, and necessary files/scripts are cached on a network file system like AWS EFS.

Background workers

I had a bunch of background workers for cleaning up stale containers. I had also set up a job that dealt with hard and soft limits. Soft limits were used to warn users that their container was about to be terminated after being idle for a while. Hard limits were used to terminate the container after a long idle time. Since these were stateless containers, I could terminate them anytime and bring them back up quickly when needed.
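
The soft and hard limit logic boils down to a small decision function like the one sketched here. The 15 and 30 minute thresholds and all names are placeholders chosen for illustration; the post does not give the real values.

```python
from enum import Enum


class Action(Enum):
    KEEP = "keep"
    WARN = "warn"
    TERMINATE = "terminate"


def idle_action(idle_seconds: float,
                soft_limit_s: float = 15 * 60,
                hard_limit_s: float = 30 * 60) -> Action:
    """Decide what the cleanup job should do with a container."""
    if idle_seconds >= hard_limit_s:
        # Idle for too long: reclaim the resources.
        return Action.TERMINATE
    if idle_seconds >= soft_limit_s:
        # Getting stale: warn the user before anything is removed.
        return Action.WARN
    return Action.KEEP
```

A periodic worker would compute each container's idle time from its last observed activity, call `idle_action`, and then either notify the user or stop the container.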

Observability

I used Prometheus for monitoring and Grafana for visualization. Additionally, a few scripts monitored the health and performance of the system, and notified me via Slack.

I also used appropriate labels like user_id at the container and service level, so I could easily identify and filter services/containers for debugging.

WebAssembly

A different approach to isolation at scale would be to use WebAssembly. So instead of running a container or a microVM on the server side, I could port the application to WebAssembly and run it on a server or even in a browser. There are ports of Python and a bunch of projects like webcontainers and v86 that could help facilitate this approach. But it's still new tech, and I'm hoping to see some future developments in this space.

Conclusion

As you can tell, I really enjoyed building systems and iterating on them for performance. The good thing was that I learnt a few things along the way, but the downside was that I wore the engineering hat for too long. As a founder I should move fast, but I didn't. Anyway, I hope this was useful. If you have feedback or criticism, feel free to reach out to me anytime.
