Cloud & DevOps

Sub-Millisecond VM Sandboxes via Copy-on-Write

Starting a Docker container takes about 500 milliseconds. A Firecracker microVM takes about 125 milliseconds. A V8 isolate takes about 5 milliseconds. But a new breed of lightweight sandboxes, using copy-on-write memory forking, can spin up an isolated execution environment in under 1 millisecond — often in the 50-200 microsecond range. That's fast enough to create a fresh sandbox for every single function call.

This isn't just an incremental improvement. It's a qualitative change in what sandboxing can do. When sandbox creation costs 500ms, you create them sparingly and reuse them. When it costs 50μs, you create them for every untrusted input, every plugin invocation, every user request. The security model shifts from 'isolate tenants' to 'isolate individual operations.'

What Copy-on-Write Actually Means

Copy-on-write (CoW) is an operating system technique where you create a 'copy' of a memory region without actually copying any data. Both the original and the copy point to the same physical memory pages, marked as read-only. They're indistinguishable — both see identical data. The actual copy only happens when one of them tries to write to a page. At that point, the kernel intercepts the write, copies just that single page, and lets the write proceed on the copy.

Unix's fork() system call has used CoW for decades. When you fork a process, the child gets a complete copy of the parent's memory — but thanks to CoW, no data is actually copied. If the child immediately calls exec() (as is typical), it replaces its memory image entirely and the CoW mappings are simply released, making the fork nearly free.

// Classic fork() — copy-on-write in action
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

static int shared_data[1024];   // roughly one 4 KB page of data

int main(void) {
    shared_data[0] = 1;
    pid_t pid = fork();
    // At this point:
    // - Parent and child share ALL physical memory pages
    // - Both have identical virtual address spaces
    // - Writable pages are marked read-only in both
    // - Zero data has been copied
    if (pid == 0) {
        // Child process
        // Reading shared_data is free — same physical page
        int value = shared_data[0];
        (void)value;
        // Writing triggers CoW — only THIS page gets copied
        shared_data[0] = 42;
        // Now the child has its own copy of this one page;
        // the parent's page is unchanged
        _exit(0);
    }
    waitpid(pid, NULL, 0);
    printf("parent still sees %d\n", shared_data[0]);   // prints 1
    return 0;
}

From Fork to Sandbox

The CoW sandbox insight is: instead of booting a fresh VM or container from scratch, pre-boot a 'template' environment (with runtime, libraries, and initial state loaded), then fork it using CoW to create instant copies. Each copy starts in exactly the state where the template left off — fully initialized, ready to execute — but runs in its own isolated memory space.

The performance difference is enormous. Traditional VM startup involves: load a kernel, initialize hardware, mount filesystems, start the init system, load application code, initialize the runtime. Even with aggressive optimization (Firecracker strips this down significantly), you're still doing hundreds of milliseconds of initialization work.

CoW forking skips all of that. The template has already done the initialization. The fork creates a pre-initialized copy in microseconds. The 'startup cost' is just the kernel bookkeeping for creating a new address space and duplicating page table entries — work proportional to the size of the page tables, not to the amount of data they map.

Where This Changes the Game

Serverless Functions

Cold starts are the bane of serverless computing. AWS Lambda takes 100-500ms for a cold start, more for JVM-based runtimes. This is unacceptable for latency-sensitive workloads, which forces users to keep instances warm (defeating the purpose of serverless) or accept unpredictable latency.

With CoW sandboxes, cold starts drop to sub-millisecond. Every invocation can be a 'cold start' because cold starts are essentially free. No need for warm pools, no memory waste from idle instances, no stale state between invocations. Each function execution gets a pristine, isolated environment without paying the initialization cost.

Plugin and Extension Systems

Running untrusted plugins safely is one of the hardest problems in software. Browsers solved it for JavaScript with V8 isolates. But for arbitrary code — compiled extensions, scripting languages, binary plugins — isolation options have been limited to containers (too slow for per-request isolation) or WebAssembly (limited ecosystem and language support).

CoW sandboxes offer a middle path: run arbitrary code in an isolated copy of the host environment, with sub-millisecond setup and teardown. The plugin sees a full OS environment (filesystem, network, libraries) but its modifications are contained — when the sandbox exits, all changes vanish. This is ideal for user-submitted code in CI systems, notebook environments, and build tools.

Security Isolation

When processing untrusted input — parsing an uploaded PDF, rendering user-provided HTML, executing a database query — running the operation in an isolated sandbox limits the blast radius of any exploit. If the PDF parser has a buffer overflow, the attacker gains control of a disposable sandbox that's about to be destroyed, not the application server.

This approach — process isolation for every untrusted operation — has been impractical with traditional sandboxing because the overhead exceeded the processing time. If parsing a PDF takes 10ms, spending 500ms on container creation makes no sense. But spending 100μs on CoW sandbox creation is trivially cheap.

The Implementation Details

Building a practical CoW sandbox system requires solving several problems beyond just calling fork().

  • Memory accounting. CoW makes memory usage ambiguous. If a template uses 1 GB and you fork 100 copies that each modify 10 MB, the physical memory usage is ~2 GB (1 GB shared + 100 × 10 MB unique), not 100 GB. The kernel tracks shared vs. private pages, but getting accurate per-sandbox usage requires parsing /proc/[pid]/smaps.
  • Filesystem isolation. CoW handles memory, but filesystem writes need separate isolation. Overlay filesystems (overlayfs) provide CoW semantics for files: the sandbox sees the template's filesystem, but writes go to a separate layer. On sandbox exit, the overlay is discarded.
  • Network isolation. Each sandbox needs its own network namespace to prevent interference. Linux namespaces provide this, but creating network namespaces has measurable overhead. Some systems reuse a pool of pre-created namespaces.
  • Resource limits. A sandbox that allocates unbounded memory or consumes unlimited CPU is a denial-of-service vector. cgroups provide resource limits (memory, CPU, I/O), but cgroup creation/destruction adds overhead. Again, pooling helps.
  • Deterministic cleanup. When a sandbox exits, all its resources — memory, file descriptors, network connections, IPC objects — must be reliably cleaned up. PID namespaces help: kill the namespace's init process and all descendants are killed.

CoW vs WebAssembly Sandboxes

WebAssembly (Wasm) is the other major lightweight sandboxing technology. It's worth comparing the two because they make fundamentally different trade-offs.

Wasm sandboxes run code in a memory-safe virtual machine with a linear memory model. The sandbox can't access anything outside its linear memory — no filesystem, no network, no system calls (unless explicitly provided through WASI). This is extremely secure but constraining: existing code must be recompiled to Wasm, and not all languages compile to Wasm effectively.

CoW sandboxes run native code in an isolated OS environment. The sandbox has access to a full OS interface (possibly restricted by seccomp filters), can run any binary, and uses normal system libraries. This is less restrictive but less secure — the isolation boundary is the OS process model, which has a larger attack surface than Wasm's minimal VM.

Choose Wasm when: you control the code being sandboxed, your workload compiles cleanly to Wasm, and you need the strongest possible isolation. Choose CoW sandboxes when: you need to run arbitrary existing binaries, your workload needs OS-level capabilities (filesystem, network, child processes), and you want to prioritize compatibility over minimal attack surface.

The Catch: Fork in Multithreaded Programs

There's a well-known pitfall with fork(): it duplicates the whole address space but only the calling thread. If the parent has 20 threads, the child gets one. Any mutexes held by the other 19 threads are still marked as held in the child's memory, but the threads that held them don't exist there. The child will deadlock the first time it tries to acquire one of those mutexes.

CoW sandbox systems work around this by ensuring the template process is single-threaded at the moment of forking. This typically means: initialize everything in the template (load libraries, set up the runtime, prepare initial state), then stop all threads except the main one, fork, and let each child respawn threads as needed. The initialization cost is paid once; the fork avoids the threading hazard.

Some newer approaches use userfaultfd or custom page fault handlers to implement CoW-like semantics without relying on fork() at all. These avoid the multithreading issue but add complexity and require more kernel-level coordination.

What to Watch

Sub-millisecond sandboxing is still early, but the building blocks are solid (fork, namespaces, cgroups, overlayfs are all mature). The systems being built on top — for serverless computing, CI/CD, and secure code execution — are proving that per-operation isolation is practical at scale. As these tools mature, the assumption that sandboxing is expensive will become as outdated as the assumption that garbage collection is too slow for real-time applications. The overhead is vanishing, and the security benefits of 'sandbox everything' are becoming hard to ignore.