
What we learned building safe code execution on Cloudflare

Tags: Code, AI, Cloudflare, Debugging, Taskless
[Image: a sandbox with a shield and a green checkmark in the codedrift signature green]

Taskless generates runtime rules: small checks that go beyond static analysis to confirm a codebase actually behaves the way it should. Some of those checks are model-generated code, and model-generated code has to run somewhere you trust it not to do damage. We built that somewhere on Cloudflare Containers and the Sandbox SDK. It's live today, rolling out slowly so we can get it right.

What's not going to be in the official blog post is the 18 to 20 hours I lost closing the gap between the Cloudflare docs and reality. This post is most of that gap written down, so the next person building on Containers spends those hours shipping instead of bisecting.

I don't want this to be just about the gotchas, though. There was a lot of work we didn't have to do because of the Cloudflare platform: isolation, scale-to-zero billing, a real Linux box at the edge, a lifecycle we never had to operate. Containers gave us all of it. The rough edges below are real, and Containers still feel unpolished next to Workers and Durable Objects. Even so, we still think Cloudflare is the right choice for what we're building.

The build that would not build

My rabbit hole started with a single line in a Dockerfile. We use ast-grep for the static-analysis layer (more on why later), so the generator image installs the CLI:

RUN npm install -g @ast-grep/cli@0.41.1 tsx@4.19.4 typescript@5.9.3 @types/node@22.19.13

On Cloudflare's build, that step hung for about eight and a half minutes and then died:

npm error code EAI_AGAIN
npm error syscall getaddrinfo
npm error errno EAI_AGAIN
npm error request to https://registry.npmjs.org/@ast-grep%2fcli failed,
reason: getaddrinfo EAI_AGAIN registry.npmjs.org

EAI_AGAIN is what getaddrinfo returns for a temporary failure in name resolution: the build environment couldn't resolve registry.npmjs.org for long enough to pull the package. It read like a transient network blip, so the first few times I kicked the can and retried. It kept coming back.

That one issue accounted for more than half the time I spent chasing documentation. Only when I stood up a minimal standalone Dockerfile and tried to pull anything at all via npm did I consider that the problem might not be anything I was doing. My best read is that the failure lives somewhere in Cloudflare's container build path and isn't documented anywhere I could find. It might be written down somewhere. I never found it, and I looked hard.

The fix was to stop building the image on Cloudflare and move the whole thing, image build plus wrangler deploy, into GitHub Actions, where installing ast-grep is a non-event. Running our Docker assembly on GitHub Actions has its own problems for hardening, and we had to do a pass to pin SHAs for every tool required in the pipeline. It was good hygiene. It was just stuff we didn't plan on having to do.

The part that stung most came at the end. wrangler deploy --dry-run reported everything green. The real wrangler deploy then fell apart, because the dry run doesn't emulate the full deploy path. There's a special flavor of pain in a tool telling you it's safe to proceed and then failing the moment you actually proceed. If you take one operational habit from this post: don't trust --dry-run as proof, trust a real deploy to a throwaway environment. The release was rough (for me), but nothing ever went down. My Twitch community even came along for the ride as we deployed our way through integration hell that Thursday afternoon.

The activity timer is the thing you're actually fighting

A Container is a Durable Object that runs a Docker image. Hold onto that sentence, because almost every surprise below is downstream of it. The Sandbox SDK (@cloudflare/sandbox) is a friendlier API on the same primitive, and it inherits every platform behavior underneath without hiding any of it.

A Container stays alive because a timer in the Durable Object's JavaScript layer keeps getting renewed. Our first foray into sandboxes tripped on exactly this. container.fetch() renews the timer, because the HTTP request stays in flight through the DO's JS layer. container.start() is fire-and-forget: the timer counts down immediately and sleepAfter can kill the container before the work you started ever finishes. Model container work as request-response over HTTP and pass config in the fetch body, and the whole class of problem goes away. This is also how most of the Sandbox SDK execution commands work, once you dig into the SDK code.
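Here is a minimal sketch of that request-response shape from the Worker side. The ContainerStub and RunConfig types, the /run path, and the field names are all hypothetical stand-ins so the example runs on its own; treat it as an illustration of the pattern, not the real Container API surface.

```typescript
// Minimal structural types so the sketch runs without the Cloudflare runtime.
// `ContainerStub` is a hypothetical stand-in for the stub your Worker holds.
interface ContainerResponse {
  ok: boolean;
  status: number;
  text(): Promise<string>;
}

interface ContainerStub {
  fetch(
    url: string,
    init: { method: string; headers: Record<string, string>; body: string },
  ): Promise<ContainerResponse>;
}

// Hypothetical per-run config; the field names are illustrative only.
interface RunConfig {
  checkId: string;
  timeoutMs: number;
}

// Send the config in the request body and await the response. The request
// stays in flight for the whole run, so there is no separate start() call
// left racing against sleepAfter.
async function runInContainer(container: ContainerStub, config: RunConfig): Promise<string> {
  const response = await container.fetch("http://container/run", {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify(config),
  });
  if (!response.ok) {
    throw new Error(`container run failed with status ${response.status}`);
  }
  return response.text();
}
```

The design choice is that the work and the request share one lifetime: if the response hasn't come back, the work isn't done, and nothing has been left running unattended.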

From there it gets subtle, and here the platform's own docs led people wrong. A WebSocket looks like a long-lived connection that should hold the container open. It doesn't. After the upgrade, frames proxy at the TCP layer beneath the DO's JS, so the activity timer never sees them and the container sleeps with a live socket attached (containers #147). Busy CPU doesn't save you either: a community FFmpeg transcode pinned at 100% got a SIGTERM at 8 to 10 minutes mid-job (containers #138), because compute inside the container is invisible to the DO. A busy container isn't a kept-alive container. Long jobs need an out-of-band heartbeat and need to be resumable.
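One way to read "heartbeat plus resumable" is sketched below. Everything here is an assumption for illustration: the CheckpointStore interface, where the checkpoint lives, and what the heartbeat callback actually does (in practice it would need to be something the Durable Object can see, which CPU load inside the container is not).

```typescript
// Hypothetical checkpoint store. In practice this would live outside the
// container, since the container can be killed mid-job.
interface CheckpointStore {
  load(): Promise<number>; // index of the next unit of work
  save(next: number): Promise<void>;
}

// Process work in small units, record progress after each one, and call
// `heartbeat` so liveness is signalled by something other than the
// container's own CPU activity.
async function runResumable<T>(
  items: T[],
  work: (item: T) => Promise<void>,
  store: CheckpointStore,
  heartbeat: () => Promise<void>,
): Promise<void> {
  for (let i = await store.load(); i < items.length; i++) {
    await work(items[i]);
    await store.save(i + 1); // a restart resumes here instead of at zero
    await heartbeat();
  }
}
```

If the container gets a SIGTERM partway through, calling runResumable again with the same store picks up at the first unfinished item. The unit of work has to be small enough that losing one in flight is cheap.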

The SDK stacks its own limit on top, and this one we tripped ourselves. Our first static-analysis prototype did a git checkout and ran a Claude CLI command inside the sandbox. Claude isn't known for its speed, and we were definitely retrying generation stages, because we'd get a closed connection in the middle of execution. The cause: the SDK's WebSocket transport caps a single operation at 120000ms (sandbox-sdk #398), so a long execStream dies with timeout after 120000ms. We reached for WebSockets in the first place because the HTTP transport spends one Workers subrequest per exec, so chatty sessions blow the 50-subrequest cap. HTTP fails on frequency, WebSockets fail on duration. We ended up picking timeout-based options and making sure anything we ran fit inside two minutes.
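A rough sketch of the "fit inside two minutes" rule as code. The 120000ms figure is the cap described above; the safety margin and the withDeadline helper are our own illustrative choices, not part of the SDK.

```typescript
// The 120000ms figure is the WebSocket transport cap described above.
// The safety margin is an arbitrary illustrative choice.
const TRANSPORT_CAP_MS = 120_000;
const SAFETY_MARGIN_MS = 10_000;
const OPERATION_BUDGET_MS = TRANSPORT_CAP_MS - SAFETY_MARGIN_MS;

// Reject if `operation` has not settled within `budgetMs`, so the caller
// fails on its own terms instead of on a closed connection.
function withDeadline<T>(
  operation: Promise<T>,
  budgetMs: number = OPERATION_BUDGET_MS,
): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<never>((_resolve, reject) => {
    timer = setTimeout(
      () => reject(new Error(`operation exceeded ${budgetMs}ms budget`)),
      budgetMs,
    );
  });
  // Whichever settles first wins; the timer is always cleaned up afterwards.
  return Promise.race([operation, deadline]).finally(() => {
    if (timer !== undefined) clearTimeout(timer);
  });
}
```

Note that this only stops the caller from waiting. It does not cancel the underlying operation, so the work itself still has to be sized to finish inside the budget.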

Silence is the default failure mode

The platform's worst habit is reporting success while the real work is broken, with nothing in the logs to catch it.

The one that cost us directly was environment variables. envVars looks like the config channel and isn't. Values set on the Container class or through start({ envVars }) don't reliably reach os.environ or process.env. Docker trained all of us to configure containers through the environment, so our first prototype did exactly that and quietly ran with empty config. The fix is to configure per request: send JSON in the fetch body and read it on every invocation. That change turned out to fit how Taskless actually works, since we run these checks without scanning the customer's whole codebase, so per-request config was the right shape anyway.
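On the container side, per-request config can look roughly like this. The CheckConfig shape and its field names are hypothetical; the part that matters is that a missing value throws instead of quietly becoming an empty string the way an unset environment variable does.

```typescript
// Hypothetical config shape; the field names are illustrative only.
interface CheckConfig {
  fixturePath: string;
  maxAttempts: number;
}

// Parse and validate the config sent in the request body. Failing loudly on
// a missing field is the point: an empty environment failed silently.
function parseCheckConfig(rawBody: string): CheckConfig {
  const parsed: unknown = JSON.parse(rawBody);
  if (typeof parsed !== "object" || parsed === null) {
    throw new Error("config must be a JSON object");
  }
  const { fixturePath, maxAttempts } = parsed as Record<string, unknown>;
  if (typeof fixturePath !== "string" || fixturePath.length === 0) {
    throw new Error("config.fixturePath must be a non-empty string");
  }
  if (typeof maxAttempts !== "number" || !Number.isInteger(maxAttempts) || maxAttempts < 1) {
    throw new Error("config.maxAttempts must be a positive integer");
  }
  return { fixturePath, maxAttempts };
}
```

Calling this at the top of every request handler means a bad config surfaces as an error response on that request, not as a run that "succeeded" with nothing configured.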

A few more failures in the same family, reported by others and worth knowing before they cost you a day:

  • The default lite instance (256MB, 1/16 vCPU) silently OOM-kills with no crash log; the instance just goes inactive. It never bit us, but sandbox performance was genuinely bad until we optimized what the sandbox was doing, so don't assume the smallest box is a free lunch. Reach for standard-2 for anything memory-heavy.
  • Setting max_concurrency above 1 silently fails 20 to 30 percent of starts (workerd #5996). start() returns success, running reports true, and the entrypoint never runs.
  • killProcess() reports success while killing only the bash wrapper, orphaning the real process (sandbox-sdk #338).

The theme across all of these is that the platform's own success signal is structurally incapable of detecting the failure. Bring your own external verification. A cheap exec('echo alive', { timeout: 5000 }) liveness probe tells you more than running === true ever will.
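A sketch of that probe, wrapped so it returns a plain boolean. The SandboxLike and ExecResult types are a hypothetical minimal slice modelled for this example; the real Sandbox SDK handle is richer, so check the field names against the SDK version you're on.

```typescript
// Hypothetical minimal shape of a sandbox handle. Only the pieces this
// probe relies on are modelled here.
interface ExecResult {
  exitCode: number;
  stdout: string;
}

interface SandboxLike {
  exec(command: string, options: { timeout: number }): Promise<ExecResult>;
}

// Treat the sandbox as alive only if a real command runs and echoes back.
// Any throw (timeout, closed connection) counts as not alive.
async function isAlive(sandbox: SandboxLike, timeoutMs = 5000): Promise<boolean> {
  try {
    const result = await sandbox.exec("echo alive", { timeout: timeoutMs });
    return result.exitCode === 0 && result.stdout.trim() === "alive";
  } catch {
    return false;
  }
}
```

The probe checks the output as well as the exit code, because the failures above are exactly the kind where a status flag says yes and nothing actually ran.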

The container is walled off, in both directions

A Container's only ingress is Worker HTTP. No public IP, no raw TCP. Egress back out is the sharp part, and it shaped our architecture directly. A container can't call a Worker, whether on a custom domain or on .workers.dev. We found that boundary early and didn't fight it. Instead we switched to a one-way flow: the container writes its results to the filesystem, and we read them back off the response body of the container.fetch() call the Worker is already holding open. No callback, no second connection, nothing to get blocked.
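The container half of that one-way flow might look something like this. The CheckResult shape, the temp-directory layout, and the runAndReport name are assumptions for illustration; the returned string is what the container's HTTP handler would send back as the response body.

```typescript
import { mkdtemp, readFile, writeFile } from "node:fs/promises";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Hypothetical result shape; the field names are illustrative only.
interface CheckResult {
  passed: boolean;
  details: string;
}

// Run the check, write its result to local disk, and return the same bytes
// for the HTTP handler to send as the response body. Nothing here opens an
// outbound connection.
async function runAndReport(runCheck: () => Promise<CheckResult>): Promise<string> {
  const dir = await mkdtemp(join(tmpdir(), "check-"));
  const resultPath = join(dir, "result.json");
  await writeFile(resultPath, JSON.stringify(await runCheck()), "utf8");
  return readFile(resultPath, "utf8");
}
```

Because the result rides back on the request the Worker already has open, the container never needs an address for the Worker at all.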

Worth knowing if you try to push on that boundary anyway: a plain container-to-Worker request returns a clean 403, but a streaming-body PATCH returns a bare "internal error" before the request reaches the container, and a streaming POST returns 500 (containers #116, #98). Those look like transient 5xx faults and invite pointless retries. They're hard architectural blocks. Related edges others have hit: a *.trycloudflare.com tunnel buffers text/event-stream end to end so SSE never flushes on a live stream (WebSockets tunnel cleanly), and R2 reads reportedly fail past 10 to 15MB regardless of client (containers #137).

What we shipped, and why the constraints helped

Here's the workload all of this was in service of. When the generator produces a runtime rule, it emits a check.ts plus ast-grep capture rules, and we run that generated check against fixtures inside the sandbox. It iterates until the check passes its fixture, which proves the rule catches what it's supposed to catch on real examples. Then we sign the check.ts so it can't be tampered with after the fact.
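In outline, the iterate-until-it-passes step is a bounded loop. This is a simplified sketch, not our generator: the Candidate and FixtureOutcome types, the attempt cap, and feeding the failure back into the next generation are all assumptions made for the example.

```typescript
// Hypothetical shapes; these are not the generator's real types.
interface Candidate {
  source: string;
}

interface FixtureOutcome {
  passed: boolean;
  feedback: string;
}

// Generate a check, run it against the fixture, and repeat with the failure
// as feedback until it passes or the attempt budget runs out.
async function generateUntilPassing(
  generate: (feedback: string | null) => Promise<Candidate>,
  runAgainstFixture: (candidate: Candidate) => Promise<FixtureOutcome>,
  maxAttempts: number,
): Promise<Candidate> {
  let feedback: string | null = null;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const candidate = await generate(feedback);
    const outcome = await runAgainstFixture(candidate);
    if (outcome.passed) return candidate; // only a passing check goes on to signing
    feedback = outcome.feedback;
  }
  throw new Error(`no passing check after ${maxAttempts} attempts`);
}
```

The cap matters as much as the loop: each attempt is a sandbox run, and each sandbox run has to fit the time limits described earlier.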

The sandbox is the point of least trust in the whole system, and that's deliberate. During that generation window we don't yet know what the generated code will do, so we give it nothing to abuse. The network is stripped, so there's no external access. There are no secrets. The only thing present is the fixture code under test. Even in a worst-case, the-model-wrote-something-malicious scenario, there's nothing on that box to take and nowhere to send it. We can't drive the risk to zero, but we can make sure a rogue check runs in an empty room.

The signature is what lets trust climb back up later. Because the check.ts was validated and signed in that clean, locked-down environment, the Taskless CLI can assert exactly what it intends to run and block anything tampered with. That's what will eventually let us turn network egress back on where a real check needs it, for a shallow clone during a Taskless Check, without reopening the risk we closed at generation time. Least trust when we know the least, earned trust once the signature guarantees integrity.
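To make the sign-then-verify idea concrete, here is a sketch using Ed25519 from Node's built-in crypto module. The choice of Ed25519, the function names, and the in-memory key pair are assumptions for illustration only; this post doesn't describe the actual signature scheme or key handling.

```typescript
import { generateKeyPairSync, sign, verify, KeyObject } from "node:crypto";

// Sign the exact bytes of the generated check. Ed25519 takes no separate
// digest algorithm, so the first argument is null.
function signCheck(source: string, privateKey: KeyObject): Buffer {
  return sign(null, Buffer.from(source, "utf8"), privateKey);
}

// Verify before running: any change to the source invalidates the signature.
function isUntampered(source: string, signature: Buffer, publicKey: KeyObject): boolean {
  return verify(null, Buffer.from(source, "utf8"), publicKey, signature);
}

// Usage: sign at generation time, verify at run time.
const demoKeys = generateKeyPairSync("ed25519");
const demoSource = "export const check = () => true;";
const demoSignature = signCheck(demoSource, demoKeys.privateKey);
console.log(isUntampered(demoSource, demoSignature, demoKeys.publicKey));
console.log(isUntampered(demoSource + " // edited", demoSignature, demoKeys.publicKey));
```

The first check should accept the original source and the second should reject the edited one. Only the verifying side needs the public key, so the signing key can stay where the generation happened.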

That's the shape Cloudflare is genuinely good at. A verification run spins up, reads code that already exists, emits a signed result, and tears down in seconds. None of the long-tail lifecycle failures land hard on bounded, stateless work. The workloads that fight the platform are the builder kind: durable filesystem, long-lived process, open-ended agentic loops. The platform guarantees none of that: disk is fully ephemeral, hosts restart without an uptime guarantee, and a runaway Claude Code loop can pin a CPU core with no graceful recovery and bill you for the runaway (sandbox-sdk #416). If your file operations can run against object storage instead of a real filesystem, route through a Worker and skip the sandbox. Ours can, so most of Taskless does exactly that, and the sandbox stays reserved for the narrow, bounded verification case it's good at.

A word on the ast-grep bet, since installing it is what sent me down this hole. We wanted one static-analysis tool portable enough to cover many languages, with the modularity to add new ones later. ast-grep sits on tree-sitter and gives us that: one cross-language sweep instead of a zoo of native-optimal linters, each with its own rules to maintain. It won't beat a purpose-built native tool like ruff on raw speed, but a single sweep across the whole codebase beats maintaining a dozen toolchains, and that trade is the whole point. Taskless also runs alongside the tools you already have: eslint, ruff, rubocop, and the rest.

So that's the version that won't make the official post: one undocumented DNS failure, a deploy pipeline I didn't plan to build, a Thursday afternoon of integration hell with my stream chat watching, and a verification sandbox that's now live and doing exactly what we needed. If you're pointing model-generated code at Cloudflare Containers, know which workload you're running before you commit, and build your tooling around the platform's real edges instead of the documented ones. Cloudflare has some pretty strong limitations, but staying within those lines gives you a startling amount for free.