Launch HN: Twill.ai (YC S25) – Delegate to cloud agents, get back PRs

danoandco 65 points 58 comments April 10, 2026
twill.ai · View on Hacker News

Hey HN, we're Willy and Dan, co-founders of Twill.ai ( https://twill.ai/ ). Twill runs coding CLIs like Claude Code and Codex in isolated cloud sandboxes. You hand it work through Slack, GitHub, Linear, our web app, or our CLI, and it comes back with a PR, a review, a diagnosis, or a follow-up question. It loops you in when it needs your input, so you stay in control. Demo: https://www.youtube.com/watch?v=oyfTMXVECbs

Before Twill, building with Claude Code locally, we kept hitting three walls:

1. Parallelization: two tasks that both touch your Docker config or the same infra files are painful to run locally at once, and manual port rebinding and separate build contexts don't scale past a couple of tasks.

2. Persistence: close your laptop and the agent stops. We wanted to kick off a batch of tasks before bed and wake up to PRs.

3. Trust: giving an autonomous agent full access to your local filesystem and processes is a leap, and a sandbox per task felt safer to run unattended.

All three pointed to the same answer: move the agents to the cloud and give each task its own isolated environment. So we built what we wanted. The first version was pure delegation: describe a task, get back a PR. Then multiplayer, so the whole team can talk to the same agent, each in their own thread. Then memory, so "use the existing logger in lib/log.ts, never console.log" becomes a standing instruction on every future task. Then automation: crons for recurring work, event triggers for things like broken CI.

This space is crowded. AI labs ship their own coding products (Claude Code, Codex), local IDEs wrap models in your editor, and a wave of startups build custom cloud agents on bespoke harnesses. We take a different path: reuse the lab-native CLIs in cloud sandboxes. The labs will keep pouring RL into their own harnesses, so those harnesses only get better over time. That way there's no vendor lock-in, and you can pick a different CLI per task or combine them.
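The "memory" feature amounts to prepending standing instructions to every task the agent receives. A minimal sketch of that idea, assuming a simple per-repo list of rules; all names here are hypothetical and this is not Twill's actual implementation:

```python
# Hypothetical sketch: per-repo "memory" entries prepended to every task prompt.
# Illustrative only, not Twill's actual implementation.

def build_task_prompt(task: str, memory: list[str]) -> str:
    """Combine a repo's standing instructions with the user's task description."""
    if not memory:
        return task
    rules = "\n".join(f"- {rule}" for rule in memory)
    return f"Standing instructions for this repo:\n{rules}\n\nTask:\n{task}"

repo_memory = [
    "use the existing logger in lib/log.ts, never console.log",
]
prompt = build_task_prompt("Add CSV import to the Rails app", repo_memory)
```

The point is that a rule stated once in any thread applies to every future task on that repo, without the user restating it.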
When you give Twill a task, it spins up a dedicated sandbox, clones your repo, installs dependencies, and invokes the CLI you chose. Each task gets its own filesystem, ports, and process isolation. Secrets are injected at runtime through environment variables. After a task finishes, Twill snapshots the sandbox filesystem, so the next run on the same repo starts warm with dependencies already installed. We chose this architecture because every time the labs ship an improvement to their coding harness, Twill picks it up automatically. We're also open-sourcing agentbox-sdk ( https://github.com/TwillAI/agentbox-sdk ), an SDK for running and interacting with agent CLIs across sandbox providers.

Here's an example: a three-person team assigned Twill a Linear backlog ticket about adding a CSV import feature to their Rails app. Twill cloned the repo, set up the dev environment, implemented the feature, ran the test suite, took screenshots, and attached them to the PR. The PR needed one round of revision, which they requested through GitHub. For more complex tasks, Twill asks clarifying questions before writing code and records a browser session video (using Vercel's Webreel) as proof of work.

Pricing: free tier with 10 credits per month (1 credit = $1 of AI compute at cost, no markup), no credit card required. Paid plans start at $50/month for 50 credits, with BYOK support on higher tiers. Free pro tier for open-source projects.

We'd love to hear how cloud coding agents fit into your workflow today, and if you try Twill, what worked, what broke, and what's still missing.
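The per-task lifecycle above (fresh sandbox, clone, install or restore from snapshot, invoke the CLI, snapshot again) can be sketched as plain control flow. This models the sequencing only; it is not agentbox-sdk's actual API, and every name is hypothetical:

```python
# Hypothetical sketch of the per-task lifecycle: fresh sandbox, clone, install
# (or restore a warm snapshot), run the chosen CLI, then snapshot the filesystem.
# Control flow only; not agentbox-sdk's actual API.

snapshots: dict[str, str] = {}  # repo -> snapshot id of a warmed filesystem

def run_task(repo: str, cli: str, prompt: str) -> dict:
    steps = ["create_sandbox"]               # isolated fs, ports, processes
    if repo in snapshots:
        steps.append(f"restore:{snapshots[repo]}")  # warm start
    else:
        steps.append("clone_repo")
        steps.append("install_deps")         # cold start: full setup
    steps.append(f"invoke:{cli}")            # secrets injected via env vars
    snapshots[repo] = f"snap-{repo}"         # snapshot for the next run
    steps.append("snapshot_fs")
    return {"repo": repo, "steps": steps}

cold = run_task("acme/app", "claude-code", "add CSV import")
warm = run_task("acme/app", "codex", "fix flaky test")
```

The second run on the same repo skips clone and install, which is where the "starts warm" latency win comes from.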

Discussion Highlights (16 comments)

Mr_P

How does this compare to Claude Managed Agents?

hmokiguess

> Run the same agent n times to increase success rate. Are there benchmarks out there that back this claim?

hardsnow

I've been developing an open-source version of something similar[1] and have used it quite extensively (well over 1k PRs)[2]. I'm definitely a believer in the "prompt to PR" model. It's very liberating to not have to think about managing agent sessions. It seems you've built a lot of useful tooling (e.g., session videos) around this core idea. A couple of learnings to share that I hope could be of use:

1) Execution sandboxing is just the start. For any enterprise usage you want fairly tight network egress control as well, to limit the chances of accidental leaks or malicious exfiltration if there's any risk of untrusted material getting into model context. Speaking as a decision maker at a tech company: we do actually review stuff like this when evaluating tools.

2) Once you have proper network sandboxing, you can secure credentials much better: give the agent only dummy surrogates and swap them for real creds on the way out.

3) Sandboxed agents with automatic provisioning of a workspace from git can be used for more than just development tasks. In fact, it might be easier to find initial traction with more constrained, and thus more predictable, tasks, e.g. "ask my codebase" or "debug CI failures".

[1] https://airut.org [2] https://haulos.com/blog/building-agents-over-email/
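The credential-surrogate idea in point 2 can be sketched as a rewrite step at the egress boundary: the agent only ever sees placeholder tokens, and a proxy outside the sandbox substitutes the real credentials on outbound requests. A minimal sketch with hypothetical names; a real version would live in an HTTPS egress proxy:

```python
# Sketch of the surrogate-credential swap: the agent holds only placeholder
# tokens; an egress proxy (outside the sandbox) swaps in real credentials.
# All names are hypothetical.

REAL_CREDS = {"SURROGATE_GITHUB_TOKEN": "ghp_realtoken123"}  # held outside the sandbox

def rewrite_outbound_headers(headers: dict[str, str]) -> dict[str, str]:
    """Replace surrogate tokens with real credentials at the egress boundary."""
    out = {}
    for key, value in headers.items():
        for surrogate, real in REAL_CREDS.items():
            value = value.replace(surrogate, real)
        out[key] = value
    return out

agent_headers = {"Authorization": "Bearer SURROGATE_GITHUB_TOKEN"}
egress_headers = rewrite_outbound_headers(agent_headers)
```

Combined with egress allow-listing, this means a compromised or prompt-injected agent cannot exfiltrate real credentials, because it never holds them.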

2001zhaozhao

Coding agents running 24/7 are pretty clearly the direction the industry is going now. We'll need either on-premises or cloud solutions, since an agent that runs 24/7 obviously can't live on your laptop. Cloud is better for making money, and some kind of VPC or local cloud solution is best for enterprise, but perhaps for individual devs a self-hosted system on a home desktop computer running 24/7 (hybrid desktop/server) would be the best solution?

gbnwl

So instead of using my Claude Code subscription, I can pay the vastly higher API rates to you so you can run Claude Code for me?

dennisy

Congrats on the launch. The agentbox-sdk looks interesting, but seeing as the first commit was 3 days ago, I feel a little wary of using it just yet! One question: do you have plans for any other forms of sandboxing that are a little more "lightweight"? Also, how do you add more agent types? Do you support just ACP?

a_t48

Does it support running Docker images inside the sandbox?

senordevnyc

How does this compare to something like Cursor Cloud Agents with a solid set of skills and tools?

auszeph

I built an internal version of this for my workplace. Something very useful, and most likely harder for you, is code search: having a proper index over hundreds of code repos so the agent can find where code is called from, or work out what the user means when they use an acronym or a slightly incorrect name. It's quite nice to use, and I'm sure someone will make a strong commercial offering. Good luck

wordpad

How does this compare to Jules from Google?

eranation

Edit: just noticed this is a semi-duplicate of the question at https://news.ycombinator.com/item?id=47723506 so rephrasing: will you have computer use, and will you offer a self-hosted runners option? (You being just the control plane / task orchestrator, which is the hardest problem, apparently...) Additional question: what types of sandboxes do you use? (Just Docker, or also Firecracker, etc.?)

Original comment: Congrats on the launch! What's the benefit over Cursor cloud agents with computer use (other than preventing vendor lock-in)? https://cursor.com/blog/agent-computer-use Or over the existing Claude Code Web?

eranation

HN hug of death probably, but your scorecard returns an error :( "The analysis request failed. Hosted shell completed without parseable score_repo.py JSON output. 11 command(s), 11 output(s)." (rest redacted)

qainsights

Cool. Tried it on my side project ai.dosa.dev to create a utility; it did well. PR: https://github.com/QAInsights/awesome-ai-tools/pull/23

cocoflunchy

Great timing, as I'm exploring the space to get rid of Cursor in our stack. For local dev everyone is switching to Claude Code or Codex. The state of the art for cloud agents right now, in my opinion, is Cursor, but their per-user pricing model doesn't make sense when what I want is to enable anyone in the company to fix things in the product.

Two things not immediately clear from your homepage:

- Do you support full computer use? Again, Cursor is the best I've tried there.

- What kinds of triggers do you support? In particular, we have one automation built with Cursor to auto-approve low-risk PRs; it triggers on a specific comment on a PR.

Finally, some advice from a user's POV: you need to invest a lot in the onboarding experience. I tried Devin today and couldn't get it to work after an hour of fiddling. How do you store the repo's setup scripts? Cursor cloud is pretty opaque and annoying to configure on that side. Anyway, I'll try it!

woeirua

I think cloud agents are the future, but I'll be honest: I don't see how a third-party provider survives in this space.

1. It's really not that hard to stand this up on your own. GitHub agentic workflows gets you 95% of the way there already.

2. Anthropic and Cursor are already playing in this space and will likely eat your lunch.

IMO, the only way you survive is to make this deployable behind the firewall. If you could do that, I would seriously consider using your product.

kuzivaai

"The agent can't skip steps" is doing a lot of work in that sentence. What happens when the plan itself is wrong? Curious whether the approval gate is genuinely blocking or if teams end up rubber-stamping to avoid being the bottleneck.
