Agents that run while I sleep

aray07 288 points 270 comments March 10, 2026
www.claudecodecamp.com · View on Hacker News

Discussion Highlights (20 comments)

BeetleB

I wish there was a way to "freeze" the tests. I want to write the tests first (or have Claude do it with my review), and then I want to get Claude to change the code to get them to pass - but with confidence that it doesn't edit any of the test files!
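One way to get that freeze today is to enforce it outside the model, e.g. in a pre-commit hook the agent can't edit around. A minimal sketch, assuming test files live under `tests/`; the `FROZEN` patterns and the `frozen_violations` helper are my own invention, not a Claude Code feature:

```python
# Sketch of a pre-commit hook that rejects commits touching "frozen" test files.
# The agent can still propose edits, but nothing it does to tests/ can land.
import fnmatch
import subprocess
import sys

FROZEN = ["tests/*", "test_*.py"]  # glob patterns for files that must not change

def frozen_violations(changed_paths, patterns=FROZEN):
    """Return the subset of changed paths that match a frozen pattern."""
    return [p for p in changed_paths
            if any(fnmatch.fnmatch(p, pat) for pat in patterns)]

def main():
    staged = subprocess.run(
        ["git", "diff", "--cached", "--name-only"],
        capture_output=True, text=True, check=True,
    ).stdout.splitlines()
    bad = frozen_violations(staged)
    if bad:
        print("refusing commit: frozen test files modified:", ", ".join(bad))
        sys.exit(1)

# Installed as .git/hooks/pre-commit (shebang + chmod +x), the script
# would simply end with a call to main().
```

The same check works as a CI gate if you don't trust local hooks.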

RealityVoid

It's... really the same problem when you hire people just to write tests. A lot of the time it just confirms that the code does what the code does. Having clear specs of what the code should do makes things better and clearer.

Havoc

They're definitely inferior to proper tests, but even weak CC tests on top of CC code are an improvement over no tests. If CC does make a change that shifts something dramatically, even a weak test may flag enough to get CC to investigate. Even better, though: external test suites. I recently made an S3 server, of which the LLM made quick work for an MVP. Then I found a Ceph S3 test suite that I could run against it, and oh boy. It ended up working really well as TDD, though.

digitalPhonix

> Changes land in branches I haven't read. A few weeks ago I realized I had no reliable way to know if any of it was correct: whether it actually does what I said it should do. I care about this. I don't want to push slop, and I had no real answer.

That's really putting the cart before the horse. How do you get to "merging 50 PRs a week" before thinking "wait, does this do the right thing?"

lateforwork

> When Claude writes tests for code Claude just wrote, it's checking its own work.

You can have Gemini write the tests and Claude write the code. And have Gemini review Claude's implementation as well. I routinely have ChatGPT, Claude, and Gemini review each other's code. Having AI write unit tests has not been a problem in my experience.

dzuc

red / green / refactor is a reasonable way through this problem

fragmede

Adversarial AI code gen. Have another AI write the tests, tell Codex that Claude wrote some code and to audit the code and write some tests. Tell Gemini that Codex wrote the tests. Have it audit the tests. Tell Codex that Gemini thinks its code is bad and to do better. (Have Gemini write out why into dobetter.md)
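The handoffs fragmede describes are mechanical enough to script. A sketch of one round, assuming each CLI is wrapped in a caller-supplied `run_agent(model, prompt)` helper that returns text (the helper and prompts are my own, not any tool's real API):

```python
def adversarial_round(run_agent, code, notes_path="dobetter.md"):
    """One round of cross-model auditing: Codex audits Claude's code and
    writes tests, Gemini audits those tests, and the critique goes back
    to Codex. run_agent(model, prompt) is a caller-supplied stub."""
    # Tell Codex that Claude wrote the code; have it audit and write tests.
    tests = run_agent("codex", f"Claude wrote this code. Audit it and write tests:\n{code}")
    # Tell Gemini that Codex wrote the tests; have it audit them.
    critique = run_agent("gemini", f"Codex wrote these tests. Audit them:\n{tests}")
    # Have Gemini write out why into dobetter.md, per the comment.
    with open(notes_path, "w") as f:
        f.write(critique)
    # Tell Codex that Gemini thinks its tests are bad and to do better.
    return run_agent("codex", f"Gemini's critique:\n{critique}\nDo better:\n{tests}")
```

The point of the framing ("Claude wrote this", "Codex wrote that") is to keep each model in an auditor's posture rather than a defender's.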

tayo42

I don't think this is right, because it's talking about Claude like it's an entity in the world. Claude reviewing Claude-generated code, framed as an individual reviewing its own code, isn't the same thing.

egeozcan

You can always tell Claude to use red-green-refactor, and that really is a step up from "yeah, don't forget to write tests and make sure they pass" at the end of the prompt, sure. But even better: tell it to create subagents to form a red team, a green team, and a refactor team while the main instance coordinates them, respecting the clean-room rules. It really works. The trick is just not mixing/sharing the context. Different instances of the same model do not recognize each other, so they stay more compliant.

jdlshore

Pet peeve: this post misunderstands “TDD.” What it really describes is acceptance tests. TDD is a tool for working in small steps, so you get continuous feedback on your work as you go, and so you can refine your design based on how easy it is to use in practice. It’s “red green refactor repeat”, and each step is only a handful of lines of code. TDD is not “write the tests, then write the code.” It’s “write the tests while writing the code, using the tests to help guide the process.” Thank you for coming to my TED^H^H^H TDD talk.
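To make the step size concrete, one micro-cycle really is only a handful of lines per step (a made-up `slugify` example, not from the post):

```python
# One TDD micro-cycle, illustrative only.

# Red: assert behavior that doesn't exist yet; this test fails first.
def test_slugify_lowercases():
    assert slugify("Hello World") == "hello-world"

# Green: the minimum code to make it pass.
def slugify(title):
    return title.lower().replace(" ", "-")

# Refactor: nothing to clean up yet; the next failing test
# (punctuation? unicode?) drives the next few lines.
```

Dozens of these cycles per hour, each one giving feedback on the design, is the practice being named.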

throwyawayyyy

I am afraid that we are heading toward a world in which we simply give up on the idea of correct code as an aspiration to strive for. Of course code has always been bad, and of course good code has never been a goal in the whole startup ecosystem (for perfectly legitimate reasons!). But the idea that real production code, for services that millions or even billions of people rely on, should be reliable, that if it breaks that's a problem, this is the whole _engineering_ part of software engineering. And we can say: if we give that up, we're going to have a whole lot more outages, security issues, all those things we are meant to minimize as a profession. And the answer is going to be: so what? We save money overall. And people will get used to software being unreliable; which is to say, people will not have a choice but to get used to it.

afro88

I guess to reach this point you have already decided you don't care what the code looks like. Something I'm starting to struggle with, now that agents can do longer and more complex tasks, is: how do you review all the code? Last week I did about 4 weeks of work over 2 days, first with long-running agents working against plans and checklists, then smaller task cleanups, bugfixes, and refactors. But all this code needs to be reviewed by myself and members of my team. How do we do this properly? It's like 20k lines of changes over 30-40 commits. There's no proper solution to this problem yet. One option is to start from scratch, using this branch as a reference, and reimplement in smaller PRs. I'm not sure that would actually save time overall, though.

bhouston

I call this "Test Theatre" and it is real. I wrote about it last year: https://benhouston3d.com/blog/the-rise-of-test-theater You have to actively work against it.

OsrsNeedsf2P

Our app is a desktop integration and last year we added a local API that could be hit to read and interact with the UI. This unlocked the same thing the author is talking about - the LLM can do real QA - but it's an example of how it can be done even in non-web environments. Edit: I even have a skill called release-test that does manual QA for every bug we've ever had reported. It takes about 10 hours to run but I execute it inside a VM overnight so I don't care.

seanmcdirmid

I've been doing differential testing in Gemini CLI using sub-agents. The idea is:

1. One agent writes/updates code from the spec.
2. One agent writes/updates tests from identified edge cases in the spec.
3. A QA agent runs the tests against the code. When a test fails, it examines the code and the test (it's the only agent that can see both) to determine blame, then gives feedback to the code and/or test-writing agent on what it perceives the problem to be so they can update their work. (Repeat 1 and/or 2, then 3, until all tests pass.)

Since the code can never fix itself directly to pass the test, and the test can never fix itself to accept the behavior of the code, you have some independence. The failure case is that the tests simply never pass, not that the test writer and code writer both arrive at the same incorrect understanding of the spec (which is very improbable, heat-death-of-the-universe improbable). It is much more likely that the spec isn't well grounded, is ambiguous or contradictory, or that the problem is too big for the LLM to handle, and so the tests simply never wind up passing.

storus

Wasn't the best practice to run one model/coding agent that writes the code and another one that reviews it? E.g. Claude Code for writing the code, GPT Codex to review/critique it? Different reward functions.

jaggederest

Anyone who wants a more programmatic version of this, check out cucumber / gherkin - very old school regex-to-code plain english kind of system.
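For flavor, the regex-to-code core of such a system fits in a few lines. This is a toy step registry, not cucumber's actual API:

```python
import re

STEPS = []  # (compiled pattern, handler) pairs, like cucumber step definitions

def step(pattern):
    """Register a plain-English step; regex capture groups become arguments."""
    def register(fn):
        STEPS.append((re.compile(pattern), fn))
        return fn
    return register

def run(line, context):
    """Dispatch one line of plain English to its matching handler."""
    for pat, fn in STEPS:
        m = pat.fullmatch(line)
        if m:
            return fn(context, *m.groups())
    raise LookupError(f"no step matches: {line!r}")

@step(r"the cart contains (\d+) items?")
def given_cart(ctx, n):
    ctx["cart"] = int(n)

@step(r"I add (\d+) items?")
def when_add(ctx, n):
    ctx["cart"] += int(n)

@step(r"the cart should contain (\d+) items?")
def then_cart(ctx, n):
    assert ctx["cart"] == int(n)

# A "scenario" is then just plain English, executed line by line:
ctx = {}
for line in ["the cart contains 2 items", "I add 1 item",
             "the cart should contain 3 items"]:
    run(line, ctx)
```

The relevance to this thread: the English lines are the frozen spec, and the step definitions are the only code the agent should be allowed to touch.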

monooso

I appear to be in the minority here. Perhaps because I've been practicing TDD for decades, this reads like the blog equivalent of "water is wet."

vidimitrov

He admits the real hole himself: "this doesn't catch spec misunderstandings. If your spec was wrong to begin with, the checks will pass." But there's a second problem underneath that one. Acceptance criteria are ephemeral. You write them before prompting, Playwright runs against them, and then where do they go? A Notion doc. A PR comment. Nowhere permanent. Next time an agent touches that feature, it's starting from zero again. The commit that ships the feature should carry the criteria that verified it. Git already travels with the code. The reasoning behind it should too.
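One concrete place for the criteria to live is git commit trailers, which standard tooling (`git interpret-trailers`, pretty-format `%(trailers)`) already understands. A sketch, where the `Acceptance-Criteria` trailer key and both helpers are my own invention:

```python
def add_criteria(message, criteria):
    """Append acceptance criteria to a commit message as git trailers,
    so they travel with the code instead of a Notion doc."""
    trailers = "\n".join(f"Acceptance-Criteria: {c}" for c in criteria)
    return f"{message}\n\n{trailers}\n"

def read_criteria(message):
    """Recover the criteria from a commit message (naive line-based parse)."""
    prefix = "Acceptance-Criteria: "
    return [line[len(prefix):] for line in message.splitlines()
            if line.startswith(prefix)]
```

An agent touching the feature later can `git log` the relevant files and feed these lines back into its context before making changes.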

foundatron

Feels like a whole bunch of us are converging on very similar patterns right now. I've been building OctopusGarden ( https://github.com/foundatron/octopusgarden ), which is basically a dark software factory for autonomous code generation and validation. A lot of the techniques were inspired by StrongDM's production software factory ( https://factory.strongdm.ai/ ).

The autoissue.py script ( https://github.com/foundatron/octopusgarden/blob/main/script... ) does something really close to what others in this thread are describing with information barriers. It's a 6-phase pipeline (plan, review plan, implement, cold code review, fix findings, CI retry) where each phase only gets the context it actually needs. The code review phase sees only the diff. Not the issue, not the plan. Just the diff. That's not a prompt instruction, it's how the pipeline is wired. Complexity ratings from the review drive model selection too, so simple stuff stays on Sonnet and complex tasks get bumped to Opus.

On the test freezing discussion, OctopusGarden takes a different approach. Instead of locking test files, the system treats hand-written scenarios as a holdout set that the generating agent literally never sees. And rather than binary pass/fail (which is totally gameable; the specification gaming point elsewhere in this thread is spot on), an LLM judge scores satisfaction probabilistically, 0-100 per scenario step. The whole thing runs in an iterative loop: generate, build in Docker, execute, score, refine. When scores plateau, there's a wonder/reflect recovery mechanism that diagnoses what's stuck and tries to break out of it.

The point about reviewing 20k lines of generated code is real. I don't have a perfect answer either, but the pipeline does diff truncation (caps at 100KB, picks the 10 largest changed files, truncates to 3k lines), and CI failures get up to 4 automated retry attempts that analyze the actual failure logs. At least overnight runs don't just accumulate broken PRs silently.

Also want to shout out Ouroboros ( https://github.com/Q00/ouroboros ), which comes at the problem from the opposite direction. Instead of better verification after generation, it uses Socratic questioning to score specification ambiguity before any code gets written. It literally won't let you proceed until ambiguity drops below a threshold. The core idea ("AI can build anything, the hard part is knowing what to build") pairs well with the verification-focused approaches everyone's discussing here. Spec refinement upstream, holdout validation downstream.
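For reference, the diff-truncation step described above (keep the largest files, clamp lines, cap bytes) is simple to sketch. The `truncate_diff` helper and its exact order of operations are my guess, not OctopusGarden's actual code:

```python
def truncate_diff(file_diffs, max_files=10, max_lines=3000, max_bytes=100_000):
    """Cap the review context: keep the max_files largest changed files,
    clamp to max_lines total, then to max_bytes. file_diffs maps
    path -> unified diff text for that file."""
    # Pick the N largest per-file diffs, biggest first.
    largest = sorted(file_diffs.items(),
                     key=lambda kv: len(kv[1]), reverse=True)[:max_files]
    lines = []
    for path, diff in largest:
        lines.extend(diff.splitlines())
    text = "\n".join(lines[:max_lines])
    # Byte cap last; ignore a possibly split trailing multibyte char.
    return text.encode()[:max_bytes].decode(errors="ignore")
```

The trade-off is obvious but explicit: the reviewer sees the biggest changes in full and the long tail not at all, rather than a uniform sample of everything.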
