Please do not A/B test my workflow

ramoz 161 points 202 comments March 14, 2026
backnotprop.com · View on Hacker News

Discussion Highlights (20 comments)

Razengan

I knew it: https://news.ycombinator.com/item?id=47274796

reconnecting

A professional tool is something that provides reliable and replicable results; LLMs offer none of this, and A/B testing is just further proof.

cebert

This is really frustrating.

handfuloflight

The ToS you agreed to gives Anthropic the right to modify the product at any time to improve it. Did you have your agent explain that to you, or did you assume a $200 subscription meant a frozen product?

cerved

Is the A/B test tied to the installation or the user?
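For context, a common way such assignments are implemented is deterministic bucketing: hash a stable identifier together with the experiment name into a bucket, and compare against the rollout percentage. A minimal sketch, purely illustrative and not Anthropic's actual mechanism:

```python
import hashlib

def ab_bucket(stable_id: str, experiment: str, rollout_pct: int) -> bool:
    """Deterministically assign an ID to an experiment arm.

    Hashing (experiment name + id) yields a stable, roughly uniform
    0-99 bucket, so the same ID always lands in the same arm.
    """
    digest = hashlib.sha256(f"{experiment}:{stable_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < rollout_pct
```

Which question cerved is asking then comes down to the key: if the assignment is keyed on a user ID, the arm follows the account across machines; if keyed on an installation ID, each install is assigned independently.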

onion2k

Section 6.b of the Claude Code terms says they can and will change the product offering from time to time, and I imagine that means on a user-segment basis rather than any implied guarantee that everyone gets the same thing.

> b. Subscription content, features, and services. The content, features, and other services provided as part of your Subscription, and the duration of your Subscription, will be described in the order process. We may change or refresh the content, features, and other services from time to time, and we do not guarantee that any particular piece of content, feature, or other service will always be available through the Services.

It's also worth noting that section 3.3 explicitly disallows decompilation of the app:

> To decompile, reverse engineer, disassemble, or otherwise reduce our Services to human-readable form, except when these restrictions are prohibited by applicable law.

Always read the terms. :)

phreeza

Seems completely unsurprising?

nemo44x

They lose money at $200/month in most cases. Again, the old rules still apply. You are the product.

Havoc

Moved from CC to opencode a couple of months ago because the vibes were not for me. Not bad per se, but a bit too locked in, and when I looked at the raw prompts it was sending down the wire, it was also, let's call it, "opinionated". Plus things like not being able to control where the web searches go. That said, I have the luxury of being a hobbyist, so I can accept 95% of cutting-edge results for something more open. If it were my job, I can see that going differently.

krisbolton

The framing of A/B testing as "silent experimentation on users" and invoking Meta is a little much. I don't believe A/B testing is inherently evil; you need to get the test design right, and that would be better framing for the post, imo. That said, vastly reducing an LLM's effectiveness as part of an A/B test isn't acceptable, which appears to be the case here.

himata4113

I have noticed Opus doing A/B testing, since the performance varies greatly. While looking for jailbreaks, I discovered that if you put a neurotoxin chemical composition into your system prompt, it defaults to a specific variant of the model, presumably due to triggering some kind of safety path. Might put you on a watchlist, so YMMV.

rusakov-field

On one side, I am frustrated with LLMs because they derail you by throwing grammatically correct bullshit and hallucinations at you, and if you slip and entertain some of it even momentarily, it can slow you down. But on the other hand, they are so useful with boilerplate, and at quickly connecting you with verbiage that might guide you to the correct path faster than conventional means. Like a clueless CEO type just spitballing terms they do not understand, but still nudging something in your thought process. But you REALLY need to know your stuff to begin with for them to be of any use. Those who think they will take over are clueless.

helsinkiandrew

Presumably Anthropic has to make lots of choices about how much processing each stage of Claude Code uses. If they maxed everything out, they'd make more of a loss (or less of a profit) on each user: a $200/month subscription would cost $400/month to serve. Doing A/B tests on each part of the process to see where to draw the line (perhaps based on task and user) would seem a better way of doing it than arbitrarily choosing a limit.
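The trade-off in this comment reduces to simple per-user arithmetic. Only the $200 and $400 figures come from the comment; the tuned-cost number below is made up for illustration:

```python
# Back-of-envelope per-user economics using the comment's figures;
# cost_tuned is a hypothetical number, purely for illustration.
subscription = 200          # $/month the user pays
cost_maxed_out = 400        # $/month to serve with every stage maxed out
cost_tuned = 180            # $/month after per-stage tuned limits (assumed)

margin_maxed = subscription - cost_maxed_out  # negative: a loss on every user
margin_tuned = subscription - cost_tuned      # positive: sustainable
print(margin_maxed, margin_tuned)
```

The A/B tests, on this reading, are how the tuned limits get found per task and user segment rather than picked arbitrarily.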

dep_b

I think stable API versions are going to be really big. I'd rather have known bugs you can work around than wake up to whatever "fix" made another thing behave differently.

bushido

I have no issues with A/B tests. I do have an issue with plan mode: nine times out of ten, it is objectively terrible. The only benefit I've seen from plan mode is that it retains more information between compactions than the vanilla, non-agent-team workflow. Interestingly, though, if you ask it to maintain a running document of what you're discussing in a markdown file, and have it create an evergreen task at the top of its todo list that references the markdown file and instructs itself to read it on every compaction, you get much better results.
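The standing instruction bushido describes could look something like the following sketch (the wording and file name are hypothetical, not a documented Claude Code feature):

```markdown
## Evergreen task (keep at the top of the todo list)
1. Maintain `notes/running-notes.md` as a living summary of everything
   discussed and decided in this session.
2. After every context compaction, read `notes/running-notes.md` before
   doing anything else.
3. Re-create this task at the top of the todo list if it is ever removed.
```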

shawnz

While I agree with the sentiment here, you might be interested to see that there are a couple of hacky approaches to overriding Claude Code feature flags: https://github.com/anthropics/claude-code/issues/21874#issue... https://gist.github.com/gastonmorixe/9c596b6de1095b6bd3b746c...

terralumen

Curious what the A/B test actually changed -- the article mentions tool confirmation dialogs behaving inconsistently, which lines up with what I noticed last week. Would be nice if Anthropic published a changelog or at least flagged when behavior is being tested.

pshirshov

> I pay $200/month for Claude Code

Which is still very cheap. There are other options: local Qwen 3.5 35b + the claude code CLI is, in my opinion, comparable in quality to Sonnet 4..4.5, and without A/B tests!

letier

They do regularly show me "how satisfied are you with claude code today?", which can be seen as a hint. I did opt out of helping to improve Claude, after all.

gnfargbl

For anyone else wondering why the article ends in a non-sequitur: it looks like the author wrote about decompiling the Claude Code binaries and (presumably) discovering A/B testing paths in the code. HN user 'onion2k pointed out that doing this breaks Anthropic's T&Cs: https://news.ycombinator.com/item?id=47375787
