Universal Claude.md – cut Claude output tokens
killme2008
231 points
88 comments
March 31, 2026
Related Discussions
Found 5 related stories in 52.9ms across 3,471 title embeddings via pgvector HNSW
- How I'm Productive with Claude Code neilkakkar · 161 pts · March 23, 2026 · 58% similar
- Code-review-graph: persistent code graph that cuts Claude Code token usage tirthkanani · 11 pts · March 09, 2026 · 58% similar
- Claudetop – htop for Claude Code sessions (see your AI spend in real-time) liorwn · 51 pts · March 14, 2026 · 55% similar
- Code Review for Claude Code adocomplete · 67 pts · March 09, 2026 · 55% similar
- Don't Wait for Claude jeapostrophe · 27 pts · March 27, 2026 · 54% similar
Discussion Highlights (20 comments)
yieldcrv
> Note: most Claude costs come from input tokens, not output. This file targets output behavior, so everything else, meaning your agents, skills, and MCP servers, will still consume input tokens as before.
rcleveng
While I love this set of prompts, I've not seen my Claude Opus 4.6 give such verbose responses when using Claude Code. Is this intended for use outside of Claude Code?
btown
It seems the benchmarks here are heavily biased towards single-shot explanatory tasks, not agentic loops where code is generated: https://github.com/drona23/claude-token-efficient/blob/main/...

And I think this raises a really important question. When you're deep into a project that's iterating on a live codebase, does Claude's default verbosity, where it's allowed to expound on why it's doing what it's doing when it's writing massive files, allow the session to remain more coherent and focused as context size grows? And in doing so, does it save overall tokens by making better, more grounded decisions?

The original link here has one rule that says: "No redundant context. Do not repeat information already established in the session." To me, I want more of that. Those are goal-oriented quasi-reasoning tokens that I do want it to emit, visualize, and use, and that very possibly keep it from getting "lost in the sauce."

By all means, use this in environments where output tokens are expensive and you're processing lots of data in parallel. But I'm not sure there's good data on this approach being effective for agentic coding.
sillysaurusx
> the file loads into context on every message, so on low-output exchanges it is a net token increase

Isn't this what Claude's personalization setting is for? It's globally on.

I like conciseness, but it should be because it makes the writing better, not because it saves you some tokens. I'd sacrifice extra tokens for outputs that were 20% better, and there's a correlation between conciseness and quality.

See also this Reddit comment for other things that supposedly help: https://www.reddit.com/r/vibecoding/s/UiOywQMOue

> Two things that helped me stay under [the token limit] even with heavy usage:
> Headroom - open source proxy that compresses context between you and Claude by ~34%. Sits at localhost, zero config once running. https://github.com/chopratejas/headroom
> RTK - Rust CLI proxy that compresses shell output (git, npm, build logs) by 60-90% before it hits the context window. Stacks on top of Headroom. https://github.com/rtk-ai/rtk
> MemStack - gives Claude Code persistent memory and project context so it doesn't waste tokens re-reading your entire codebase every prompt. That's the biggest token drain most people don't realize. https://github.com/cwinvestments/memstack
> All three stack together. Headroom compresses the API traffic, RTK compresses CLI output, MemStack prevents unnecessary file reads.

I haven't tested those yet, but they seem related and interesting.
Tostino
You have a benchmark for output token reduction, but no before/after comparison on some standard LLM benchmark to see whether the instructions hurt intelligence. Telling the model to only do post-hoc reasoning is an interesting choice, and may not play well with all models.
notyourav
It boggles my mind that an LLM "understands" these instructions and acts on them accordingly. I'm using this every day, and 1-shot working code is now a normal expectation, but man, it's still very, very hard to believe what LLMs have achieved.
andai
I told mine to remove all unnecessary words from a sentence and talk like caveman, which should result in another 50% savings ;)
johnwheeler
That's what I call a feature wishlist.
cheriot
I get where the authors are coming from with these: https://github.com/drona23/claude-token-efficient/blob/main/... But I'd rather use the "instruction budget" on the task at hand. Some, like the Code Output section, can fit a code review skill.
monooso
Paul Kinlan published a blog post a couple of days ago [1] with some interesting data showing that output tokens account for only 4% of token usage. It's a pretty wide-reaching article, so here's the relevant quote (emphasis mine):

> Real-world data from OpenRouter's programming category shows 93.4% input tokens, 2.5% reasoning tokens, and just 4.0% output tokens. It's almost entirely input.

[1]: https://aifoc.us/the-token-salary/
joshstrange
As with all of these cure-alls, I'm wary. Mostly because I anticipate the developer will lose interest in very little time, and also because it will just get subsumed into CC at some point if it actually works. It might take longer, but changing my workflow every few days for the new thing that's going to reduce MCP usage, replace it, compress it, etc., is way too disruptive. I'm generally happy with base Claude Code, and I think running a near-vanilla setup is the best option currently, given how quickly things are moving.
danpasca
I might be wrong, but based on the videos I've watched from Karpathy, this would generally make the model worse. I'm thinking of the math examples ("why can't ChatGPT do math?"), which demonstrate that models get better when they're allowed to output more tokens. So be aware, I guess.
xianshou
From the file: "Answer is always line 1. Reasoning comes after, never before." LLMs are autoregressive (filling in the completion of what came before), so you'd better have thinking mode on or the "reasoning" is pure confirmation bias seeded by the answer that gets locked in via the first output tokens.
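xianshou's point can be illustrated with a toy decoding loop (the model below is a made-up stand-in, not a real LLM): each new token is sampled conditioned only on the tokens already emitted, so once an "answer" token occupies line 1, every later "reasoning" token is generated with that answer already fixed in its context.

```python
class ToyModel:
    """Deterministic stand-in for an autoregressive LM: the next
    token is a fixed function of the tokens emitted so far."""
    def next_token(self, tokens):
        return tokens[-1] + 1

def decode(model, prompt_tokens, max_new=4):
    """Greedy autoregressive decoding. Each step conditions on the
    full prefix, including any early 'answer' token, so later output
    can only continue from it, never revise it."""
    tokens = list(prompt_tokens)
    for _ in range(max_new):
        tokens.append(model.next_token(tokens))
    return tokens

print(decode(ToyModel(), [10]))  # [10, 11, 12, 13, 14]
```

The structural point, not the toy arithmetic, is what matters: there is no mechanism for tokens emitted after the answer to change the answer, which is why reasoning-before-answer (or a separate thinking phase) is the usual ordering.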
foxes
> the honest trade off

Is this a subtle joke, or did they ask Claude to make a README that makes Claude better, tell it

> be critical

and just dump it on GitHub?
keyle
Amusing how this industry went from tweaking code for the best results to tweaking code generators for the best results. There don't seem to be any adults left in the room.
miguel_martin
Is there a "universal AGENTS.md" for minimal code & documentation outputs? I find all coding agents to be verbose, even with explicit instructions to reduce verbosity.
empressplay
That output is there for a reason. No LLM is profitable on a per-token basis right now; the AI companies would certainly love to output fewer tokens, since tokens cost _them_ money! The entire hypothesis behind doing this is somewhat dubious.
motoboi
Things like this make me sad because they make it obvious that most people don't understand a bit about how LLMs work. The "answer before reasoning" rule is good evidence of it: it misses the most fundamental property of transformers, that they are autoregressive. Also, reinforcement learning is what makes the model behave in the way you are trying to avoid, so the model's default output is actually what performs best on the kind of software engineering task you are trying to achieve. I'm not sure, but I'm pretty confident that response length is a target the model houses optimize for: the model is trained to achieve high scores on the benchmarks (and the training dataset) while minimizing length and sycophancy and preserving security and capability. So trying to change Claude too much from its default behavior will probably hurt capability. Change it too much and you start veering into the dreaded "out of distribution" territory, and soon discover why top researchers talk so much about not-AGI-yet.
nvch
The author proposes permanently putting 400 words into the context to save 55-90 output tokens in the T1-T3 benchmarks. Considering the 1:5 input:output token price ratio, this could increase total spending. With a few sentences like "be neutral" / "I understand ethics & tech" in the About Me, I don't recall any of the behavior the author complains about (and I use the same ~30 words for T2). (If I were Claude, I would despise a human who wrote this prompt.)
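nvch's break-even arithmetic can be sketched directly. The 1:5 input:output price ratio and the 55-90 saved tokens are from the comment; the ~520-token estimate for 400 words (roughly 1.3 tokens per word) and the per-exchange framing are my assumptions.

```python
# Hypothetical break-even check: does prepending a 400-word file
# (assumed ~520 input tokens) pay for itself via saved output tokens,
# given an input:output price ratio of 1:5?

def net_token_cost_units(extra_input_tokens, saved_output_tokens,
                         input_price=1.0, output_price=5.0):
    """Cost in arbitrary units; positive = the file costs more than it saves."""
    return extra_input_tokens * input_price - saved_output_tokens * output_price

print(net_token_cost_units(520, 55))  # 520*1 - 55*5 = 245.0, a net loss
print(net_token_cost_units(520, 90))  # 520*1 - 90*5 = 70.0, still a loss
```

Under these assumptions the file only breaks even once it saves more than ~104 output tokens per exchange, which supports the commenter's claim that total spending could go up. Prompt caching on the repeated file would change the numbers considerably.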
brikym
Can Anthropic kindly fuck off with their ADVERT.md already. It's AGENTS.md Sent from my iPhone