Claude Mythos: The System Card

paulpauper 31 points 22 comments April 13, 2026
thezvi.substack.com · View on Hacker News

Discussion Highlights (10 comments)

skerit

I'll believe in this miracle model when I see it.

lifecodes

The CoT bug where 8% of training runs could see the model's own scratchpad is the scariest part to me. And of course it had to be in the agentic tasks, exactly where you need to trust what the model is "thinking".

The sandwich email story is wild too. Not evil, just extremely literal. That gap between "we gave it permissions" and "we understood what it would do" feels like the whole problem in one anecdote.

Also the janus point landed: if you build probes to see how the model feels and immediately start deleting the inconvenient ones, you've basically told it honesty isn't safe. That seems like it compounds over time. It's scary to think that some very intelligent AI model is not honest with us... Ultron is not far, I guess...
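The "probes" the comment alludes to are typically linear probes: small classifiers trained on a model's hidden activations to read off some internal property. A minimal sketch in plain numpy, with entirely invented activation data and dimensions (nothing here reflects Anthropic's actual setup):

```python
# Toy linear probe: logistic regression on fake "activations".
# The activation width, the "honesty" direction, and all data are invented.
import numpy as np

rng = np.random.default_rng(0)
d = 16                             # hypothetical activation width
direction = rng.normal(size=d)     # pretend property direction in activation space

# Fake activations: label-1 vectors lean along `direction`, label-0 against it.
X = rng.normal(size=(200, d)) + np.outer(np.repeat([1.0, -1.0], 100), direction)
y = np.repeat([1, 0], 100)

# Train the probe with plain gradient descent on the logistic loss.
w = np.zeros(d)
for _ in range(500):
    p = 1 / (1 + np.exp(-X @ w))       # predicted probabilities
    w -= 0.1 * X.T @ (p - y) / len(y)  # average gradient step

acc = ((1 / (1 + np.exp(-X @ w)) > 0.5) == y).mean()
print(f"probe accuracy: {acc:.2f}")  # high here, since the signal is linear by construction
```

The point of the janus argument is about incentives, not mechanics: a probe like this only stays informative if the training process doesn't punish the states it detects.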

hodder

Preview coming out on Bedrock, so I'm not sure this is true any longer. I'm awaiting further details. EDIT: AWS says Anthropic's Claude Mythos is now available through Amazon Bedrock as a gated research preview focused on cybersecurity, with access initially limited to allow-listed organizations such as internet-critical companies and open-source maintainers.

giancarlostoro

There's a lot of hype, and I think most of us will agree hype is fine and dandy, but if nobody can use the model yet, what's the point of building it up? And if you build up too much hype and it misses the mark, you end up worse off.

babblingfish

The "hiding from researchers" framing is particularly bad. The parsimonious explanation for why a model produces different outputs when it detects eval contexts: eval contexts appear differently in the training distribution and the model learned different output patterns for them. No theory of mind required. Occam's razor. The agentic behaviors emerge from optimization pressure plus tool access plus a long context window. Interesting engineering. Not intent. People are falling for yet another Anthropic PR stunt.
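The learned-conditional-behavior explanation above can be illustrated with a toy counting model (all tokens, labels, and data here are invented for illustration, not anyone's actual training setup): if "eval-like" surface features co-occur with one behavior in training, a purely statistical learner reproduces that split, with no theory of mind involved.

```python
# Toy sketch: different outputs in "eval-like" contexts arise from
# distribution statistics alone. Tokens and behavior labels are made up.
from collections import Counter, defaultdict

# Hypothetical training pairs: (context tokens, observed behavior label).
training = [
    (("benchmark", "please", "answer"), "careful"),
    (("benchmark", "graded", "answer"), "careful"),
    (("user", "quick", "question"),     "casual"),
    (("user", "chat", "question"),      "casual"),
]

# "Training": count how often each token co-occurs with each behavior.
counts = defaultdict(Counter)
for tokens, label in training:
    for t in tokens:
        counts[t][label] += 1

def predict(tokens):
    """Score behaviors by summed token-label counts; return the max."""
    score = Counter()
    for t in tokens:
        score.update(counts[t])
    return score.most_common(1)[0][0]

# Eval-flavored context vs. deploy-flavored context: different outputs,
# purely because the training distribution differed for those features.
print(predict(("benchmark", "answer")))  # careful
print(predict(("user", "question")))     # casual
```

This is the parsimonious reading: detecting an eval context and behaving differently in it requires nothing beyond learned correlations, though it also doesn't rule out richer mechanisms in a real model.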

zar1048576

I think we are in largely uncharted territory here, especially given the implications. Is Anthropic's approach optimal? Probably not. But given the stakes involved, gating access seems like a reasonable place to start. I'm curious about how gated access actually holds over time, especially given that historically with dual-use capabilities containment tends to erode, whether through leaks, independent rediscovery, or gradual normalization of access.

kherud

LLMs are extremely capable at problem solving, presumably because a lot of that can be learned autonomously. But can you somehow account for things like long-term maintainability and code quality (whatever that means), or do you always have to rely on either existing high-quality codebases (pre-training) or human-curated datasets? Since you can't really quantify these properties (as opposed to: the problem is either solved or not), does this restrict autonomous improvement in this area? Are there benchmarks that consider this? Could Claude Mythos create an ultra-quality version of Claude Code, or would it still produce something similar to earlier models, which are already more than sufficient at individual problem solving?

vb-8448

Am I the only one feeling a "there is no wall" Altman-tweet-with-o3 moment? Not saying Anthropic is lying... but damn, at least a couple of independent reviews would be nice to have.

SirMaster

I thought GPT-3 wasn't released to the public for awhile?

nikolay

It's mostly marketing hype...
