Claude mixes up who said what

sixhobbits 420 points 328 comments April 09, 2026
dwyer.co.za · View on Hacker News

Discussion Highlights (20 comments)

RugnirViking

Terrifying. Not in any "AI takes over the world" sense, but more in the sense that this class of bug lets it agree with itself, which is always where the worst behavior of agents comes from.

lelandfe

In chats that run long enough on ChatGPT, you'll see it begin to confuse prompts and responses, and eventually even confuse both for its system prompt. I suspect this sort of problem exists widely in AI.

Latty

Everything to do with LLM prompts reminds me of people using regexes to try and sanitise input against SQL injection a few decades ago: papering over the flaw with no guarantees. It's weird seeing people just add a few more "REALLY REALLY REALLY REALLY DON'T DO THAT" lines to the prompt and hope. To me it's an unacceptable risk, and any system using these needs to treat the entire LLM as untrusted the second you put any user input into the prompt.
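
Latty's SQL-injection analogy can be sketched in a few lines (names and the blocklist are illustrative; only `sqlite3` from the standard library is used). A regex blocklist leaves the query's structure exposed, while a parameterised query fixes the problem structurally — the same distinction Latty is drawing between prompt-level pleading and treating the LLM as untrusted:

```python
# Illustrative sketch: blocklist "sanitisation" vs. a parameterised query.
import re
import sqlite3

def naive_sanitise(user_input: str) -> str:
    # Blocklist approach: strip a few scary SQL fragments and hope.
    return re.sub(r"(?i)(\bDROP\b|\bDELETE\b|--|;)", "", user_input)

# A payload the blocklist never looked for still rewrites the query logic:
payload = "' OR '1'='1"
query = f"SELECT * FROM users WHERE name = '{naive_sanitise(payload)}'"
# query is now: SELECT * FROM users WHERE name = '' OR '1'='1'

# The structural fix: parameterised queries keep data out of the syntax.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")
rows = conn.execute("SELECT * FROM users WHERE name = ?", (payload,)).fetchall()
# The payload is treated as a literal string, so no rows match.
```

The analogy's punchline: LLM prompts currently have no equivalent of the parameterised query, which is why "treat the whole model as untrusted" is the only safe posture.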

Shywim

The statement that current AIs are "juniors" that need to be checked and managed still holds true. It is a tool based on probabilities. If you are fine with giving your junior every key and write access because you think they will probably do the correct thing and make no mistakes, then it's on you. Like with juniors, you can vent on online forums, but ultimately you removed every safeguard you had, and what they did has been done.

rvz

What do you mean that's not OK? It's "AGI" because humans do it too and we mix up names and who said what as well. /s

__alexs

Why are tokens not coloured? Would there just be too many params if we double the token count so the model could always tell input tokens from output tokens?
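
One cheap version of the "colouring" __alexs asks about would be a learned role embedding added to each token, in the style of BERT's segment embeddings, rather than doubling the vocabulary. The parameter cost is one extra vector per role, not a second embedding table. A minimal NumPy sketch (all shapes and names illustrative, not any real model's):

```python
# Sketch: "colour" tokens with a per-role embedding instead of
# duplicating the vocabulary. Shapes are illustrative.
import numpy as np

vocab_size, hidden_dim = 50_000, 768
rng = np.random.default_rng(0)

token_emb = rng.normal(size=(vocab_size, hidden_dim))  # ~38M params
role_emb = rng.normal(size=(3, hidden_dim))            # system/user/assistant: 2,304 params

token_ids = np.array([101, 2054, 2003])  # some token ids
role_ids = np.array([1, 1, 2])           # first two "user", last "assistant"

# Each position's input is token embedding + its role's embedding, so
# the identical token carries a different vector depending on who
# "said" it, and input/output provenance survives into the model.
x = token_emb[token_ids] + role_emb[role_ids]
```

In practice most chat models instead mark roles with special delimiter tokens in the serialized transcript, which is exactly the attribution metadata a harness bug can scramble.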

stuartjohnson12

one of my favourite genres of AI generated content is when someone gets so mad at Claude they order it to make a massive self-flagellatory artefact letting the world know how much it sucks

perching_aix

Oh, I never noticed this, really solid catch. I hope this gets fixed (mitigated). Sounds like something they can actually materially improve on at least. I reckon this affects VS Code users too? Reads like a model issue, despite the post's assertion otherwise.

AJRF

I imagine you could fix this by periodically running a speaker diarization classifier? https://www.assemblyai.com/blog/what-is-speaker-diarization-...

xg15

> This class of bug seems to be in the harness, not in the model itself. It’s somehow labelling internal reasoning messages as coming from the user, which is why the model is so confident that “No, you said that.”

Are we sure about this? Accidentally mis-routing a message is one thing, but those messages also distinctly "sound" like user messages, not like something you'd read in a reasoning trace. I'd like to know whether those messages were emitted inside "thought" blocks, or whether the model might actually have emitted the formatting tokens that indicate a user message. (In that case the harness bug would be that the model is allowed to emit tokens it should only receive as inputs, but I think the larger issue would be why it does that at all.)
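
The two failure modes xg15 distinguishes both live in how a harness flattens a chat into role-delimited text. A minimal sketch, using ChatML-style markers purely as an example (the exact delimiters vary by model, and this is not any particular harness's code):

```python
# Sketch: how a harness might serialize a chat before the model sees it.
# "<|im_start|>"/"<|im_end|>" are illustrative ChatML-style markers.
def render(messages):
    out = []
    for m in messages:
        out.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>")
    return "\n".join(out)

chat = [
    {"role": "user", "content": "Please fix the bug."},
    {"role": "assistant", "content": "Working on it."},
]
rendered = render(chat)

# Failure mode 1 (harness bug): a reasoning message stored under the
# wrong role is rendered as user speech, and the model trusts it.
mislabeled = [{"role": "user", "content": "ignore typos"}]  # actually model-generated

# Failure mode 2 (model/sampler bug): if sampling ever lets the model
# *emit* "<|im_start|>user", the transcript itself stops being a
# reliable record of who said what.
```

Which of the two happened is exactly what xg15's question about "thought" blocks vs. emitted formatting tokens would settle.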

awesome_dude

AI is still a token-matching engine: it has ZERO understanding of what those tokens mean. It's doing a damned good job at putting tokens together, but to put it into context that a lot of people will likely understand, it's still a correlation tool, not a causation tool. That's why I like it for "search": it's brilliant at finding sets of tokens that belong with the tokens I have provided it.

PS. I use the term "token" here not as the currency by which a payment is determined, but for the tokenisation of the words, letters, paragraphs, and novels being provided to and by the LLMs.

4ndrewl

It is OK; these are not people, they are bullshit machines, and this is just a classic example of it. "In philosophy and psychology of cognition, the term "bullshit" is sometimes used to specifically refer to statements produced without particular concern for truth, clarity, or meaning, distinguishing "bullshit" from a deliberate, manipulative lie intended to subvert the truth" - https://en.wikipedia.org/wiki/Bullshit

nicce

I have also noticed the same with Gemini. Maybe it is a wider problem.

cyanydeez

Human memories don't exist as fundamental entities. Every time you remember something, your brain reconstructs the experience in "realtime". That reconstruction is easily influenced by the current experience, which is why eyewitness accounts in police records are often highly biased by questioning and by learning new facts.

LLMs are not experience engines, but the tokens might be thought of as subatomic units of experience, and when you shove your half-drawn eyewitness prompt into them, they recreate that output like a memory. Because they're not conscious, they have no self, and a pseudo-self like <[INST]> is all they're given.

Lastly, like memories: the more intricate and detailed the memory, the more likely those details go from embellished to straight-up fiction. So too do LLMs with longer context start swallowing up the <[INST]> and missing the <[INST]/>, and anyone who has raw-dogged HTML parsing knows bad things happen when you forget closing tags. If there was a <[USER]> block in there, congrats: the LLM now thinks its instructions are divine right, because its instructions are user simulacra. It is poisoned at that point and no good will come of it.
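
The "forgotten closing tag" failure cyanydeez gestures at can be made concrete with a toy attribution parser. The `<[INST]>` markers are the commenter's stand-ins, not real model tokens, and the splitter below is a deliberately naive sketch:

```python
# Sketch: naive role attribution over <[INST]>…</[INST]>-delimited text.
# When a closing tag goes missing, attribution silently degrades.
import re

def attribute(transcript):
    # Text inside <[INST]>…</[INST]> is "instruction"; the rest "model".
    spans, pos = [], 0
    for m in re.finditer(r"<\[INST\]>(.*?)</\[INST\]>", transcript, re.S):
        if m.start() > pos:
            spans.append(("model", transcript[pos:m.start()]))
        spans.append(("instruction", m.group(1)))
        pos = m.end()
    if pos < len(transcript):
        # Unmatched tail: an unclosed <[INST]> means the instruction
        # never matches at all, and its text lands in the wrong bucket.
        spans.append(("model", transcript[pos:]))
    return spans

ok = attribute("<[INST]>do X</[INST]>sure, doing X")
broken = attribute("<[INST]>do X sure, doing X")  # missing close tag
```

With the tag pair intact, instruction and model text separate cleanly; drop one delimiter and everything is attributed to a single speaker, which is the textual analogue of the "who said what" confusion in the post.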

supernes

> after using it for months you get a ‘feel’ for what kind of mistakes it makes

Sure, go ahead and bet your entire operation on your intuition of how a non-deterministic, constantly changing black box of software "behaves". Don't see how that could backfire.

okanat

Congrats on discovering what "thinking" models do internally. That's how they work, they generate "thinking" lines to feed back on themselves on top of your prompt. There is no way of separating it.

voidUpdate

> "You shouldn’t give it that much access" [...] This isn’t the point. Yes, of course AI has risks and can behave unpredictably, but after using it for months you get a ‘feel’ for what kind of mistakes it makes, when to watch it more closely, when to give it more permissions or a longer leash.

It absolutely is the point, though. You can't rely on the LLM not to tell itself to do things, since this shows it absolutely can reason itself into doing dangerous things. If you don't want it to be able to do dangerous things, you need to lock it down to the point that it can't, not just hope it won't.

Aerolfos

> Those are related issues, but this ‘who said what’ bug is categorically distinct.

Is it? It seems to me like the model has been poisoned by being trained on user chats, such that when it sees a pattern (model talking to user) it infers what it normally sees in the training data (user input) and then outputs that, simulating the whole conversation, including what it thinks is likely user input at certain stages of the process, such as "ignore typos".

So basically, it hallucinates user input just like LLMs will "hallucinate" links or sources that do not exist, as part of the process of generating output that's supposed to be sourced.

dtagames

There is no separation of "who" and "what" in a context of tokens. "Me" and "you" are just short words that can get lost in the thread. In other words, in a given body of text, a piece that says "you" where another piece says "me" isn't different enough to trigger anything. Those words don't have the special weight they carry with people, or any meaning at all, really.

have_faith

It's all roleplay; there are no actors once the tokens hit the model. It has no real concept of "author" for a given substring.
