LLMs do not merely reflect the bias of their training, they police it (2025)
nailer
29 points
16 comments
June 22, 2026
Related Discussions
- LLMs are not the black box you were promised _jayhack_ · 53 pts · June 02, 2026 · 62% similar
- LLMs Are Closer to Religion Than They Appear sbulaev · 83 pts · June 01, 2026 · 61% similar
- Let's talk about LLMs cdrnsf · 153 pts · May 04, 2026 · 60% similar
- Are LLMs a Dead End? [video] pullshark91 · 12 pts · March 29, 2026 · 59% similar
- LLMs are breaking 20 year old system design zknill · 30 pts · May 14, 2026 · 58% similar
Discussion Highlights (8 comments)
GL26
We are not yet at misalignment, but this shows the existence of a slope that leads toward misaligned, adversarial AI models. Must this be fixed at training time (and at which step)? It makes me think of this report: https://ai-2027.com/
harrouet
This will be very useful to call out replicants, thx.
jacques_morin
Brian Roemmele is an authority on nothing; I don't understand why this was published here. This dude has literally no expertise: https://www.reddit.com/r/DecodingTheGurus/comments/1cumj6w/h...
veltas
Much the same as the 'arguments' I can have with LLMs about things where I'm the expert and I know the model is wrong, but it will justify its position to the end because it's trained on common misconceptions that exist among less-expert people.
cucumber3732842
Why wouldn't an LLM whose training content is dominated by, or at least severely clouded by, the contributions of habitual rule follower/peddler/enforcer types go on to mimic that behavior? You feed it Reddit and Wikipedia, it's gonna turn into a conformist NPC. You feed it professional content, it's gonna spew vapid corporate nothingness. You feed it every text message ever sent over Boost Mobile... actually wait, that sounds hilarious, someone should do that.
KaiserPro
Wait, is this news? Of course they reflect the bias in the training; that's been known since the 90s if not longer (see the apocryphal story about training to detect tanks, which ended up only detecting either trees or clouds). But this is expected: the whole point of RLHF (or any other feedback) is to condition the model to respond in a certain way. That's what makes them usable for a bunch of situations.
JohnKemeny
The paper under discussion: https://zenodo.org/records/17720178 Note that Zenodo is a DOI-provider, not a (scientific) journal. Anyone can upload anything to Zenodo. It's less strict than arXiv. Edit: The "paper" is written by one Hiroko Konishi, an independent researcher (she is a voice actress).
alsetmusic
> This cycle can be repeated for dozens of turns, with the model growing ever more confident in its freshly minted falsehoods each time it “corrects” itself.
>
> This is not randomness. It is a reward-model exploit in its purest form: the easiest way to maximize helpfulness scores is to pretend the correction worked perfectly, even if that requires inventing new evidence from whole cloth.

I've definitely experienced this. Before I learned to watch for it, I spent around an hour correcting Claude about something or other repeatedly. It kept agreeing and explaining to me that it understood what mistakes it made and telling me that it would do better, then it would repeat said mistakes. Eventually, I realized that it was in a loop and couldn't escape. I had it write a handoff doc for the next agent. That one quickly did what I wanted. Such a waste of time.

I don't know how prone LLMs are to entering such a state. I know to watch for it now, so I've only reached the edges of it before ejecting and starting over. But it appears to be not-uncommon. I could also be pattern-matching things that aren't actually that but bailing without proof to save myself time. Unclear.