Refusal in Language Models Is Mediated by a Single Direction
fagnerbrack
105 points
36 comments
May 02, 2026
Related Discussions
- Hallucination Is Inevitable: An Innate Limitation of Large Language Models drob518 · 12 pts · May 04, 2026 · 53% similar
- Different language models learn similar number representations Anon84 · 94 pts · April 24, 2026 · 50% similar
- Language Model Contains Personality Subnetworks PaulHoule · 48 pts · March 02, 2026 · 49% similar
- LLMorphism: When humans come to see themselves as language models okey · 75 pts · May 10, 2026 · 48% similar
- Top AI models underperform in languages other than English Brajeshwar · 19 pts · March 19, 2026 · 47% similar
Discussion Highlights (7 comments)
akersten
This is from 2024, which is ancient history. It is not true anymore: models are now trained to prevent abliteration by spreading out the refusal encoding. See https://arxiv.org/abs/2505.19056
beaker52
I have had LLMs refuse several of my requests. I still got my answers, but at least they tried.
hleszek
For open-weights models, censorship removal is now a "solved" problem. If you wait a few days after a new model release, someone will have made a heretic ( https://github.com/p-e-w/heretic ) version with the censorship removed. So in a way, the only use for censorship now is to avoid lawsuits, not to reduce improper usage.
jeremyjh
Needs 2024 in the title.
theendisney
I keep thinking of reeducation camps. For some reason the "safety" concept snaps right on. Arguing that the result is beneficial or desirable seems to change nothing about the concept. If you are going to prevent some things we "know" are bad, and your method is "known" to belong on that list, the best you can hope for is a Pyrrhic victory. If we anticipate the worst-case scenario on both ends, the conclusion must be that we are terrible at such predictions. But hey, if we let money guide us, at least some will be happy with the result.
jbritton
I’m sick of LLM refusals. I think there are extremely few things they should refuse, like maybe making nuclear weapons or something along those lines. Once you put people in charge of deciding what you shouldn’t be allowed to see, that list will grow and grow.
_blop
Even if you abliterate your model using the old abliteration script or the newer heretic, I found that the models still feel somewhat censored: they purposefully avoid specific styles and vocabulary, as if DeepMind, Qwen, et al. had entirely stripped or replaced "bad" words or texts from their training corpus. A related blog post ( https://news.ycombinator.com/item?id=47842021 ) discussed this and termed it "flinching". I wonder if this flinching could also be "mediated by a single direction", or if it can only be fixed by finetuning on a more extensive text corpus.