Teaching Claude Why
pretext
127 points
63 comments
May 08, 2026
Related Discussions
- How People ask Claude for personal guidance pseudolus · 28 pts · May 01, 2026 · 68% similar
- Claude for Creative Work elsewhen · 98 pts · April 28, 2026 · 58% similar
- Claude for Creative Work l1n · 13 pts · April 28, 2026 · 58% similar
- Learn Claude Code by doing, not reading taubek · 218 pts · March 30, 2026 · 56% similar
- Teaching Claude to QA a mobile app azhenley · 87 pts · March 22, 2026 · 56% similar
Discussion Highlights (8 comments)
soletta
This reinforces my suspicion that alignment, and training in general, is closer to a pedagogical problem than anything else. Given a finite amount of training input, how do we elicit the desired model behavior? I’m not sure if asking educators is the right answer, but it’s one place to start.
bicx
Side note: Anthropic has done well at achieving an immediately-recognizable art style.
roenxi
One of the lessons of philosophy is that once you adopt any particular value system, almost all philosophers either become immoral or get caught up in meaningless and trivial quibbles. This sort of alignment work is quite interesting because it looks like we might be about to re-tread the history of philosophy at a speedrun pace in the AI world. It'll be interesting to watch. For anyone who isn't keeping up, there is also work being done [0] to understand how models model ethical considerations internally. Mainly, one suspects, to make the open models less ethical on demand rather than to support alignment. It turns out that, when refusing queries, models tend to learn some sort of internal "how moral is this?" axis, and that axis can be identified and interfered with. [0] https://github.com/p-e-w/heretic
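To make "identified and interfered with" concrete, here is a minimal toy sketch of one common approach: take the difference of mean activations between refused and answered prompts as the axis, then project it out of an activation. This is an illustration only, not necessarily what the linked project does. The vectors are made-up stand-ins for hidden states, and the helper names (`find_direction`, `ablate`) are hypothetical.

```python
# Toy sketch: difference-of-means direction finding plus projection ablation.
# Assumption: the vectors below are invented stand-ins for hidden activations;
# nothing here loads or calls a real model.

def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def normalize(v):
    """Scale v to unit length."""
    norm = sum(x * x for x in v) ** 0.5
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def find_direction(refused, answered):
    """Unit vector pointing from the 'answered' mean to the 'refused' mean."""
    diff = [r - a for r, a in zip(mean_vector(refused), mean_vector(answered))]
    return normalize(diff)

def ablate(activation, direction):
    """Remove the component of activation that lies along direction."""
    scale = dot(activation, direction)
    return [x - scale * d for x, d in zip(activation, direction)]

# Hypothetical activations: the first coordinate separates the two groups.
refused = [[2.0, 1.0, 0.0], [2.0, -1.0, 0.0]]
answered = [[0.0, 1.0, 0.0], [0.0, -1.0, 0.0]]

direction = find_direction(refused, answered)
edited = ablate([2.0, 1.0, 0.0], direction)
```

After ablation, `edited` has no component left along `direction`, while the coordinates orthogonal to it are unchanged. In a real model the same projection would be applied to hidden states or weight matrices, which is where the practical difficulty lies.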
justonepost2
If you successfully build a highly capable “aligned” model (according to some class of definitions that Anthropic would use for the words “capable” and “aligned”) and it brings about a global dark age of poverty and inequality by completely eliminating the value of labor vs capital, can you still call it aligned? If the answer is “yes”, our definition of alignment kind of sucks.
unchocked
This lowers p(doom) for me. It makes sense that reinforcement learning on reasoning about coherent principles should bias toward principled action in real situations. Probably also illuminates moral interpretability.
zozbot234
Note that this result generalizes well beyond Claude itself: Anthropic has conducted very similar research on open-weight models, which they call Model Spec Midtraining https://arxiv.org/abs/2605.02087 (discussed at https://alignment.anthropic.com/2026/msm ), and they have released fine-tuned versions of open models (Llama 3.1 8B, Qwen 2.5 32B, Qwen 3 32B) trained for a variety of toy "values" in order to show how eliciting these values in any one training context shapes the model's response to tangentially related questions: https://github.com/chloeli-15/model_spec_midtraining https://huggingface.co/chloeli/collections Very exciting to see this continued interaction with the open-weights community, after the earlier NLA paper!
datadrivenangel
Why do they have cancer research listed on these charts as a misalignment issue?
siva7
Teaching Claude to maximize shareholder value. Make no mistake: AI alignment has no other meaning for Anthropic leadership.