Teaching Claude Why

pretext 127 points 63 comments May 08, 2026

Discussion Highlights (8 comments)

soletta

This reinforces my suspicion that alignment, and training in general, is closer to a pedagogical problem than anything else. Given a finite amount of training input, how do we elicit the desired model behavior? I’m not sure asking educators is the right answer, but it’s one place to start.

bicx

Side note: Anthropic has done well at achieving an immediately-recognizable art style.

roenxi

One of the lessons of philosophy is that once you adopt any particular value system, almost all philosophers either become immoral or get caught up in meaningless and trivial quibbles. This sort of alignment work is quite interesting because it looks like we might be about to re-tread the history of philosophy at a speedrun pace in the AI world. It'll be interesting to watch. For anyone who isn't keeping up, there is also work being done [0] to understand how models represent ethical considerations internally. Mainly, one suspects, to make the open models less ethical on demand rather than to support alignment. It turns out that models tend to learn some sort of internal "how moral is this?" axis, used when refusing queries, that can be identified and interfered with. [0] https://github.com/p-e-w/heretic
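To make "identified and interfered with" concrete, here is a minimal toy sketch of one common way such an axis is handled: take the difference of mean activations between refused and answered prompts as a direction, then project that direction out of an activation. This is not the linked project's code; the 3-dimensional vectors and the names `find_direction` and `ablate` are hypothetical stand-ins for real model activations and tooling.

```python
# Toy illustration only: made-up 3-d "activations", not real model data.

def mean(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def unit(v):
    """Scale v to length 1."""
    norm = dot(v, v) ** 0.5
    return [x / norm for x in v]

def find_direction(refused, answered):
    """Difference-of-means direction between two sets of activations (hypothetical helper)."""
    diff = [r - a for r, a in zip(mean(refused), mean(answered))]
    return unit(diff)

def ablate(activation, direction):
    """Remove the component of `activation` along the unit vector `direction`."""
    scale = dot(activation, direction)
    return [x - scale * d for x, d in zip(activation, direction)]

# Assumed data: activations on prompts the model refused vs. answered.
refused = [[2.0, 0.1, 0.0], [2.2, -0.1, 0.1]]
answered = [[0.0, 0.1, 0.0], [0.2, -0.1, 0.1]]

direction = find_direction(refused, answered)
edited = ablate([2.1, 0.5, -0.3], direction)
```

After `ablate`, the edited vector has no component along the found direction, while the components orthogonal to it are left unchanged.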

justonepost2

If you successfully build a highly capable “aligned” model (according to some class of definitions that Anthropic would use for the words “capable” and “aligned”) and it brings about a global dark age of poverty and inequality by completely eliminating the value of labor vs capital, can you still call it aligned? If the answer is “yes”, our definition of alignment kind of sucks.

unchocked

This lowers p(doom) for me. It makes sense that reinforcement learning on reasoning about coherent principles should bias toward principled action in real situations. Probably also illuminates moral interpretability.

zozbot234

Note that this result generalizes well beyond Claude itself: Anthropic has conducted very similar research on open-weight models, which they call Model Spec Midtraining (https://arxiv.org/abs/2605.02087, discussed at https://alignment.anthropic.com/2026/msm ). They have released fine-tuned versions of open models (Llama 3.1 8B, Qwen 2.5 32B, Qwen 3 32B) trained for a variety of toy "values", in order to show how eliciting these values in any one training context shapes the model's response to tangentially related questions: https://github.com/chloeli-15/model_spec_midtraining https://huggingface.co/chloeli/collections Very exciting to see this continued interaction with the open-weights community, after the earlier NLA paper!

datadrivenangel

Why do they have cancer research listed on these charts as a misalignment issue?

siva7

Teaching Claude to maximize shareholder value. Make no mistake: AI alignment has no other meaning for Anthropic leadership.
