Tree Search Distillation for Language Models Using PPO
at2005
42 points
1 comment
March 15, 2026
Discussion Highlights (1 comment)
supermdguy
> One might note that MCTS uses more inference compute on a per-sample basis than GRPO: of course it performs better

This part confused me. It sounded like they were only doing the MCTS at train time, and then using GRPO to distill the MCTS policy into the model weights. So wouldn’t the model still have the same inference cost?