Tree Search Distillation for Language Models Using PPO
at2005
42 points
1 comment
March 15, 2026
Related Discussions
Found 5 related stories in 44.3ms across 3,471 title embeddings via pgvector HNSW
- Language model teams as distributed systems jryio · 87 pts · March 16, 2026 · 47% similar
- NanoGPT Slowrun: Language Modeling with Limited Data, Infinite Compute sdpmas · 147 pts · March 04, 2026 · 45% similar
- Language Model Contains Personality Subnetworks PaulHoule · 48 pts · March 02, 2026 · 41% similar
- Grove: Distributed ML Training over AirDrop swar_ja · 32 pts · March 25, 2026 · 41% similar
- Top AI models underperform in languages other than English Brajeshwar · 19 pts · March 19, 2026 · 41% similar
Discussion Highlights (1 comment)
supermdguy
> One might note that MCTS uses more inference compute on a per-sample basis than GRPO: of course it performs better

This part confused me: it sounded like they were only doing the MCTS at train time, then using GRPO to distill the MCTS policy into the model weights. So wouldn't the model still have the same inference cost?
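The setup the commenter describes (spend search compute at train time, then distill so that inference is a single plain forward pass) can be sketched in miniature. Everything below is an assumption for illustration, not the post's actual method: a 3-action toy "model", noiseless rewards, a UCB-style root search standing in for MCTS, and a plain cross-entropy update standing in for the GRPO distillation step.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

# Toy "model": logits over 3 actions; hidden rewards favor action 2.
logits = [0.0, 0.0, 0.0]
rewards = [0.0, 0.5, 1.0]  # assumed toy reward signal

def search_policy(logits, n_sims=200):
    """Train-time only: spend extra compute on simulated rollouts and
    return a visit-count distribution sharpened toward high reward."""
    prior = softmax(logits)
    visits = [0, 0, 0]
    values = [0.0, 0.0, 0.0]
    for _ in range(n_sims):
        total = sum(visits) + 1
        # UCB-style score: mean value plus prior-weighted exploration bonus
        scores = [
            (values[a] / visits[a] if visits[a] else 0.0)
            + prior[a] * math.sqrt(total) / (1 + visits[a])
            for a in range(3)
        ]
        a = max(range(3), key=lambda i: scores[i])
        visits[a] += 1
        values[a] += rewards[a]  # noiseless rollout in this toy
    s = sum(visits)
    return [v / s for v in visits]

# Distillation loop: push the model's logits toward the search
# distribution (softmax cross-entropy gradient is target - prob).
lr = 0.5
for _ in range(50):
    target = search_policy(logits)
    probs = softmax(logits)
    logits = [l + lr * (t - p) for l, t, p in zip(logits, target, probs)]

# Inference: one forward pass, no search -- the per-sample cost is
# identical to the undistilled model, which is the commenter's point.
final = softmax(logits)
assert final.index(max(final)) == 2
```

The search is only ever called inside the training loop; at deployment the distilled policy is read off with a single `softmax(logits)`, so the extra compute MCTS burned never appears in the inference budget.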