Tree Search Distillation for Language Models Using PPO

at2005 42 points 1 comment March 15, 2026
ayushtambde.com · View on Hacker News

Discussion Highlights (1 comment)

supermdguy

> One might note that MCTS uses more inference compute on a per-sample basis than GRPO: of course it performs better

This part confused me. It sounded like they were only doing the MCTS at train time, and then using GRPO to distill the MCTS policy into the model weights. So wouldn't the model still have the same inference cost?
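The asymmetry the commenter is pointing at can be sketched as follows. This is a toy illustration (all function names are hypothetical, not from the post): tree search multiplies compute per decision during training, but once its behavior is distilled into the weights, inference is a single forward pass.

```python
# Toy sketch of the train-time vs. inference-time compute asymmetry
# in search distillation. Names here (policy, mcts_improved_action,
# train_step, inference_step) are illustrative, not from the article.

def policy(state):
    # Stand-in for one model forward pass (1 unit of compute).
    return (state + 1) % 4  # toy deterministic "action"

def mcts_improved_action(state, simulations=32):
    # Train time only: search calls the policy many times per decision.
    calls = 0
    action = None
    for _ in range(simulations):
        action = policy(state)
        calls += 1
    return action, calls

def train_step(state):
    # The distillation target comes from search; the weights are
    # trained to imitate it, absorbing the search's behavior.
    _, search_cost = mcts_improved_action(state)
    return search_cost  # forward passes spent per training decision

def inference_step(state):
    # The distilled model acts directly: no search at deployment.
    policy(state)
    return 1  # forward passes spent per inference decision

print(train_step(0))      # 32 policy calls during training
print(inference_step(0))  # 1 policy call at inference
```

Under this reading, the extra MCTS compute is a one-time training cost, and the deployed model's per-sample inference cost is unchanged, which is the point the commenter is questioning.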
