Show HN: PhAIL – Real-robot benchmark for AI models

vertix 20 points 8 comments March 31, 2026
phail.ai

I built this because I couldn't find honest numbers on how well VLA models [1] actually work on commercial tasks. I come from search ranking at Google, where you measure everything; in robotics, nobody seemed to know.

PhAIL runs four models (OpenPI/pi0.5, GR00T, ACT, SmolVLA) on bin-to-bin order picking, one of the most common warehouse operations. Same robot (Franka FR3), same objects, hundreds of blind runs – the operator doesn't know which model is running. Best model: 64 UPH (units picked per hour). A human teleoperating the same robot: 330. A human picking by hand: 1,300+.

Everything is public: every run with synced video and telemetry, the fine-tuning dataset, and the training scripts. The leaderboard is open for submissions. Happy to answer questions about the methodology, the models, or what we observed.

[1] Vision-Language-Action: https://en.wikipedia.org/wiki/Vision-language-action_model
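The UPH comparison above boils down to a simple rate: total successful picks divided by total elapsed hours, aggregated over many runs. Here is a minimal sketch of that calculation; the `Run` fields and example numbers are illustrative assumptions, not PhAIL's actual telemetry schema.

```python
from dataclasses import dataclass

@dataclass
class Run:
    # Hypothetical per-run telemetry fields (not PhAIL's real schema).
    picks_completed: int   # successful bin-to-bin transfers in this run
    duration_s: float      # wall-clock time of the run in seconds

def uph(runs: list[Run]) -> float:
    """Units per hour, aggregated across runs (not a per-run average)."""
    total_picks = sum(r.picks_completed for r in runs)
    total_hours = sum(r.duration_s for r in runs) / 3600.0
    return total_picks / total_hours

# Example: two made-up runs totaling 14 picks in 880 seconds.
sample = [Run(picks_completed=8, duration_s=450.0),
          Run(picks_completed=6, duration_s=430.0)]
print(f"{uph(sample):.1f} UPH")
```

Aggregating picks and time before dividing (rather than averaging per-run rates) is the standard way to compute a throughput metric, so long runs and short runs are weighted by the time they actually took.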

Discussion Highlights (5 comments)

anna_pozniak

I'm curious! What other models are you planning to add to the leaderboard?

akshaisarathy

If I understand correctly, this is about benchmarking robot models. Do you have a robot to do the benchmarking or is it all simulation?

vladimir_gor

I'm a big fan of benchmarks, and now we finally have one to evaluate models on physical tasks. It will be interesting to see how fast this gap narrows.

chfritz

This is absolutely awesome. Thanks for sharing! I would love to chat more with you. For context: we make a remote teleoperation solution for robotics. It's mostly used for mobile robots, but we've been getting a lot of inquiries regarding teleoperation for manipulation, so I've been learning more about this, in particular regarding the question of speed. I really appreciate these results!

apetrovicheva

This is amazing. Loved watching the videos with real-world attempts. Finally a real benchmark vs. polished teleoperated Twitter videos. It shows the real state of a super important industry, and there's a lot of work to do.
