Show HN: PhAIL – Real-robot benchmark for AI models
I built this because I couldn't find honest numbers on how well VLA models [1] actually work on commercial tasks. I come from search ranking at Google, where you measure everything; in robotics, nobody seemed to know.

PhAIL runs four models (OpenPI/pi0.5, GR00T, ACT, SmolVLA) on bin-to-bin order picking, one of the most common warehouse operations. Same robot (Franka FR3), same objects, hundreds of blind runs: the operator doesn't know which model is running.

Best model: 64 UPH (units picked per hour). Human teleoperating the same robot: 330. Human working by hand: 1,300+.

Everything is public: every run with synced video and telemetry, the fine-tuning dataset, and the training scripts. The leaderboard is open for submissions. Happy to answer questions about the methodology, the models, or what we observed.

[1] Vision-Language-Action: https://en.wikipedia.org/wiki/Vision-language-action_model
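For anyone wondering how the UPH numbers are derived: it's just successfully transferred units divided by wall-clock hours, aggregated over runs. A minimal sketch in Python, assuming a hypothetical run log with per-run pick counts and ISO-8601 timestamps; the actual PhAIL telemetry schema may differ:

    # Sketch of UPH (units per hour) from run telemetry.
    # Field names ("picks", "start", "end") are hypothetical.
    from datetime import datetime

    def units_per_hour(runs):
        """Aggregate UPH across runs: total picks / total wall-clock hours."""
        total_picks = sum(r["picks"] for r in runs)
        total_hours = sum(
            (datetime.fromisoformat(r["end"])
             - datetime.fromisoformat(r["start"])).total_seconds()
            for r in runs
        ) / 3600.0
        return total_picks / total_hours if total_hours > 0 else 0.0

    runs = [
        {"start": "2024-06-01T10:00:00", "end": "2024-06-01T10:30:00", "picks": 32},
        {"start": "2024-06-01T11:00:00", "end": "2024-06-01T11:30:00", "picks": 32},
    ]
    print(f"{units_per_hour(runs):.0f} UPH")  # -> 64 UPH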
Discussion Highlights (5 comments)
anna_pozniak
I'm curious! What other models are you planning to add to the leaderboard?
akshaisarathy
If I understand correctly, this is about benchmarking robot models. Do you have a robot to do the benchmarking, or is it all simulation?
vladimir_gor
I'm a big fan of benchmarks, and now we finally have one for evaluating models on physical tasks. It will be interesting to see how fast this gap narrows.
chfritz
This is absolutely awesome. Thanks for sharing! I would love to chat more with you. For context: we make a remote teleoperation solution for robotics. It's mostly used for mobile robots, but we've been getting a lot of inquiries about teleoperation for manipulation, so I've been learning more about this space, in particular the question of speed. I really appreciate these results!
apetrovicheva
This is amazing. Loved watching the videos of real-world attempts. Finally, a real benchmark instead of polished, teleoperated Twitter videos. It shows the real state of a super important industry, and there's a lot of work to do.