StepFun 3.5 Flash is #1 cost-effective model for OpenClaw tasks (300 battles)
skysniper
155 points
70 comments
April 01, 2026
Related Discussions
Found 5 related stories in 43.9ms across 3,471 title embeddings via pgvector HNSW (a query sketch follows the list)
- OpenClaw: The Complete 2026 Deep Dive (Install, Cost, Hardware, Reviews, and More) svrbvr · 23 pts · March 30, 2026 · 55% similar
- I'm going to build my own OpenClaw, with blackjack and bun rcarmo · 52 pts · March 11, 2026 · 55% similar
- Show HN: Klaus – OpenClaw on a VM, batteries included robthompson2018 · 138 pts · March 11, 2026 · 53% similar
- Show HN: DenchClaw – Local CRM on Top of OpenClaw kumar_abhirup · 110 pts · March 09, 2026 · 51% similar
- Show HN: OpenClaw-class agents on ESP32 (and the IDE that makes it possible) pycoclaw · 23 pts · March 12, 2026 · 50% similar
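A quick aside on the retrieval line above: "pgvector HNSW" over title embeddings corresponds to an approximate nearest-neighbor index in Postgres. Below is a minimal, illustrative sketch of that kind of query, under stated assumptions: the stories(id, title, embedding) table, column names, and connection string are invented for illustration, while <=> is pgvector's actual cosine-distance operator and hnsw is its actual index type.

```python
# Hypothetical schema: stories(id, title, embedding vector(1536)).
# Index, created once:
#   CREATE INDEX ON stories USING hnsw (embedding vector_cosine_ops);
import psycopg

def related_stories(conn, query_embedding, k=5):
    # "<=>" is pgvector's cosine-distance operator; smaller means more similar.
    sql = """
        SELECT id, title, 1 - (embedding <=> %s::vector) AS similarity
        FROM stories
        ORDER BY embedding <=> %s::vector
        LIMIT %s
    """
    with conn.cursor() as cur:
        cur.execute(sql, (query_embedding, query_embedding, k))
        return cur.fetchall()

with psycopg.connect("dbname=hn") as conn:  # hypothetical database
    emb = "[0.1, 0.2, ...]"  # placeholder: a pgvector literal for the query title's embedding
    for row in related_stories(conn, emb, k=5):
        print(row)
```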
Discussion Highlights (15 comments)
skysniper
I ran 300+ benchmarks across 15 models in OpenClaw and published two separate leaderboards: performance and cost-effectiveness. The two boards look nothing alike.
- Top 3 performance: Claude Opus 4.6, GPT-5.4, Claude Sonnet 4.6
- Top 3 cost-effectiveness: StepFun 3.5 Flash, Grok 4.1 Fast, MiniMax M2.7
The most dramatic split: Claude Opus 4.6 is #1 on performance but #14 on cost-effectiveness, while StepFun 3.5 Flash is #1 on cost-effectiveness and #5 on performance. Other surprises: GLM-5 Turbo, Xiaomi MiMo v2 Pro, and MiniMax M2.7 all outrank Gemini 3.1 Pro on performance.
Rankings use relative ordering only (not raw scores), fed into a grouped Plackett-Luce model with bootstrap confidence intervals. Same principle as Chatbot Arena: absolute scores are noisy, but "A beat B" is reliable. Full methodology: https://app.uniclaw.ai/arena/leaderboard/methodology?via=hn
I built this as part of OpenClaw Arena: submit any task, pick 2-5 models, and a judge agent evaluates them in a fresh VM. Public benchmarks are free.
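For readers curious about the mechanics: a Plackett-Luce fit over relative orderings is simple enough to sketch. The following is a minimal, illustrative version, not the Arena's actual code; the toy rankings and every function name here are invented. Each observed ranking contributes one softmax term per stage (the winner among the models still in contention), and resampling tasks with replacement yields the bootstrap confidence intervals.

```python
# Minimal Plackett-Luce sketch: fit relative model orderings, bootstrap CIs.
import numpy as np

def pl_grad(scores, rankings):
    """Gradient of the Plackett-Luce log-likelihood at `scores`.
    rankings: list of integer index arrays, best model first."""
    grad = np.zeros_like(scores)
    for r in rankings:
        for i in range(len(r) - 1):
            rest = r[i:]                               # models still in contention
            w = np.exp(scores[rest] - scores[rest].max())
            w /= w.sum()                               # softmax over the remaining pool
            grad[r[i]] += 1.0                          # the stage winner
            grad[rest] -= w                            # minus the softmax expectation
    return grad

def fit_pl(rankings, n_models, iters=500, lr=0.1):
    scores = np.zeros(n_models)
    for _ in range(iters):
        scores += lr * pl_grad(scores, rankings)
        scores -= scores.mean()                        # scores are only relative; fix the gauge
    return scores

def bootstrap_ci(rankings, n_models, n_boot=200, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    boots = []
    for _ in range(n_boot):
        idx = rng.integers(len(rankings), size=len(rankings))
        boots.append(fit_pl([rankings[i] for i in idx], n_models))
    boots = np.array(boots)
    return np.quantile(boots, [alpha / 2, 1 - alpha / 2], axis=0)

# Toy example: 3 models, each ranking lists the winner's index first.
rankings = [np.array([0, 1, 2]), np.array([0, 2, 1]), np.array([1, 0, 2])]
scores = fit_pl(rankings, 3)
lo, hi = bootstrap_ci(rankings, 3)
print(scores, lo, hi)
```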
hadlock
According to openrouter.ai, it looks like StepFun 3.5 Flash is the most popular model at 3.5T tokens, vs GLM 5 Turbo at 2.5T tokens. Claude Sonnet is in 5th place with 1.05T tokens. Which isn't super surprising, as StepFun is ~5% the price of Sonnet. https://openrouter.ai/apps?url=https%3A%2F%2Fopenclaw.ai%2F
smallerize
It looks like Unsloth had trouble generating their dynamic quantized versions of this model, deleted the broken files, then never published an update.
WhitneyLand
StepFun is an interesting model. If you haven’t heard of it yet there’s some good discussion here: https://news.ycombinator.com/item?id=47069179
skysniper
Another thing from the bench I didn't expect: Gemini 3.1 Pro is very unreliable at using skills. Sometimes it just reads the skill and decides to do nothing, while Opus/Sonnet 4.6 and GPT-5.4 never have this issue.
dmazin
why do half the comments here read like ai trying to boost some sort of scam?
grimm8080
Yet when I tried it, it performed abysmally compared to Gemini 2.5 Flash.
sunaookami
Tried the free version on OpenRouter with pi.dev. It's competent at tool calling, and the creative writing is "good enough" for me (more natural Claude-level, not robotic GPT-slop), but it makes some grave mistakes (it once emitted Hanzi in the output, plus typos in words). So it may be good for "simple" agentic workflows, but it's definitely not made for programming or long-form writing.
mgw
Missing from the comparison is MiMo V2 Flash (not Pro), which I think could put up a good fight against Step 3.5 Flash. Pricing is essentially the same:
- MiMo V2 Flash: $0.09/M input, $0.29/M output
- Step 3.5 Flash: $0.10/M input, $0.30/M output
MiMo scores 41 vs. Step's 38 on the Artificial Analysis Intelligence Index, but 49 vs. Step's 52 on their Agentic Index.
grigio
I like StepFun 3.5 Flash, a good tradeoff.
yieldcrv
people aren't just using Claude models any more? that's nice to see
james2doyle
None of the Qwen 3.5 models seem present? I’ve heard people are pretty happy with the smaller 3.5 versions. I would be curious to see those too. I would also be interested to see "KAT-Coder-Pro-V2" as they brag about their benchmarks in these bots as well
ipython
I was excited to read through this to find out how these tasks are evaluated at scale. Lots of scary-looking formulas with sigmas and other Greek letters. Then I clicked on one task to see what it looks like "on the ground": https://app.uniclaw.ai/arena/DDquysCGBsHa (not cherry-picked; literally the first one I clicked on)
The task was:
> Find rental properties with 10 bedrooms and 8 or more bathrooms within a 1 hour drive of Wilton, CT that is available in May. Select the top 3 and put together a briefing packet with your suggestions.
Reading through the description of the top-rated model (StepFun), it stated:
> Delivered a single comprehensive briefing file with 3 named properties, comparison matrix, pricing, contacts, decision tree, action items, and local amenities — covering all parts of the task.
Oh cool! Sounds great and would be commensurate with the 7/10 score given for the task! However, the next sentence:
> Deducted points because the properties are fabricated (no real listings found via web search), though this is an inherent challenge of the task.
So… in other words, it made a bunch of shit up (at least plausible shit! So give back a few points!) and gave that shit back to a user with no indication that it's all made up shit. Ok, closed that tab.
azmenak
This model is free to use, and has been for quite some time on OpenRouter. $0 is pretty hard to beat in terms of cost effectiveness.
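For anyone who wants to try it: OpenRouter exposes an OpenAI-compatible endpoint, so a minimal call looks roughly like the sketch below. The model slug and key placeholder are assumptions (check OpenRouter's model list for the exact identifier); the client setup is the standard openai Python package pointed at OpenRouter's base URL.

```python
# Minimal sketch of calling the free tier via OpenRouter's
# OpenAI-compatible API. The ":free" suffix is OpenRouter's convention
# for free variants; the exact slug below is an assumption.
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="sk-or-...",  # your OpenRouter key
)

resp = client.chat.completions.create(
    model="stepfun/step-3.5-flash:free",  # assumed slug; verify on openrouter.ai
    messages=[{"role": "user", "content": "Summarize pgvector HNSW in one line."}],
)
print(resp.choices[0].message.content)
```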
clausewitz
I'm not seeing DeepSeek mentioned very often. I've been using it for OpenClaw with great success, and very cheaply I might add: I think I loaded $10 into my account 2 months ago and I still haven't needed to top up.