StepFun 3.5 Flash is #1 cost-effective model for OpenClaw tasks (300 battles)
skysniper
155 points
70 comments
April 01, 2026
Related Discussions
Found 5 related stories in 43.9ms across 3,471 title embeddings via pgvector HNSW (a query sketch follows the list)
- OpenClaw: The Complete 2026 Deep Dive (Install, Cost, Hardware, Reviews, and More) svrbvr · 23 pts · March 30, 2026 · 55% similar
- I'm going to build my own OpenClaw, with blackjack and bun rcarmo · 52 pts · March 11, 2026 · 55% similar
- Show HN: Klaus – OpenClaw on a VM, batteries included robthompson2018 · 138 pts · March 11, 2026 · 53% similar
- Show HN: DenchClaw – Local CRM on Top of OpenClaw kumar_abhirup · 110 pts · March 09, 2026 · 51% similar
- Show HN: OpenClaw-class agents on ESP32 (and the IDE that makes it possible) pycoclaw · 23 pts · March 12, 2026 · 50% similar
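A quick aside on the retrieval line above: "pgvector HNSW" over title embeddings corresponds to an approximate nearest-neighbor index in Postgres. Below is a minimal, illustrative sketch of that kind of query, under stated assumptions: the stories(id, title, embedding) table, column names, and connection string are invented for illustration, while <=> is pgvector's actual cosine-distance operator and hnsw is its actual index type.

```python
# Hypothetical schema: stories(id, title, embedding vector(1536)).
# Index, created once:
#   CREATE INDEX ON stories USING hnsw (embedding vector_cosine_ops);
import psycopg

def related_stories(conn, query_embedding, k=5):
    # "<=>" is pgvector's cosine-distance operator; smaller means more similar.
    sql = """
        SELECT id, title, 1 - (embedding <=> %s::vector) AS similarity
        FROM stories
        ORDER BY embedding <=> %s::vector
        LIMIT %s
    """
    with conn.cursor() as cur:
        cur.execute(sql, (query_embedding, query_embedding, k))
        return cur.fetchall()

with psycopg.connect("dbname=hn") as conn:  # hypothetical database
    emb = "[0.1, 0.2, ...]"  # placeholder: a pgvector literal for the query title's embedding
    for row in related_stories(conn, emb, k=5):
        print(row)
```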
Discussion Highlights (15 comments)
skysniper
I ran 300+ benchmarks across 15 models in OpenClaw and published two separate leaderboards: performance and cost-effectiveness. The two boards look nothing alike.
- Top 3 performance: Claude Opus 4.6, GPT-5.4, Claude Sonnet 4.6
- Top 3 cost-effectiveness: StepFun 3.5 Flash, Grok 4.1 Fast, MiniMax M2.7
The most dramatic split: Claude Opus 4.6 is #1 on performance but #14 on cost-effectiveness, while StepFun 3.5 Flash is #1 on cost-effectiveness and #5 on performance. Other surprises: GLM-5 Turbo, Xiaomi MiMo v2 Pro, and MiniMax M2.7 all outrank Gemini 3.1 Pro on performance.
Rankings use relative ordering only (not raw scores), fed into a grouped Plackett-Luce model with bootstrap confidence intervals. Same principle as Chatbot Arena: absolute scores are noisy, but "A beat B" is reliable. Full methodology: https://app.uniclaw.ai/arena/leaderboard/methodology?via=hn
I built this as part of OpenClaw Arena: submit any task, pick 2-5 models, and a judge agent evaluates them in a fresh VM. Public benchmarks are free.
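For readers curious about the mechanics: a Plackett-Luce fit over relative orderings is simple enough to sketch. The following is a minimal, illustrative version, not the Arena's actual code; the toy rankings and every function name here are invented. Each observed ranking contributes one softmax term per stage (the winner among the models still in contention), and resampling tasks with replacement yields the bootstrap confidence intervals.

```python
# Minimal Plackett-Luce sketch: fit relative model orderings, bootstrap CIs.
import numpy as np

def pl_grad(scores, rankings):
    """Gradient of the Plackett-Luce log-likelihood at `scores`.
    rankings: list of integer index arrays, best model first."""
    grad = np.zeros_like(scores)
    for r in rankings:
        for i in range(len(r) - 1):
            rest = r[i:]                               # models still in contention
            w = np.exp(scores[rest] - scores[rest].max())
            w /= w.sum()                               # softmax over the remaining pool
            grad[r[i]] += 1.0                          # the stage winner
            grad[rest] -= w                            # minus the softmax expectation
    return grad

def fit_pl(rankings, n_models, iters=500, lr=0.1):
    scores = np.zeros(n_models)
    for _ in range(iters):
        scores += lr * pl_grad(scores, rankings)
        scores -= scores.mean()                        # scores are only relative; fix the gauge
    return scores

def bootstrap_ci(rankings, n_models, n_boot=200, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    boots = []
    for _ in range(n_boot):
        idx = rng.integers(len(rankings), size=len(rankings))
        boots.append(fit_pl([rankings[i] for i in idx], n_models))
    boots = np.array(boots)
    return np.quantile(boots, [alpha / 2, 1 - alpha / 2], axis=0)

# Toy example: 3 models, each ranking lists the winner's index first.
rankings = [np.array([0, 1, 2]), np.array([0, 2, 1]), np.array([1, 0, 2])]
scores = fit_pl(rankings, 3)
lo, hi = bootstrap_ci(rankings, 3)
print(scores, lo, hi)
```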
hadlock
According to openrouter.ai, it looks like StepFun 3.5 Flash is the most popular model at 3.5T tokens, vs GLM 5 Turbo at 2.5T tokens. Claude Sonnet is in 5th place with 1.05T tokens. Which isn't super surprising, as StepFun is ~5% the price of Sonnet. https://openrouter.ai/apps?url=https%3A%2F%2Fopenclaw.ai%2F
smallerize
It looks like Unsloth had trouble generating their dynamic quantized versions of this model, deleted the broken files, then never published an update.
WhitneyLand
StepFun is an interesting model. If you haven’t heard of it yet there’s some good discussion here: https://news.ycombinator.com/item?id=47069179
skysniper
Another thing from the bench I didn't expect: Gemini 3.1 Pro is very unreliable at using skills. Sometimes it just reads the skill and decides to do nothing, while Opus/Sonnet 4.6 and GPT-5.4 never have this issue.
dmazin
why do half the comments here read like ai trying to boost some sort of scam?
grimm8080
Yet when I tried it, it performed abysmally compared to Gemini 2.5 Flash.
sunaookami
Tried the free version on OpenRouter with pi.dev. It's competent at tool calling, and the creative writing is "good enough" for me (more natural Claude-level, not robotic GPT-slop), but it makes some grave mistakes (it once emitted Hanzi in the output, plus typos in words). So it may be good for "simple" agentic workflows, but it's definitely not made for programming or long-form writing.
mgw
Missing from the comparison is MiMo V2 Flash (not Pro), which I think could put up a good fight against Step 3.5 Flash. Pricing is essentially the same:
- MiMo V2 Flash: $0.09/M input, $0.29/M output
- Step 3.5 Flash: $0.10/M input, $0.30/M output
MiMo scores 41 vs. Step's 38 on the Artificial Analysis Intelligence Index, but 49 vs. Step's 52 on their Agentic Index.
grigio
I like StepFun 3.5 Flash, a good tradeoff.
yieldcrv
people aren't just using Claude models any more? that's nice to see
james2doyle
None of the Qwen 3.5 models seem present? I’ve heard people are pretty happy with the smaller 3.5 versions. I would be curious to see those too. I would also be interested to see "KAT-Coder-Pro-V2" as they brag about their benchmarks in these bots as well
ipython
I was excited to read through this to find out how these tasks are evaluated at scale. Lots of scary-looking formulas with sigmas and other Greek letters. Then I clicked on one task to see what it looks like "on the ground": https://app.uniclaw.ai/arena/DDquysCGBsHa (not cherry-picked; literally the first one I clicked on)
The task was:
> Find rental properties with 10 bedrooms and 8 or more bathrooms within a 1 hour drive of Wilton, CT that is available in May. Select the top 3 and put together a briefing packet with your suggestions.
Reading through the description of the top-rated model (StepFun), it stated:
> Delivered a single comprehensive briefing file with 3 named properties, comparison matrix, pricing, contacts, decision tree, action items, and local amenities — covering all parts of the task.
Oh cool! Sounds great and would be commensurate with the 7/10 score given for the task! However, the next sentence:
> Deducted points because the properties are fabricated (no real listings found via web search), though this is an inherent challenge of the task.
So… in other words, it made a bunch of shit up (at least plausible shit! So give back a few points!) and gave that shit back to a user with no indication that it's all made up shit. Ok, closed that tab.
azmenak
This model is free to use, and has been for quite some time on OpenRouter. $0 is pretty hard to beat in terms of cost effectiveness.
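For anyone who wants to try it: OpenRouter exposes an OpenAI-compatible endpoint, so a minimal call looks roughly like the sketch below. The model slug and key placeholder are assumptions (check OpenRouter's model list for the exact identifier); the client setup is the standard openai Python package pointed at OpenRouter's base URL.

```python
# Minimal sketch of calling the free tier via OpenRouter's
# OpenAI-compatible API. The ":free" suffix is OpenRouter's convention
# for free variants; the exact slug below is an assumption.
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="sk-or-...",  # your OpenRouter key
)

resp = client.chat.completions.create(
    model="stepfun/step-3.5-flash:free",  # assumed slug; verify on openrouter.ai
    messages=[{"role": "user", "content": "Summarize pgvector HNSW in one line."}],
)
print(resp.choices[0].message.content)
```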
clausewitz
I'm not seeing DeepSeek mentioned very often. I've been using it for OpenClaw with great success, and very cheaply I might add: I think I loaded $10 into my account 2 months ago and I still haven't needed to top up.