Computer Use is 45x more expensive than structured APIs

palashawas 366 points 212 comments May 05, 2026

Discussion Highlights (20 comments)

taormina

The interface designed for humans is poor for AI needs? And the interface designed for programmatic use is easier for the AI to use? In other news, the sky is blue and water is wet.

sudb

I'm pretty unsurprised that the vision agent did worse. I'd be interested in a comparison between the different tools that now exist to let LLMs drive browsers (e.g. vercel's agent-browser, the relatively new dev-browser[1], etc.) There are usecases where the vision agent is the more obvious, or only choice though, e.g. prorprietary/locked-down desktop apps that lack an automation layer. 1. https://github.com/SawyerHood/dev-browser

cjbarber

I think of computer use as like last mile delivery. APIs and bash and such are the efficient logistics networks. Both have different benefits. Obviously, use the efficient methods when you can.

svnt

> This is not a model problem. The vision agent was reasoning about a rendered page and had no signal that the page wasn't showing everything. > To make the comparison apples-to-apples, we rewrote the vision prompt as an explicit UI walkthrough, naming the sidebar items, tabs, and form fields the agent should interact with at each step. Fourteen numbered instructions covering the navigation the agent had failed to figure out on its own. This is a model problem, though. Because the model failed to understand it could scroll, you forced it to consume multiples of the tokens. Could you come up with an alternative here? Do you know what the vision model was trained on? Because often people see “vision model” and think “human-level GUI navigator” when afaik the latter has yet to be built.

aurareturn

In an agentic world, the OS needs to be completely rethought. For example, every single app functionality should be exposable via an API while remaining human friendly. I think OpenAI designing their own phone is the next logical step. I hope they succeed which should bring major competition to Apple and Android.

moralestapia

This is obvious. The problem is that not everything has an API, while everything has a human-oriented UI.

Havoc

Isn't it possible to somehow wire this into the window manager? Wayland or whatever. Have it speak the native window lang rather than crunch the pixels? At least for the majority. I can see the appeal in pixel route given universality but wow that seems ugly on efficiency

_boffin_

What i don't understand about "computer use" is why they're not just grabbing the window handles and storing them to determine what should be clicked after the first few iterations of using that a specific application. if a new case / path / whatever is found, drop back to screen grabbing and bounding boxes and then figure the handles that are there and store after. idk.. not really thought out too much, but has to be better

faangguyindia

I saw Codex was screenshotting, then clicking around. I just stopped it and never used that again. Using CLI tools is much faster and token-efficient. I developed ten apps in the last two months. One reached 10,000+ monthly active users. I ask Codex to generate SVG line by line and backtrack edit, ask it to use Inkscape to generate icons, etc... I developed all this on $20 codex sub.

dist-epoch

It doesn't matter. Electron uses 10x more RAM than regular apps. But it's so convenient. Python is 100x slower than C. It's in the top 3 of languages now. Worse but more convenient always wins.

gowld

Confusing title? "Computer Use" is actually "Browser vision"?

antves

I think one main point is that not all "computer use" is the same, the harness and agentic experience matters a lot. A poorly designed API experience can actually be _less_ efficient than a well designed browser or computer use experience In particular, the vision-based approach used in the evaluation has clear limitations with regard to efficiency due to its nature (small observation window, heterogeneous modality) At Smooth we use an hybrid DOM/vision approach and we index very strongly on small models. An interesting fact is that UIs are generally designed to minimize ambiguity and supply all and only the necessary context as token-efficient as possible, and the UX is cabled up to abstract the APIs in well-understood interface patterns e.g. dropdowns or autocompletes. This makes navigation easier and that's why small models can do it, which is another dimension that must be considered We typically recommend using APIs/MCP where available and well designed, but it's genuinely surprising how token-efficient agentic browser navigation can actually be

overgard

I've been thinking of things I'd want an agent for recently. The problem is, everything I think of is something that requires using quite a few different websites, saving a lot of data securely, and working with a lot of sensitive accounts (my email, etc.) The problem is, all the tasks are essentially: a) things agents probably just can't do, and b) things that absolutely cannot afford to be hallucinated or otherwise fucked up. So far the tasks I've thought of: - Taxes. So it needs a lot of sensitive information to get W2's. Since I have to look up a lot of this stuff in the physical world anyway, it's not like I can just let it run wild. - Background check for a new job. It took me 3 hrs to fill out one of them (mostly because the website was THAT bad). Being myself, I already was making mistakes just forgetting things like move in dates from 10 years ago, and having to do a lot of searching in my email for random documents. No way I'm trusting an agent with this. - Setting up an LLC. Nope nope nope. There's a lot of annoying work involved with this, but I'm not trusting an LLM to do this. Anyway, I guess my point is that even if an LLM was good at using my computer (so far, it seems like it wouldn't be), the kind of things I'd want an agent for are things that an LLM can't be trusted with.

rootcage

The best use cases I've seen for computer/browser use is for legacy SaaS/Software. For example, hotels use archaic Property Management Systems (PMS) and they're required by corporate to use it and pay for it. These companies can barely keep the product alive, they definitely aren't incentivized to maintain an API. In such a case browser use agent seems to be the best (only) way.

merlindru

I'm building something that fixes this exact problem[1]. The landing page doesn't advertise it yet, but essentially, I give agents a small set of tools to explore apps' surfaces, and then an API over common macOS functions, especially those related to accessibility. The agent explores the app, then writes a repeatable workflow for it. Then it can run that workflow through CLI: `invoke chrome pinTab` Why accessibility? Well, turns out that it's just a good DOM in general. It's structure for apps. Not all apps implement it perfectly, but enough do to make it wildly useful. [1] https://getinvoke.com - note that the landing page is targeted towards creatives right now and doesn't talk about this use case yet

janalsncm

Wall clock time tells me everything I need to know. The vision model took almost 20 minutes to do the thing that Sonnet did in 20 seconds. The only reason you wouldn’t choose an API is if it wasn’t viable.

jacktu

Totally agree. I’ve been building an AI visual tool recently and experimented with both approaches. The latency and c ost of generic "agentic" browser use are absolute dealbreakers for real-time consumer apps right now. Structured APIs (even just chained LLM calls with strict JSON schemas) are not only 40x cheaper, but more importantly, they are deterministic enough to actually build a stable product on top of. Computer use is an amazing demo, but structured APIs are what pay the server bills.

zephen

I find this extremely surprising. When you think of everything it takes for an AI to use what the article calls a "vision agent" then it seems as if using a purpose-made API ought to be MANY orders of magnitude faster.

2001zhaozhao

I have only found Computer Use useful for GUI app local debugging. Presumably it will also be useful for getting around protections for external apps that don't want AI to interact with them, or for interfacing with legacy apps or those built without AI in mind. I don't think any new app should ever be specifically designed for AI to interact with them through computer use

ai_fry_ur_brain

Its funny watching the slow mean reversion back to more deterministic tooling.

Computer Use is 45x more expensive than structured APIs

Discussion Highlights (20 comments)

Related Discussions