Why LLMs haven't solved the browser yet

If you've watched a polished demo of Claude or ChatGPT's browser agent, you've probably had the same thought we did: this is going to replace half of what ops teams do today.

Then you tried it on something real.

The benchmark gap

Public benchmarks are kinder to agents than production is, and even they tell a sobering story. State-of-the-art browser agents on WebArena and VisualWebArena still complete well under 60% of tasks, and those tasks were designed to be tractable. On real, messy, login-gated enterprise software, the internal numbers we've seen are worse, often by a lot.

That's not a critique of the labs. It's a reflection of what the browser actually is: an environment built for humans, by humans, with no expectation that anything other than a human would ever click around in it.

Where generic agents fall down

Three failure modes show up over and over again in the work we do:

Brittleness. A vendor pushes a small CSS change on Tuesday afternoon, and a step that worked yesterday now selects the wrong element. The agent doesn't realize it picked the wrong row until three actions later.
Token economics. Letting a generalist model drive every interaction — read the DOM, decide, click, repeat — is expensive. For a workflow that runs hundreds of times a day, the math stops working quickly.
No audit trail. When something goes wrong, you need to know exactly what the agent saw, what it decided, and why. Most off-the-shelf setups don't give you that.
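
To make the token economics point concrete, here is a back-of-the-envelope sketch. Every number in it is a made-up placeholder, not a measurement or a real price list: the run count, steps per run, tokens per step, and per-million-token prices are all assumptions you would swap for your own.

```python
def daily_model_cost_usd(
    runs_per_day: int,
    steps_per_run: int,
    input_tokens_per_step: int,
    output_tokens_per_step: int,
    usd_per_million_input: float,
    usd_per_million_output: float,
) -> float:
    """Estimate daily spend when a model is called on every step of every run."""
    # Cost of a single read-decide-click step, in dollars.
    per_step = (
        input_tokens_per_step * usd_per_million_input
        + output_tokens_per_step * usd_per_million_output
    ) / 1_000_000
    return runs_per_day * steps_per_run * per_step


if __name__ == "__main__":
    # Hypothetical inputs: 300 runs a day, 20 steps each, a large DOM
    # snapshot going in and a short action coming out on every step.
    cost = daily_model_cost_usd(
        runs_per_day=300,
        steps_per_run=20,
        input_tokens_per_step=8_000,
        output_tokens_per_step=200,
        usd_per_million_input=3.0,    # placeholder price, not a real quote
        usd_per_million_output=15.0,  # placeholder price, not a real quote
    )
    print(f"Estimated daily cost: ${cost:.2f}")  # roughly $162 with these inputs
```

The useful part is the shape of the formula rather than the figure: cost scales with runs times steps, so every step you can move off the model comes straight out of the bill.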

What works today

The teams getting real ROI from browser automation in 2026 aren't running pure agents. They're running hybrid workflows: deterministic Playwright or Stagehand for the parts of the flow that are stable, a vision or LLM call for the parts that genuinely require judgment, and a clean audit trail around the whole thing.
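
As a rough illustration of that hybrid shape, the sketch below tries a deterministic step first, falls back to a judgment call only when the deterministic path reports it cannot proceed, and records which path ran. It is a minimal outline under stated assumptions, not a description of any particular product: Step, AuditEntry, and run_workflow are hypothetical names, and the callables are stand-ins for what would in practice wrap a Playwright or Stagehand action and a model call.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Step:
    name: str
    # Scripted path: returns a result, or None if it cannot proceed
    # (for example, the expected element was not found).
    deterministic: Callable[[], Optional[str]]
    # Model-backed fallback, used only when the scripted path gives up.
    judgment: Callable[[], str]


@dataclass
class AuditEntry:
    step: str
    path: str        # "deterministic" or "judgment"
    result: str
    timestamp: float


def run_workflow(steps: List[Step], audit: List[AuditEntry]) -> List[str]:
    """Run each step, preferring the scripted path, and log every decision."""
    results = []
    for step in steps:
        result = step.deterministic()
        path = "deterministic"
        if result is None:
            # Scripted path could not proceed; pay for a judgment call.
            result = step.judgment()
            path = "judgment"
        audit.append(AuditEntry(step.name, path, result, time.time()))
        results.append(result)
    return results


if __name__ == "__main__":
    audit_log: List[AuditEntry] = []
    demo_steps = [
        # Stable part of the flow: the scripted action succeeds.
        Step("open_invoice_page", lambda: "opened", lambda: "opened-by-model"),
        # Unstable part: the scripted selector fails, so judgment takes over.
        Step("pick_matching_row", lambda: None, lambda: "row-3"),
    ]
    run_workflow(demo_steps, audit_log)
    for entry in audit_log:
        print(entry.step, entry.path, entry.result)
```

A real audit trail would also capture what the agent saw (a screenshot or DOM snapshot) and why it decided what it did; the entry here only shows where that record would live.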

That's the gap we fill. The agents will get better. Until they do — and probably for years after they do — the boring, hybrid, observable version is what actually ships.

If you have a browser workflow that keeps breaking, we'd be happy to take a look.