The strongest agent finished under 4% of real work last week

A benchmark dropped last week showing that the strongest AI agent completes under 4% of real workflows end-to-end. The same week, major vendors repriced and repositioned their software as if agents were already running your back office. Today's issue looks at what that gap means for every agent contract you evaluate this quarter.

In today’s issue:

Main story: The strongest agent finished under 4% of real work last week
AI Headlines since Friday

Last Friday, the CEO of Mintlify, a documentation startup used by Anthropic, Cursor, and Resend, announced that the company was eliminating seat-based pricing entirely. The reason he gave: agents, not humans, are now the primary users of the software, and seats no longer describe who is doing the work.

The same week, Greg Brockman, president of OpenAI, posted "run Codex on every commit" as a recommended default for software teams. OpenAI stood up a deployment company and acquired Tomoro to move models into production environments. UiPath spent its developer conference repositioning the entire RPA category around coding agents. And xAI launched Grok Build as a direct challenger to Claude Code.

If you spent the week reading vendor communication, you would walk away convinced that agents are roughly two product cycles from running large portions of your back office. Then a benchmark dropped.

Setup

SaaS-Bench, released on May 16 by researchers from UniPat AI, Peking University, and the University of Hong Kong, evaluates whether the current generation of computer-using agents can complete real professional workflows inside actual SaaS environments. The benchmark deliberately avoids toy environments and synthetic tasks. It uses 23 deployable SaaS systems across six professional domains (CRM, finance, operations, customer support, and two others) and runs 106 long-horizon tasks that average more than 100 interaction steps. These are the kinds of workflows a CRM admin, an FP&A analyst, or a customer support manager actually does in a day.

The headline result: the strongest agent the researchers tested completed fewer than 4% of tasks end-to-end, not the 40% or 14% the vendor marketing would imply. The paper attributes the failures to planning breakdowns, state-tracking errors, loss of context across applications, and an almost total inability to recover from mistakes.
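One way to build intuition for why tasks of 100-plus steps are so unforgiving is a back-of-the-envelope compounding calculation. This is my illustration, not the paper's method: it assumes every step succeeds independently with the same probability and that the agent never recovers from a miss. Real failures are messier than that, and the function name and the sample rates below are hypothetical.

```python
def end_to_end_success(per_step_success: float, steps: int) -> float:
    """Chance of finishing every step in a row.

    Assumes (for illustration only) that steps are independent, share one
    success rate, and that a single failed step sinks the whole task.
    """
    if not 0.0 <= per_step_success <= 1.0:
        raise ValueError("per_step_success must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step_success ** steps


if __name__ == "__main__":
    # Hypothetical per-step reliability levels, each run over a 100-step task.
    for rate in (0.90, 0.97, 0.99, 0.999):
        finished = end_to_end_success(rate, 100)
        print(f"per-step {rate:.1%} -> end-to-end {finished:.1%}")
```

Under these assumptions, an agent that gets 97 out of 100 individual steps right still finishes fewer than 5% of 100-step tasks, which is why error recovery matters as much as raw step accuracy.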

This is the same week the vendors moved their pricing, their tooling, and their go-to-market motions onto the assumption that agents are the new user.

The turn

The easy headline here is that the vendors are wrong, the benchmark is right, and the agent story is a bubble that needs another year. I don't think that's the read.

The vendors are not wrong about the direction. They are wrong about the pace, and the gap between where they are pricing and where the work actually is creates a problem for you specifically, because you are the buyer absorbing the difference.

Vendors price ahead of capability. Buyers absorb the gap.

This is not new behavior. Salesforce sold Einstein as an AI layer years before any operator I have worked with could actually point to a workflow it owned end-to-end. Microsoft sold Copilot per seat for eighteen months before anyone could explain to a CFO what a Copilot seat returned. The pattern is the same: a vendor reads a real signal in the research, decides the product needs to move now, and prices the product against the version of itself it expects to ship in twelve to eighteen months.

When the product matches the pricing, everybody is happy. When it doesn't, the customer pays for the gap.

The reason this week is different is the size of the move. Seat-based pricing was the dominant model for B2B software for two decades. Mintlify abandoning it is small on its own. Mintlify abandoning it in the same week that OpenAI, xAI, and UiPath all repositioned around agent-primary workflows is a category move. When category moves happen, the buying contracts that get signed in the next six months will assume the capability the benchmark says is not there yet.

The agents that work are doing one job, not your job.

Cognition's Devin is reportedly at $445M in annualized revenue. Anthropic has narrowly passed OpenAI in U.S. business adoption, reportedly 34.4% to 32.3% of paid business subscriptions. Cursor with Claude Opus 4.7 leads the new Artificial Analysis Coding Agent Index. The agents that are working commercially right now are doing one thing well: writing code under heavy human supervision, inside an environment built for them, with a developer reviewing every meaningful step.

That is a real product. It is not the product that the pricing implies. The pricing implies an agent that can pick up a CRM workflow with a hundred steps, coordinate it across three applications, recover from its own errors, and finish the job. The SaaS-Bench number says that the product does not exist yet. The product that exists is closer to a very fast junior developer who works only on the code review they are handed and stops when the test fails.

The real cost is not the contract; it is the supervision tax.

Per-seat pricing made AI procurement easy: you counted headcount and multiplied. Agent-priced software does not have that anchor. Vendors are starting to price on tasks completed, tokens consumed, outcomes delivered, or some hybrid, and when the agent finishes under 4% of long-horizon tasks without help, the real cost is the engineer or analyst who watches the agent, catches the failures, fixes the state, and re-runs the job.

The companies I have worked with that are getting real value out of coding agents budget for this explicitly. They assign a "supervision FTE" to the agents they run in production; sometimes one person watches two or three agents. The companies that don't budget for it find out the hard way, when the agent loses state in the middle of a multi-system workflow and a Tuesday becomes a fire drill. When you read an agent vendor's pricing page in the next ninety days, the question to bring is not "is this cheaper than a seat," it is "what does it cost me to supervise this thing well enough that the savings are real."
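If you want to put a number on the supervision tax before a vendor call, a rough sketch like the one below is enough. Every figure in it is a made-up placeholder, and the class and field names are my own. It also assumes the supervisor's time is split evenly across the agents they watch and that supervised tasks do get finished.

```python
from dataclasses import dataclass


@dataclass
class AgentDeployment:
    """Hypothetical inputs for one agent in production (all values illustrative)."""
    vendor_cost_per_month: float        # what the contract bills for this agent
    supervisor_cost_per_month: float    # fully loaded cost of the person watching
    agents_per_supervisor: float        # how many agents that person covers
    tasks_finished_per_month: int       # tasks completed once supervision is included


def monthly_cost(d: AgentDeployment) -> float:
    """Contract price plus this agent's share of the supervisor."""
    if d.agents_per_supervisor <= 0:
        raise ValueError("agents_per_supervisor must be positive")
    return d.vendor_cost_per_month + d.supervisor_cost_per_month / d.agents_per_supervisor


def cost_per_finished_task(d: AgentDeployment) -> float:
    """What each finished task really costs, supervision included."""
    if d.tasks_finished_per_month <= 0:
        raise ValueError("tasks_finished_per_month must be positive")
    return monthly_cost(d) / d.tasks_finished_per_month


# Placeholder numbers: one supervisor covering three agents.
example = AgentDeployment(
    vendor_cost_per_month=2_000.0,
    supervisor_cost_per_month=12_000.0,
    agents_per_supervisor=3,
    tasks_finished_per_month=400,
)

if __name__ == "__main__":
    sticker = example.vendor_cost_per_month / example.tasks_finished_per_month
    print(f"sticker cost per task:    ${sticker:.2f}")
    print(f"supervised cost per task: ${cost_per_finished_task(example):.2f}")
```

With these placeholder inputs, the supervised cost per task comes out at three times the sticker price. The number to compare against your current per-seat cost is the supervised one.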

The agent layer is being built on top of a CRM that does not know about it.

The SaaS-Bench failure modes are revealing in one specific way: the agents fall over hardest on cross-application coordination. A real CRM task touches the email tool, the calendar, the ticketing system, the invoicing tool, and sometimes a spreadsheet, and each of those systems has its own state, its own permissions, and no shared memory of what the agent is trying to do. The architecture problem is going to take years to solve, and it is going to be solved at the platform layer (Salesforce, Microsoft, ServiceNow, the SAPs of the world), not at the agent layer.

For a mid-market operator, the practical version of this is that the agents that will work in your company first are the ones doing single-system, well-bounded jobs (a documentation agent inside Mintlify, a coding agent inside Cursor, a support agent inside one ticketing system), not the ones being sold this week as horizontal coordinators across your entire back office.

AI READY PRO · FREE

A 30-day AI program. In your inbox. Then it ends.

One email every morning at 7 am ET. Each one is a short read and one real thing to try before lunch. By Day 30, you have nine concrete capabilities, including prompting that hits your bar on the first draft, AI workflows that take five hours a week off your plate, and the language to lead the AI conversation at your company.

Built for ops leaders, COOs, chiefs of staff, founders, and team leads at mid-market companies. The four weeks move you from foundations (effective prompting, catching hallucinations, learning AI with AI) to connect (a personal context layer, daily automation, five hours a week back), to build (reusable skills, agent design and debugging), to lead (tool and model evaluation, leading the conversation at work). Thirty mornings, no upsells, no community Slack.

Counterargument

The strongest objection is that benchmarks have systematically underestimated AI for the last three years. GPT-4 was supposed to be the ceiling and was not. The Devin demo was supposed to be smoke and mirrors and is now generating $445M a year. Every time someone publishes a benchmark saying "AI can only do X percent of Y," capability tends to catch up within a year, and the benchmark looks slow.

I want to be honest about this. It is possible the SaaS-Bench number is at 4% today and 35% in nine months, and the history of this market says capability arrives faster than benchmarks predict. But two things matter for the operator's decision in front of you right now. One, your contract is being signed today, not in nine months, and the supervision tax is real today. Two, even if the model number jumps fast, the architectural problem (cross-system state, permissions, audit) does not move at the same speed as the model, because the model can get smarter inside a week, and your CRM cannot.

What this means for you

The vendor messaging this week was not wrong about the direction, so treat it as a forward-looking signal rather than a current-state description.

On every agent contract you evaluate this quarter, ask the vendor in writing what their reference customers have staffed against the agent. If they cannot answer with a headcount and a job title, the deployment is earlier than the sales deck implies. When a vendor moves off per-seat pricing, ask what they are pricing on instead, and if the answer is "outcomes," ask which outcomes count as completed and who decides. For a documentation product like Mintlify, outcomes can be defined cleanly; for a CRM or an ERP, outcomes are a fight worth having before you sign.

The benchmarks that matter most for your operation are not the ones from the model labs; they are the ones from the platforms that own your underlying system of record. When Salesforce, Microsoft, and ServiceNow publish agent benchmarks against their own products (and they will), those are the numbers worth reading, because they tell you what the model can do inside the architecture you actually run.

From the field

I have spent the last several weeks in conversations with mid-market leaders about agents, and there is a pattern that keeps coming up. The smartest operators in those conversations are not asking "where can we deploy an agent." They are asking, "Where do we have a process that is bounded enough to survive one?" That is a quieter question, and a better one.

The unspoken fear underneath these conversations is the one I wrote about on LinkedIn this morning: that AI gets bought, gets installed, and gets quietly stranded. Tools sitting on a shelf are one version of that fear. An agent contract sized for a workflow the agent can't finish is the same fear in a different shape. A 4% completion rate doesn't tell you to wait. It tells you to be specific. Pick the workflow where you can describe the supervision arrangement in one sentence and the rollback in two. Then sign.

Hire secure AI teammates that work 24/7.

Hire pre-built AI teammates. Give your engineers and operators a platform to ship their own AI apps. Stop losing sleep about what is running where.

Clutch is the platform behind both: pre-built agents for the workflows your ops team should automate first, plus the integration plane your team's vibe-coded apps and Claude Code projects plug into. One platform. Real production. Visible and safe by default.

Built for ops, engineering, and security teams that are tired of the shadow-AI surface area inside their own company.

SINCE FRIDAY

A new SaaS benchmark found that the strongest computer-using agent completes under 4% of real workflows end-to-end. The most important counterweight to this week's agent-pricing announcements. The gap between what agents can do in demos and what they finish in production is still wide.
Mintlify eliminated seat-based pricing and said agents, not humans, are now the primary users of its software. First well-known SaaS company to formally retire seats. The question for every renewal conversation this year is whether the pricing or the capability arrives first.
OpenAI launched a Deployment Company and acquired UK consultancy Tomoro on day one. 150 forward-deployed engineers, McKinsey, Bain, and Capgemini as partners, $4B+ committed. OpenAI is openly building the consulting business it used to leave to Accenture and Deloitte.
Anthropic narrowly passed OpenAI in US business adoption for the first time. 34.4% to 32.3% of paid business subscriptions per the Ramp AI Index. First time Anthropic has led any major enterprise scoreboard. Worth watching whether it holds after the June billing split.
Artificial Analysis launched a Coding Agent Index that ranks complete systems, not just models. Cursor plus Claude Opus 4.7 leads the composite score. The cost variance across configurations is the part most vendor pricing decks leave out.
Greg Brockman posted "run Codex on every commit" as a recommended default for engineering teams. OpenAI is positioning Codex as standard infrastructure rather than an optional tool. That framing has procurement implications worth flagging before your next engineering tooling review.

REPLY

Hit reply and tell me, of the agent-priced products your team is currently evaluating, which one has the clearest "this is what success looks like" definition, because the unclear ones are usually the ones worth walking away from this quarter.

FORWARD

If a colleague is sitting in a conversation this week about whether to renew a per-seat contract or move to an agent-priced version of the same product, forward this issue. The supervision-tax question is the one most of those conversations are missing.

IF YOU GOT VALUE FROM THIS ISSUE

Keep going. The next 30 mornings, in your inbox.

The analysis you just read came out of the same operator practice that the 30 Days program teaches. Foundations, then connect, then build, then lead. By Day 30, you own nine concrete capabilities: prompting that hits your bar on the first draft, AI workflows that take five hours a week off your plate, agent skills your team can run, and the language to lead the AI conversation at your company.

Free. Thirty mornings, then it ends. No upsells, no community Slack. Inspired by Hilary Gridley's Couch to 5K for AI.