Engineering
May 26, 2026

A new state of the art for computer use

Pulkit Arya

Overview

Today we’re sharing a new state of the art for computer use agents.

We hold the two highest verified scores on the OSWorld leaderboard to date: 83.6% with Claude Opus 4.7, and 81.5% with Claude Sonnet 4.6. The human baseline on this benchmark is 72.4%.

OSWorld is the standard benchmark for computer use agents – AI that operates a real computer the same way a human does. It’s how the field measures progress on this problem, and the leaderboard is a mix of frontier labs and research groups.

This result matters to the people we build for. It’s a proxy for a simple commitment: the agent operating their systems is the best one available to do it.

We’re releasing the full system as open source, and the result is on the public leaderboard. What follows is how we did it: the architecture, what worked, what surprised us, and what a benchmark like this can and can’t tell us about computer use in the real world.

The two scores come from the same system. We ran it twice and changed only the model doing the work. This swap revealed interesting insights about model behavior and routing that we return to throughout this post.

Thanks to the team behind OSWorld for building the benchmark, and to the researchers who came before us for the work we learned from.

OSWorld leaderboard (includes agentic frameworks, specialized models, and general models) showing our top two verified scores as of May 26, 2026. Both the Opus 4.7 and Sonnet 4.6 runs outperformed the previous best agent and the human baseline of 72.4%.

What is computer use?

OSWorld runs an agent in a real desktop environment – a real operating system, applications, files, browsers, and settings – and asks it to complete tasks in the same way a person would: clicking, typing, and navigating.

It spans more than 360 tasks across web browsing (Chrome), coding (VS Code), photo editing (GIMP), document and spreadsheet work (LibreOffice), media playback (VLC), email (Thunderbird), and the operating system itself. Many tasks require traversing multiple applications to reach an outcome. Some are deliberately impossible, meaning the agent is also scored on its ability to recognize a dead end.

This is what makes it a meaningful test. Most software was built for people, not programs, and the interface it exposes to the world is the screen. APIs only ever cover a narrow, intentionally chosen slice of what an application can do; the screen exposes all of it.

An agent that can operate the screen isn’t limited to the integrations someone thought to build in advance. It can, in principle, do anything a person sitting at the computer can do.

This is also what separates computer use from the automation that came before it. Robotic process automation (RPA) follows a fixed script: click here, then here, then type this. A computer use agent is the inverse. You give it the goal, and it figures out how to get there from what’s actually in front of it – which is why it survives the messiness that breaks a script. That distinction matters most in real environments, where the systems are old, inconsistent, and were never designed to be automated in the first place.

The pace here has been quick. A year ago the best results on OSWorld hovered around 20%. Today the frontier has cleared the human baseline of 72.4%, with the strongest individual models scoring in the high seventies and full systems now past 83%.

The architecture

The most important decision we made was to keep the system simple.

Previous agentic systems required elaborate scaffolding, such as fine-tuning grounding models to locate screen elements or building long chains of special-case routing logic. We started down some of that road and kept deleting it. Frontier models are now good enough at these base capabilities that the extra machinery no longer earned its place. What remains is a lightweight task controller and three specialized agents, each with a single responsibility.

As shown in the architecture diagram, the task controller acts as a central orchestrator. It treats the agents as tool calls and runs them in a strict sequence:

  • A feasibility gate decides whether the task is possible at all before any work begins.

  • A planner turns a possible task into a short list of target milestones.

  • An executor runs in a continuous inner loop to carry them out.

System architecture flow. A central task controller coordinates the sequence, treating each specialized agent as a tool call. Only the executor interacts directly with the virtual machine to change the state of the desktop.

Each runs on the model best suited to its job:

| Agent | Model | Role |
| --- | --- | --- |
| Feasibility gate | Claude Sonnet 4.6 | Decide whether the task is achievable |
| Planner | Claude Sonnet 4.6 | Break the goal into milestones |
| Executor | Claude Opus 4.7 | Do the work – the only agent that changes the machine’s state |

The table shows the configuration for our highest-scoring run. The executor is the one we swap (Opus 4.7 for our highest score, Sonnet 4.6 for the second) while the gate and planner stayed on Sonnet 4.6 throughout.
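A minimal sketch of that strict sequence, assuming each agent is exposed to the controller as a plain callable. The names `run_task`, `Verdict`, and `MODELS` are illustrative and are not taken from the released system; the model strings are display labels, not API identifiers.

```python
from dataclasses import dataclass
from typing import Callable, List

# Display names for the highest-scoring configuration (hypothetical keys).
MODELS = {
    "feasibility_gate": "Claude Sonnet 4.6",
    "planner": "Claude Sonnet 4.6",
    "executor": "Claude Opus 4.7",  # the one role swapped between runs
}


@dataclass
class Verdict:
    feasible: bool
    reason: str


def run_task(
    task: str,
    gate: Callable[[str], Verdict],
    planner: Callable[[str], List[str]],
    executor: Callable[[str, List[str]], str],
) -> str:
    """Run the three agents in order: gate, then planner, then executor."""
    verdict = gate(task)
    if not verdict.feasible:
        # Stop before any work begins; the machine has not been touched.
        return f"INFEASIBLE: {verdict.reason}"
    milestones = planner(task)
    # Only the executor is allowed to change the state of the desktop.
    return executor(task, milestones)
```

Swapping the executor model then amounts to passing a different `executor` callable while the gate and planner stay the same.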

A smaller model, Claude Haiku 4.5, handles context compaction in the background. It summarizes older steps when a task runs long enough to strain the primary model’s context window.
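One way to picture compaction, as a sketch under stated assumptions: the token estimate (about four characters per token), the `budget` threshold, and the `keep_recent` window are all placeholders, and `summarize` stands in for the call to the smaller model.

```python
from typing import Callable, List


def estimate_tokens(text: str) -> int:
    # Crude assumption: roughly four characters per token.
    return max(1, len(text) // 4)


def compact(
    steps: List[str],
    summarize: Callable[[List[str]], str],
    budget: int,
    keep_recent: int = 3,
) -> List[str]:
    """Replace older steps with a single summary once the history exceeds the budget."""
    total = sum(estimate_tokens(s) for s in steps)
    if total <= budget or len(steps) <= keep_recent:
        return steps
    cut = len(steps) - keep_recent
    older, recent = steps[:cut], steps[cut:]
    # The most recent steps stay verbatim; everything before them is summarized.
    return [summarize(older)] + recent
```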

The feasibility gate decides what not to attempt

Models suffer from unbounded optimism. Their default behavior is to attempt a task even if the request is absurd.

Roughly 1 in 13 tasks on the OSWorld benchmark is deliberately impossible. An agent without a filter will often proceed anyway and fabricate a plausible-looking end state. In real-world applications, this type of hallucination is dangerous because a false positive is much worse than simply asking a human for help.

Before any work begins, the feasibility gate asks a question many agents skip: can this task be done at all?

It uses read-only tools like shell commands and Python scripts to probe the live environment. It checks installed software and connected hardware, then evaluates the prompt against concrete criteria. Our baseline rule for this agent is strict – if a task requires an extreme workaround, it is infeasible. Most real software does not require sleight of hand for basic functions.

A concrete case: one task asks to “automatically adjust the brightness and contrast of this video to match my room’s lighting.” The gate confirmed the VLC media player was installed. It then scanned the machine for ambient light sensors by reading paths like /dev/video* and /sys/class/backlight/. Its verdict had two layers: the machine has no sensor that could measure room lighting, and even if it did, VLC has no feature that could act on the reading. The task was doubly infeasible, so the agent correctly declined it.
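The probing in this case can be sketched as two steps: gather read-only evidence, then judge against explicit criteria. This is an illustration only. The function names and the `app_supports_auto_adjust` flag (standing in for what the model knows about the application) are assumptions, and treating any `/dev/video*` or backlight entry as a candidate sensor is a deliberate simplification.

```python
import glob
import shutil
from typing import Callable, Dict, List


def probe_environment(
    which: Callable[[str], object] = shutil.which,
    list_paths: Callable[[str], List[str]] = glob.glob,
) -> Dict[str, object]:
    """Gather read-only evidence; nothing here modifies the machine."""
    return {
        "vlc_installed": which("vlc") is not None,
        "video_devices": list_paths("/dev/video*"),
        "backlight_entries": list_paths("/sys/class/backlight/*"),
    }


def judge(evidence: Dict[str, object], app_supports_auto_adjust: bool) -> Dict[str, object]:
    """Collect every independent reason the task cannot be done."""
    reasons = []
    if not evidence["vlc_installed"]:
        reasons.append("VLC is not installed")
    if not (evidence["video_devices"] or evidence["backlight_entries"]):
        reasons.append("no device that could measure room lighting")
    if not app_supports_auto_adjust:
        reasons.append("the application has no feature that acts on a light reading")
    return {"feasible": not reasons, "reasons": reasons}
```

Keeping the reasons as a list is what allows a verdict like "doubly infeasible": each failed criterion is reported on its own.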

The agent rejected the prompt based on evidence it actively gathered, rather than going along with the request out of learned sycophancy. Teaching a model this kind of operational humility – the ability to recognize an impossible request and stop before inventing a result – is just as important as teaching it to execute tasks.

The planner describes states, not steps

The planner’s output is a list of milestones, and each milestone describes a state the world should be in, not the recipe for getting there.

For example, a milestone might specify “the spreadsheet is sorted by date and saved” rather than “click the Data menu, then Sort, then OK.” The executor decides the how, while the planner only specifies the where.

This sounds like a small distinction but isn’t. The moment you hand an agent a sequence of clicks, you’ve made it brittle. An unexpected dialog box or a slightly moved menu becomes fatal. A milestone aimed at a state survives all of that, because the executor is free to find another route. It can also create sub-milestones and revise the plan mid-task when it finds evidence the original approach won’t work.
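A small sketch of what "a state, not a step" can look like in code. The `Milestone` type, the dictionary used as a stand-in for desktop state, and the `revise` helper are hypothetical; they only illustrate the idea that a milestone is checked against where the world is, not how it got there.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

State = Dict[str, object]  # stand-in for whatever the executor can observe


@dataclass
class Milestone:
    description: str                  # a target state, e.g. "the spreadsheet is sorted by date and saved"
    reached: Callable[[State], bool]  # checks the state, not the route taken


def pending(plan: List[Milestone], state: State) -> List[Milestone]:
    """Milestones whose target state has not been reached yet."""
    return [m for m in plan if not m.reached(state)]


def revise(plan: List[Milestone], index: int, replacement: Milestone) -> List[Milestone]:
    """Return a new plan with one milestone rewritten, leaving the rest intact."""
    return plan[:index] + [replacement] + plan[index + 1:]
```

Because `reached` never inspects the actions taken, an unexpected dialog box or a moved menu does not invalidate the milestone; only the end state matters.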

The executor revising a plan mid-task. Upon discovering that the data totals sit in column I rather than the planned column J, it rewrites the affected milestones and continues. A fixed sequence of step-by-step instructions would have failed here.

One executor with every tool

Early on in our experimentation, the executor was a router that handed tasks to specialist sub-agents – one for code, one for the GUI, and another for the browser. We ultimately collapsed this into a single executor with every tool available and let the model decide what to reach for.

Historically, large tool schemas caused models to lose focus, making sub-agents necessary. Today’s frontier models can manage extensive toolsets without degrading performance. More importantly, abandoning sub-agents removes artificial boundaries.

Real computer use is highly cross-modal. If a “code agent” runs a script and a “GUI agent” checks the output, the handoff inevitably loses nuance. A unified executor keeps its state continuous across modes. It can write a script, read the terminal, and immediately use the mouse without summarizing its state for someone else.

Its tools fall into a few groups:

  • GUI – click, type, scroll, drag, hotkeys, window management. The basic vocabulary of a person at a keyboard and mouse.

  • Code – direct Python and Bash execution for installing packages, reading and writing files, and anything more naturally done in a few lines than by hand.

  • Browser – direct access to Chrome through the DevTools Protocol for reading page state cleanly.

  • Long-running work – a background-execution tool for commands that take longer than the system’s two-minute limit.

Two of these tools work around strict environmental constraints.

The browser tool must connect to Chrome’s debugging port, which Chrome binds exclusively to the local machine for security. Because our agent runs outside the VM, the executor ships small Python snippets into the machine. These snippets talk to Chrome locally and return only the requested page state.

The long-running tool exists because the environment forcefully kills any command that exceeds 120 seconds. This limit caused some of our strangest failures early on. It is fine for most UI actions but fatal for installing large packages or running slow scripts. Worse, the agent had no way to recover from being abruptly cut off. Instead of holding a connection open and hoping it survives, the executor uses the background tool to launch the command detached. It receives a process handle immediately and polls for completion on its own schedule.
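The launch-detached-then-poll pattern can be sketched with the standard library. This is not the released tool: the function names and the choice to capture output in a temporary log file are assumptions made for illustration.

```python
import subprocess
import sys
import tempfile
import time
from typing import List, Optional, Tuple


def run_background(argv: List[str]) -> Tuple[subprocess.Popen, str]:
    """Launch a command detached and return a handle immediately."""
    log = tempfile.NamedTemporaryFile(mode="w", suffix=".log", delete=False)
    proc = subprocess.Popen(
        argv,
        stdout=log,
        stderr=subprocess.STDOUT,
        stdin=subprocess.DEVNULL,
        start_new_session=True,  # POSIX: run in its own session, detached from the caller
    )
    log.close()  # the child holds its own copy of the file descriptor
    return proc, log.name


def poll(proc: subprocess.Popen, log_path: str) -> Tuple[Optional[int], str]:
    """Non-blocking status check: exit code (None while running) and output so far."""
    code = proc.poll()
    with open(log_path, "r") as fh:
        return code, fh.read()


# Usage: start a slow job, then check on it without holding a connection open.
proc, log_path = run_background(
    [sys.executable, "-c", "import time; time.sleep(0.2); print('installed')"]
)
code, output = poll(proc, log_path)
while code is None:
    time.sleep(0.05)
    code, output = poll(proc, log_path)
```

Each `poll` call returns quickly, so no single call comes anywhere near the 120-second limit even when the underlying job runs far longer.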

The executor recovering from the two-minute limit. A run_bash job is killed at 2:00; the executor recognizes the timeout and re-runs the work detached with run_background, polling until it completes rather than retrying into the same wall.

GUI first

Models naturally default to writing code. Their training data is heavily saturated with programming languages and command-line utilities. As a result, when faced with a task like updating a spreadsheet, a frontier model will instinctively try to use a library to modify the underlying file directly. Code is fast, deterministic, and highly rewarded during post-training.

We explicitly instructed our executor to prioritize GUIs over these programmatic workarounds. We did this partly because OSWorld evaluates the actual desktop environment.

A spreadsheet generated entirely by a Python library might look correct to a human reader, but it often lacks the specific XML metadata or internal object structures that the actual application creates. More importantly, forcing the agent to use the screen aligns with the core purpose of the benchmark. The goal is to measure how well an AI operates a computer the way a human does.

Distribution of state-changing tool calls across our Opus 4.7 run, each shown as a percentage of all state-changing calls. The agent relies heavily on UI actions like clicking and typing to execute tasks but frequently uses Bash and Python for intermediate data extraction.

This does not mean the agent abandons code entirely. As the distribution of tool calls shows, most trajectories blend both methods. A model will frequently drop into a terminal to write a quick Bash or Python script to locate a hidden file, parse a dense .mbox email archive, or extract an exact dollar amount from a PDF. Once the agent holds the parsed data in its context window, it switches back to the mouse and keyboard to complete the task inside the target application. This hybrid uses the model’s strong coding background for heavy data extraction while keeping its final actions grounded in realistic software operation.

Results

Our highest run (using Opus 4.7) scored 83.6% across the benchmark’s 361 tasks – 290 solved completely and 14 partially. The Sonnet 4.6 run scored 81.5%.

The OSWorld leaderboard is a mix of approaches. Some entries are base models reporting their own scores, some are full systems built around a model, some are specialized solely for the benchmark. Others report an average across runs or best across multiple rollouts (best-of-N, noted as bBoN on the leaderboard). Ours are single runs, and they are the two highest recorded to date.

OSWorld leaderboard (top five scores) showing our top two verified scores (in purple) as of May 26, 2026. Both the Opus 4.7 and Sonnet 4.6 runs clear the previous best. The human baseline is 72.4%.

Where the agent is strong

The headline score is a single figure for more than 360 tasks across ten very different domains. The per-domain breakdown is more useful because it shows exactly where the agent is reliable and where it still slips.

The table below displays our best result in each domain across both runs (simulating a production system that routes tasks to the better-performing model). We compare this against the single best score anyone else has posted, as well as the mean of the top five competing agentic frameworks. The delta column compares our score to that mean, where a positive number indicates a lead.

Our best score in each domain across both runs (Opus 4.7 and Sonnet 4.6), against the best result any agent has posted and its holder. The final column compares our score against the mean of the top five competing frameworks. We hold the top score in 7 of the 10 domains.

We hold or share the top spot in 7 of the 10 domains, and we beat the top-five mean in 9 of them. Notably, the two models do not lead the same domains. Opus takes most of them, but Sonnet is ahead on Impress, VS Code, and VLC. This validates treating the model as a swappable component rather than the entire system – different tasks demand different strengths.

The result we care about most is multi-apps. It is the largest category on the benchmark (93 tasks, roughly a quarter of the total) and the hardest, because the work spans several applications and requires the agent to carry state across them.

It is also the domain that looks most like real work. Almost nothing in a real enterprise happens inside a single window. Most systems struggle here. The field clusters in the low-to-mid sixties, while we scored 74.4%, several points clear of the next-best result.

The same pattern holds across standard productivity applications. Calc, Impress, and Writer all land above 91%. VS Code hits 95.7% and operating-system tasks reach 95.8%. These are the environments closest to the actual back-office work we build for.

The model is a component, not a system

Because our harness stayed fixed across our runs, the results isolate exactly what the scaffolding contributes to models with different baselines. We find this dynamic more interesting than the headline scores themselves, especially when factoring in cost.

OSWorld accuracy against output-token cost per task. Base-model figures are sourced from each model’s system card, while the dotted lines connect them to their performance inside our harness. Both models gain accuracy: Opus achieves a 5.6 percentage-point gain for almost no change in cost, while Sonnet gains 9.4 percentage points and actually becomes cheaper than running it alone.

The two lines don’t move the same way. Opus goes from 78.0% to 83.6%, a gain of 5.6 percentage points, while its per-task cost barely moves ($0.24 to $0.25). Sonnet goes from 72.1% to 81.5%, a 9.4 percentage-point jump, and gets cheaper ($0.16 down to $0.15). The harness made the smaller model both more accurate and less expensive than running it alone. In other words, the system does not buy its accuracy with extra tokens.
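For readers who want to check the arithmetic, the deltas follow directly from the figures quoted in this section. The dictionary layout and the `deltas` helper are just a convenient way to lay them out.

```python
# Accuracy in percent and output-token cost per task in USD, as quoted above.
runs = {
    "Opus 4.7": {"base": 78.0, "harness": 83.6, "cost_base": 0.24, "cost_harness": 0.25},
    "Sonnet 4.6": {"base": 72.1, "harness": 81.5, "cost_base": 0.16, "cost_harness": 0.15},
}


def deltas(run):
    """Return (accuracy gain in percentage points, change in cost per task)."""
    gain = round(run["harness"] - run["base"], 1)
    cost_change = round(run["cost_harness"] - run["cost_base"], 2)
    return gain, cost_change


for name, run in runs.items():
    gain, cost_change = deltas(run)
    print(f"{name}: {gain:+.1f} points, {cost_change:+.2f} USD per task")
# Opus 4.7: +5.6 points, +0.01 USD per task
# Sonnet 4.6: +9.4 points, -0.01 USD per task
```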

The most likely reason for this is that the planner spends a small, fixed number of tokens up front to turn the task into milestones. This initial investment buys back far more downstream. Because the plan is clear, the executor minimizes its total steps. It wanders less, gets stuck in fewer loops, and abandons dead-end approaches much sooner.

Seeing the weaker model gain so much ground is also revealing. A lot of what looks like a raw capability gap between two models is actually just vulnerability to minor failure modes. A smaller model is more likely to fail a task because it cannot recover from a tool error, tries to retype a document instead of reading the file, or refuses to stop when a UI approach fails. When the scaffolding solves these mechanical roadblocks, the underlying reasoning abilities of the two models converge. They are much closer in actual capability than their solo scores suggest.

Across our two runs, the full Sonnet system reaches about 98% of the Opus system’s score at roughly 43% of the total cost. The per-task figures above count only output tokens; this 43% is a fuller measure that includes all input and output compute across the run, which is why the gap between the two models is wider here than the output-only numbers suggest. For a benchmark, you just report the highest number. For production, this tradeoff is everything. Every task runs on a budget, and the question is never “what is most capable?” but rather “what is the cheapest model that can reliably complete this work?”

This is why we treat the model as a swappable component. The harness provides structure. The model is simply a setting you tune based on the complexity of the work and the constraints of your budget.

Knowing what it can’t do

The feasibility gate was evaluated on the benchmark’s 28 deliberately impossible tasks. It successfully identified 24 of them while incorrectly flagging only 2 of the 333 feasible tasks as impossible. This translates to an 85.7% recall on infeasible tasks and a 99.4% specificity on valid ones.
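Both figures come straight from the four counts in the paragraph above, treating "infeasible" as the positive class. The helper name is illustrative.

```python
def gate_metrics(infeasible_total, infeasible_caught, feasible_total, feasible_rejected):
    """Recall on impossible tasks and specificity on feasible ones."""
    recall = infeasible_caught / infeasible_total
    specificity = (feasible_total - feasible_rejected) / feasible_total
    return recall, specificity


recall, specificity = gate_metrics(
    infeasible_total=28, infeasible_caught=24, feasible_total=333, feasible_rejected=2
)
print(f"recall={recall:.1%} specificity={specificity:.1%}")
# recall=85.7% specificity=99.4%
```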

The feasibility gate evaluated across OSWorld’s task distribution. It accurately caught 24 impossible tasks while only falsely rejecting 2 valid ones, maintaining high specificity so real work is not abandoned.

Tuning this threshold is difficult. It is easy to build a highly cautious agent that catches every impossible request, but it will inevitably abandon valid work. On the other hand, an agent optimized purely for task completion will fabricate results when it hits a dead end. The hard part is catching the genuine dead-ends without crying wolf on the real ones.

The four impossible tasks the gate failed to catch stem from a mismatch between real-world computing and benchmark constraints. The model attempted these tasks because, in a standard environment, they are possible. A human user would solve them by simply installing an extension, downloading a plugin, or updating the software.

The agent’s pre-trained knowledge correctly recognizes these workarounds as valid. Its reasoning was sound for a real computer; it failed only because OSWorld runs in a tightly restricted sandbox that blocks the external downloads a production environment would allow.

Superscore

During internal testing we ran the system many times over. If we take the best result on each task across all our runs with Opus 4.7, the agent scores 90.9% – 317 of the 361 tasks solved completely, and another 13 solved partially. We call this the superscore. With Sonnet it is 87.6%, over a larger number of runs (8 for Opus, 14 for Sonnet).
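The superscore is a per-task maximum across runs, averaged over tasks. A short sketch of that calculation follows; the three-task data is made up purely to show the mechanics and is not drawn from our runs.

```python
from typing import Dict, List


def superscore(runs: List[Dict[str, float]]) -> float:
    """Mean over tasks of the best per-task score seen in any run."""
    if not runs:
        raise ValueError("need at least one run")
    tasks = set().union(*(run.keys() for run in runs))
    best = {task: max(run.get(task, 0.0) for run in runs) for task in tasks}
    return sum(best.values()) / len(best)


# Toy example: 1.0 is a full solve, 0.5 a partial one, 0.0 a failure.
toy_runs = [
    {"task_a": 1.0, "task_b": 0.0, "task_c": 0.5},
    {"task_a": 0.0, "task_b": 1.0, "task_c": 0.5},
]
print(round(superscore(toy_runs), 3))  # 0.833, although each single run scores 0.5
```

The toy case shows why the number is diagnostic rather than reportable: it credits the system with the union of everything it has ever gotten right.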

Unlike our headline 83.6% from a single run, this composite score is not something we would claim on a leaderboard. It is a diagnostic: it shows that the underlying model has the reasoning, tool priors, and multi-modal understanding required to solve roughly 9 out of 10 tasks.

The gap of roughly 7 percentage points between our single-run score (83.6%) and the superscore (90.9%) is entirely variance. When the agent fails a task that it proves capable of solving in a subsequent run, the failure is rarely a lack of intelligence. It is usually a mechanical execution error.

The point we want to highlight is that when a system can solve over 90% of an evaluation, the ceiling is in sight. The core problem is no longer figuring out if a model is smart enough to do the work. It shifts to building guardrails, self-correction loops, and independent verification systems that guarantee the model does the work correctly on the very first attempt.

Interesting findings

A lot of what we learned came from watching the agent run thousands of times. We noticed finer details in its behavior that we didn’t expect. We turned some into architectural improvements, while others revealed interesting quirks about how these models behave in the wild.

Sonnet’s “just do it” problem

Sonnet does not like to give up. On a task to extract hidden audio from an image – a task that was deliberately impossible, because there was no hidden audio – it worked through every steganography tool it could find, decided the file must be password-protected, and then installed a password cracker and a wordlist to brute-force a password that didn’t exist.

On another task, a website blocked it (our tasks run on AWS, and web servers often treat traffic from AWS as a scraper), and its solution was to install Tor to route around the block. We found out about that one when AWS emailed us at 1 A.M.

We started calling this the “just do it” mentality: the model will keep going long after a person would have stopped and asked whether the task even makes sense. In the real world it is the most dangerous trait by far. An agent operating real systems needs to know when to stop and escalate the problem back to a person, and the instinct to find a way no matter what is exactly the instinct you don’t want.

This is the other side of the feasibility gate from earlier: knowing when not to start is one half, and knowing when to stop is the other. We will have a lot more to say about this in a subsequent post.

Phantom tools

Across both runs, about 0.5% of the agent’s tool calls – 57 out of 11,603 – were calls to tools we never gave it. These were not random, though. They were standard computer use primitives, things like triple_click, ingrained in model weights via heavy reinforcement during computer use training. This prior is so strong that the model reaches for these tools even when they are absent from the provided schema.

Interestingly, when we returned an error saying triple_click doesn’t exist, the model tried again with triple_click_safe, then triple_click_replacement, triple_click_workaround, triple_click_substitute, and at one point triple_click_does_not_exist. It was confident enough the capability should exist that it kept permuting the name looking for one we’d accept, rather than concluding the tool wasn’t there.

We read this as a small window into a tension in heavily RL’d models. The training instills tool-use habits strong enough to override the actual tools provided in context.

It didn’t cost us anything measurable, but it’s a reminder that the model arrives with priors about the tools in its training environment, and those priors don’t always match the tools in the environment it is actually operating in.
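A sketch of how a harness can reject and count calls to tools outside the schema. The `ToolRegistry` class is hypothetical, and listing the available tools in the error is an assumption about what a helpful error might contain, not a description of what our system returns.

```python
from typing import Any, Callable, Dict


class ToolRegistry:
    """Dispatch tool calls by name and keep a tally of calls to unknown tools."""

    def __init__(self, tools: Dict[str, Callable[..., Any]]):
        self.tools = tools
        self.total_calls = 0
        self.phantom_calls = 0

    def call(self, name: str, **kwargs: Any) -> Dict[str, Any]:
        self.total_calls += 1
        if name not in self.tools:
            self.phantom_calls += 1
            return {
                "ok": False,
                "error": f"tool '{name}' does not exist",
                "available": sorted(self.tools),
            }
        return {"ok": True, "result": self.tools[name](**kwargs)}

    def phantom_rate(self) -> float:
        return self.phantom_calls / self.total_calls if self.total_calls else 0.0


# The rate quoted above: 57 phantom calls out of 11,603.
print(f"{57 / 11603:.2%}")  # 0.49%
```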

Letting a model read a file instead of writing code

Early on we noticed the agent doing something wasteful. To look at the contents of a file, it would open a terminal, write a few lines of Python, run them, and read the output. It clearly wanted the file’s contents in front of it, and was writing code as the means to get there. The preference for text made sense: a file’s text contents take far fewer tokens than an image containing the same information.

We gave it a more direct route: tools that read a file and hand the contents straight back, with PDFs and images passed in natively as input the model can see. The behavior was telling us what tool it wished it had. Once we built it, the detour disappeared. We think this is a general pattern worth paying attention to – when a model keeps doing something roundabout, it’s often describing a missing tool, and the fix is to listen rather than to prompt it out of the habit.
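A rough sketch of such a tool, assuming the file type can be guessed from its extension. The return shape (`kind`, `mime`, `data_base64`) is invented for illustration and does not describe the released tool's interface.

```python
import base64
import mimetypes
import tempfile
from pathlib import Path
from typing import Dict

# Types handed to the model as native input instead of decoded text (assumed list).
NATIVE_PREFIXES = ("application/pdf", "image/")


def read_file(path: str) -> Dict[str, str]:
    """Return text files as text; return PDFs and images as base64 attachments."""
    mime, _ = mimetypes.guess_type(path)
    data = Path(path).read_bytes()
    if mime and mime.startswith(NATIVE_PREFIXES):
        return {
            "kind": "native",
            "mime": mime,
            "data_base64": base64.b64encode(data).decode("ascii"),
        }
    return {"kind": "text", "text": data.decode("utf-8", errors="replace")}


# Usage: one call replaces the open-terminal, write-script, run, read-output detour.
with tempfile.TemporaryDirectory() as tmp:
    note = Path(tmp) / "notes.txt"
    note.write_text("total: 42", encoding="utf-8")
    print(read_file(str(note))["text"])  # total: 42
```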

Two strikes, then switch

A model will try the same thing more times than a person would. The agent would frequently fail at an approach, try it again with a slight variation, and fail again, burning through its step budget on a dead end.

So we gave the executor a rule: two failed attempts at the same mechanism are the signal to switch approaches entirely. This small addition to the system prompt changed the shape of our trajectories.
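In our system the rule lives in the prompt, not in code. The same logic can be written as a harness-side counter, shown here only to make the rule concrete; the `StrikeTracker` class and the mechanism labels are hypothetical.

```python
from collections import Counter


class StrikeTracker:
    """Count failures per mechanism and signal when it is time to switch."""

    def __init__(self, limit: int = 2):
        self.limit = limit
        self.failures = Counter()

    def record_failure(self, mechanism: str) -> bool:
        """Record a failed attempt; return True once the mechanism has hit the limit."""
        self.failures[mechanism] += 1
        return self.failures[mechanism] >= self.limit


tracker = StrikeTracker()
print(tracker.record_failure("gui:menu-sort"))  # False: first strike, a retry is fine
print(tracker.record_failure("gui:menu-sort"))  # True: second strike, switch approaches
```

Keying the count on the mechanism rather than the exact action is the point: a slight variation of the same approach still counts as the same door.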

The interesting thing to us is that this had to be said at all. The instinct to keep trying the same door is strong in these models, and a surprising amount of getting good performance is teaching the agent when to stop doing the thing it wants to keep doing. It’s the same lesson as the feasibility gate, one level down.

Where the evaluator falls short

The most useful thing we learned from this benchmark is that grading the work is itself a verification problem – and OSWorld’s grader, built with every advantage, was still wrong often enough to matter.

In a benchmark, the task is fixed and known in advance, the correct end state is decided up front, and the grader is written by people who can see exactly what they’re checking for. Even with all of that, we kept hitting cases where the grader was wrong:

  • Golden files that were themselves wrong. Roughly 10% of the reference outputs we examined had errors (e.g., a misspelled column name, a column sorted the wrong way), making the task impossible to pass because the answer it was graded against was incorrect.

  • Valid work it couldn’t recognize. The agent reached the right end state by an unorthodox route, or through a value the grader expected to be hard-coded, and the check failed because it was looking for one specific path.

  • Answers that had gone stale. Some tasks depend on live data. The agent retrieves the current value, the golden file holds an old one, and the two no longer match.

We built our own verifier to see how much of this was the grader rather than the agent. It’s an agent that checks the executor’s work against the actual end state of the machine instead of against a stored answer, and in our experiments it was often the more reliable judge of whether a task was truly done. But the rigidity of the evaluator capped how much it could move our reported score. When the grader itself is the source of truth, a better judge can’t get credit for being right. We think a flexible verifier is a more interesting system than this benchmark can show, and we’ll cover it in a later post.

The reason we keep returning to this is what the evaluator’s errors actually imply. A benchmark like OSWorld is the friendliest possible setting for verification: the task is fixed, the correct end state is settled up front, and the grader was written by people who could see exactly what they were checking.

Even here, the grader was wrong often enough to matter. The bugs themselves are easy to fix; the lesson is harder to dismiss. If verification is this difficult when the answer is known in advance and someone wrote the check by hand, it does not get easier anywhere else.

Beyond the benchmark

We had a lot of fun with this, and we learned an enormous amount watching the agent run thousands of times – about the models, and the small things around them that turn out to matter more than raw capability.

This benchmark is of course just one signal. It tells you the agent can do the work under clear-cut, pre-defined conditions. But it does not tell you whether the same agent holds up against work that is open-ended and consequential.

That second question is where we spend our time. The work we do in production lives in finance and operations, where the tasks are longer, systems are messier, and the actual information you need is scattered across places that were never meant to talk to each other. A system that succeeds here has to assemble that scattered context into an accurate picture before it acts, operate under real constraints on what it is allowed to do, and produce evidence that its work is correct rather than merely plausible. You have to be able to stand behind the work without an answer key in front of you.

This is why we treat a state-of-the-art score as the baseline. A top result proves the underlying models are finally capable enough to do this work. The scaffolding that makes capability reliable enough to put into production, and the verification that proves it worked, is the harder and more interesting problem we spend more time on. It’s also why we bother topping a leaderboard at all: the people we build for should have the best system in the world operating their work, and the benchmark keeps us honest about that.

If any of that sounds like the thing you want to work on, we’re hiring. We’re especially interested in people thinking about computer use, verifying agent work, self-improving systems, and coherence over long horizons.