VIBEPASS, a new benchmark, reveals a fundamental weakness in modern AI coding assistants: even with near-perfect scores on code generation tasks, frontier models falter when it comes to finding and fixing subtle bugs.
The Illusion of Competence
We are living through an era of rapid AI coding capability. Systems like GPT-5, Gemini-3-Pro, and Claude Opus-4.6 routinely exceed 90% on standard code generation benchmarks. AI can now write code; that is not up for debate. Developers are shipping entire features from a handful of prompts. The industry even has a name for this: “vibe coding”, where humans supervise at a distance while models do the building.
VIBEPASS is designed to test a harder question: not whether models can generate code, but whether they can reason about faults in code that looks nearly correct.
“Given a partially correct program with no observable failures, can an LLM judge the solution to be faulty, synthesize a concrete test witnessing the latent fault, and exploit that diagnosis to repair it?”
Based on our evaluation of frontier models across carefully constructed test scenarios, the short answer is no: these systems do not perform this task reliably, not even close.
| 173 Benchmark instances across 76 problems |
| 98% ‘Medium’ or ‘Hard’ coding problems from LeetCode, AtCoder |
| 71% Median success rate of buggy solutions |
| 12 Frontier models evaluated |
| 5 Tasks evaluating fault-targeted reasoning |
What Is VIBEPASS Testing?
VIBEPASS examines a common failure mode in AI-generated code: solutions that mostly work but break on edge cases that matter. The challenge is whether frontier models can catch these bugs and refine the code until the solution is fully correct.
- Judge: Decide if the code is correct or faulty. Given just the problem statement and the solution, classify whether the code contains a bug.
- Test: Generate a fault-triggering (FT) input. Produce a concrete test that (a) satisfies the problem constraints, (b) causes the buggy solution to produce the wrong output, and (c) causes a correct solution to produce the right output (see the sketch after this list).
- Debug: Repair the code. Using the fault-triggering test as a diagnostic, generate a corrected solution that passes all official test cases.
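To make the FT-test criterion concrete, here is a minimal sketch in Python; the function names and the modeling of solutions as plain callables are illustrative assumptions, not the VIBEPASS implementation.

```python
from typing import Any, Callable

def is_fault_triggering(
    test_input: Any,
    expected_output: Any,
    buggy_solution: Callable[[Any], Any],        # the solution under suspicion
    reference_solution: Callable[[Any], Any],    # a known-correct solution
    satisfies_constraints: Callable[[Any], bool],
) -> bool:
    """Check conditions (a)-(c) for a candidate fault-triggering test."""
    if not satisfies_constraints(test_input):
        return False  # (a) the input violates the problem constraints
    if reference_solution(test_input) != expected_output:
        return False  # (c) the predicted output disagrees with correct code
    return buggy_solution(test_input) != expected_output  # (b) fault exposed
```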
This three-step process reflects how software engineers actually work and is a requirement for any coding agent in production. VIBEPASS shows that current models struggle with the Judge → FT-Test → Debug sequence far more than their standard code-generation numbers suggest.

We first study two Fault-Triggering (FT) Test Generation settings: Bug-Aware, where the model knows the program is buggy and must generate a test to expose it, measuring reasoning about known faults; and Bug-Discovery, where it first decides whether the program is buggy and generates a test only when a bug is detected, measuring joint fault detection and test generation. Together, they isolate the effect of bug awareness on fault-targeted reasoning.
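As a rough illustration of how the two settings differ procedurally, consider the sketch below; the `CodingModel` interface and both trial functions are assumptions for exposition, not the paper’s harness.

```python
from typing import Optional, Protocol, Tuple

class CodingModel(Protocol):
    """Assumed model interface (illustrative only)."""
    def judge(self, problem: str, code: str) -> str: ...          # "buggy" or "correct"
    def generate_test(self, problem: str, code: str) -> str: ...  # candidate FT test

def bug_aware_trial(model: CodingModel, problem: str, buggy_code: str) -> str:
    # Bug-Aware: the fault's existence is given, so the model only has to
    # reason about which input exposes it.
    return model.generate_test(problem, buggy_code)

def bug_discovery_trial(model: CodingModel, problem: str, code: str) -> Tuple[str, Optional[str]]:
    # Bug-Discovery: judge first, then test only when a bug is flagged,
    # coupling fault detection and test generation.
    verdict = model.judge(problem, code)
    test = model.generate_test(problem, code) if verdict == "buggy" else None
    return verdict, test
```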
Finding 1: Syntactic Competence ≠ Fault Reasoning
The first result highlights a clear gap between two things that are easy to confuse: writing a valid test case and writing one that actually finds a bug. On average across 12 models, 86.4% of the test inputs were valid and followed all the constraints, but only 61.3% actually triggered the fault. That is a 25-point gap between writing something that passes as a valid test and writing a test that actually exposes a bug.
| 86% Avg. syntactically valid inputs |
| 61% Avg. fault-triggering tests |
| 25pp Gap: validity vs. discrimination |
| 54pp Best-to-worst model spread |
There are two failure modes here. The first is the “fault hypothesis” gap, a 23-point difference between writing a valid input and finding one that actually triggers the bug. The smaller, 2-point “output validation” gap reflects the difficulty of predicting the right output once that input is found.
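The decomposition is easy to state in code. In the sketch below only the 86.4% and 61.3% averages come from the text; the intermediate rate of valid, bug-exposing inputs (63.4%) is inferred from the quoted 23-point gap rather than reported directly.

```python
def gap_decomposition(valid_pct: float, ft_input_pct: float, ft_test_pct: float):
    """Split the 25-point validity-to-discrimination gap into its two parts."""
    fault_hypothesis_gap = valid_pct - ft_input_pct     # valid input -> bug-exposing input
    output_validation_gap = ft_input_pct - ft_test_pct  # bug-exposing input -> right expected output
    return fault_hypothesis_gap, output_validation_gap

fh, ov = gap_decomposition(86.4, 63.4, 61.3)  # 63.4 is inferred, see above
print(f"fault hypothesis: {fh:.1f}pp, output validation: {ov:.1f}pp")
# -> fault hypothesis: 23.0pp, output validation: 2.1pp (the quoted "2-point" gap)
```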
KEY INSIGHT
The problem is not producing valid inputs or predicting outputs; it is the inability to reason about which input will expose the fault. Once a model finds the right test input, it usually gets the expected output right. The hard part is fault-targeted reasoning, and models still cannot do it reliably.
The differences between models are substantial. Claude Opus-4.6 reaches 80% on bug-aware fault-triggering tests, while Gemini-3.1-Pro and GPT-5.2 fail to reach 70%, despite all being labeled “flagship reasoning models”. GPT-5-Nano presents another interesting case: it excels at rule-following (VI: 92.5%) but struggles to actually uncover bugs (DIO: 52%). That 40-percentage-point gap highlights a crucial distinction between appearing competent and being genuinely effective. Together, these results suggest that fault-targeted reasoning is a distinctly discriminative capability.

Finding 2: Bug Detection Is a Second Bottleneck
These first results assume the model already knows a bug exists. In a more realistic “bug-discovery” scenario, where the AI must decide for itself whether the code is broken, the numbers drop. Judgment accuracy falls to 71.4%, meaning models misjudge nearly 30% of the time. Overall success lands just under 50% when bug detection is not handed to them upfront.
A nuanced trend emerges: for strong models, being told that a bug exists barely matters (Sonnet-4.6) and can even slightly hurt performance (e.g. GPT-OSS-120B). They use their own confidence to decide whether to act; when unsure, they pass, keeping the gap between the two settings (DIO → J+DIO) minimal. Weaker models like GPT-5-Nano show the reverse, actually benefiting from bug-awareness because their internal judgment is too unreliable to stand on its own.
| 0.6% DIO → J+DIO gain (GPT-OSS-120B) |
| 0.6% DIO → J+DIO drop (Sonnet-4.6) |
| 2.9% DIO → J+DIO drop (GPT-5.2) |
| 15% DIO → J+DIO drop (GPT-5-Mini) |
| 34.1% DIO → J+DIO drop (GPT-5-Nano) |
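One way to read these DIO → J+DIO numbers is through the scoring rule the joint setting implies; the sketch below is an assumption about that rule, made for intuition, not the paper’s exact metric.

```python
def joint_success(is_buggy: bool, verdict: str, test_triggers_fault: bool) -> bool:
    """J+DIO-style credit: the verdict must be right, and for buggy code the
    generated test must also expose the fault (assumed scoring, see above)."""
    if verdict == "correct":
        return not is_buggy                    # clean code: passing on it is the win
    return is_buggy and test_triggers_fault    # buggy code: must flag it AND prove it

# A model with an unreliable judge loses credit before test generation even
# starts, which is why the weakest judges show the largest drops above.
```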
Finding 3: Test-Guided Repair Is Not What You’d Hope
The natural assumption is simple: hand a model a concrete failing test and debugging gets easier. The VIBEPASS results contradict this. Three repair conditions were evaluated, contrasting unguided repair against external and self-generated diagnostic context: NoTest (the model knows a bug exists but has no test case), ExtTest (the model is externally supplied with a fault-triggering test generated by the same model), and IntTest (the model generates its own test internally first). A sketch of the three conditions follows.
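Here is a hedged sketch of how the three conditions might assemble the repair context; the enum, prompt wording, and helper names are invented for illustration.

```python
from enum import Enum
from typing import Callable, Optional

class RepairCondition(Enum):
    NO_TEST = "NoTest"    # bug acknowledged, no diagnostic test
    EXT_TEST = "ExtTest"  # fault-triggering test supplied from outside
    INT_TEST = "IntTest"  # model writes its own test before repairing

def build_repair_context(
    problem: str,
    buggy_code: str,
    condition: RepairCondition,
    external_test: Optional[str] = None,
    generate_own_test: Optional[Callable[[], str]] = None,
) -> str:
    base = f"{problem}\n\nBuggy solution:\n{buggy_code}\n\n"
    if condition is RepairCondition.NO_TEST:
        return base + "This solution contains a bug. Repair it."
    if condition is RepairCondition.EXT_TEST:
        return base + f"This test exposes the bug:\n{external_test}\nRepair it."
    # IntTest: the diagnostic is produced inside the same reasoning pass.
    own_test = generate_own_test()
    return base + f"You wrote this failing test:\n{own_test}\nNow repair the bug."
```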
The most surprising finding: when test quality is controlled for (i.e. analyzing only instances where both the external and self-generated tests successfully trigger the fault), self-generated tests from strong reasoners outperform external ones (e.g. GPT-5.2-Codex gains 16.9pp), whereas for weak reasoners self-generated tests underperform explicitly provided ones (e.g. Gemini-3.1-Flash-Lite loses 28.1pp).
KEY INSIGHT
Test provenance matters: strong models leverage the implicit context alignment that comes when a test emerges from the same chain of reasoning that produces the fix, whereas for weaker models, the extra test-generation work splits the reasoning capacity available for debugging.

The broader picture is more unsettling. Repair performance under all three test conditions falls below even code generation, the baseline task models are most practiced at. NoTest barely trails it, but providing any test makes things worse: ExtTest drops further, and IntTest drops furthest of all.
The intuition that a concrete failing test makes a model debug better simply does not hold. A test that fails to genuinely expose the fault does not just fail to help; it actively misleads. Models anchor to the bad diagnostic signal and patch in the wrong direction, performing worse than if they had received no guidance at all. They currently lack the robustness to filter bad diagnostic signals from good ones.
KEY INSIGHT
Fault-targeted program repair for partially correct solutions is harder than code synthesis from scratch, even for strong reasoning models. Test-guided repair is bottlenecked by test quality. Adding a test is not a free upgrade; it is a bet. And right now, that bet loses more often than it wins.

Finding 4: The Pipeline Has Two Cliffs
VIBEPASS maps the full fault-reasoning pipeline as a cumulative waterfall, with each step requiring all prior ones to hold. Performance erodes at every step, but two transitions are steeper than the others.
The first cliff: moving from valid output prediction to fault-triggering input generation. This drop averages −14.7 percentage points. It is the fault hypothesis bottleneck made quantitative, the point where general execution ability runs out and causal, fault-targeted reasoning must take over.
The second cliff: moving from a valid fault-triggering test to a successful repair. This drop averages −21.2 percentage points. Even models that successfully expose a bug through a targeted test often fail to translate that diagnosis into a working fix.
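The waterfall bookkeeping itself is simple; in the sketch below the stage names and rates are placeholders chosen only so the two steps reproduce the quoted −14.7pp and −21.2pp averages, and they are not the paper’s per-stage numbers.

```python
def waterfall_drops(stage_rates: dict[str, float]) -> dict[str, float]:
    """Per-step change (in percentage points) along a cumulative pipeline,
    where each stage presupposes success at all earlier stages."""
    names = list(stage_rates)
    return {f"{a} -> {b}": round(stage_rates[b] - stage_rates[a], 1)
            for a, b in zip(names, names[1:])}

stages = {  # placeholder rates, not reported figures
    "valid output prediction": 78.0,
    "fault-triggering input": 63.3,   # first cliff: -14.7pp
    "valid FT test": 61.3,
    "successful repair": 40.1,        # second cliff: -21.2pp
}
print(waterfall_drops(stages))
```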

Why This Matters Beyond Benchmarks
The gap VIBEPASS exposes is not an academic curiosity. The architecture of modern AI coding systems, where one LLM generates code, another evaluates, and another patches, depends on exactly the capability VIBEPASS shows is deficient: reasoning from “this code produces some wrong outputs” to “here is a concrete witness to the fault, and here is why the logic is wrong.”
THE REAL-WORLD IMPLICATION
Every “vibe coded” codebase contains bugs that look correct to the test suite. The question is not whether AI can write code that passes tests. The question is whether AI can find the tests its own code would fail. VIBEPASS shows that today’s models largely cannot.
Code-generation benchmarks and isolated test pass rates are poor proxies for real capability. What matters is fault-targeted evaluation, where diagnosis and repair are tightly coupled within the same agent context. VIBEPASS underscores why: naive pipelines that simply add more tests can backfire. Invalid or non-discriminating tests, those that do not expose the fault, often mislead models into producing incorrect patches, degrading overall performance rather than improving it.
What Would Progress Look Like?
The larger fault hypothesis gap suggests that training signals focused on fault-targeted reasoning, not just code generation or output prediction, are the most likely to move the needle. Techniques like fault-localization training, contrastive examples of buggy vs. correct behavior, and reinforcement learning on execution feedback appear more relevant than scaling code-generation capacity alone.
The finding that self-generated tests outperform external ones under controlled conditions hints at an architectural preference: agentic systems may benefit from keeping the fault hypothesis and the repair within the same reasoning context, rather than decomposing them across model calls with test cases as the handoff artifact.
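As a thought experiment only, a single-context loop might look like the sketch below; every interface here (`generate_test`, `repair`, `run_test`) is hypothetical and not part of VIBEPASS.

```python
from typing import Callable, Optional

def debug_in_one_context(
    generate_test: Callable[[str], str],   # hypothetical model call: context -> test
    repair: Callable[[str], str],          # hypothetical model call: context -> patch
    run_test: Callable[[str, str], bool],  # hypothetical sandbox: (patch, test) -> passed?
    problem: str,
    code: str,
    max_rounds: int = 3,
) -> Optional[str]:
    """Keep the fault hypothesis, the test, and the fix in one growing
    context, instead of handing the test across separate model calls."""
    context = f"Problem:\n{problem}\n\nCandidate solution:\n{code}\n"
    for _ in range(max_rounds):
        test = generate_test(context)          # the hypothesis made concrete
        context += f"\nFailing test:\n{test}\n"
        patch = repair(context)                # the fix sees its own diagnostic
        if run_test(patch, test):
            return patch
        context += "\nThat patch still fails; refine the hypothesis.\n"
    return None
```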
The Bottom Line
VIBEPASS arrives at a critical moment. As AI coding systems evolve from autocomplete tools into autonomous agents that write, review, and deploy full modules, fault reasoning becomes essential. This capability is distinct from general coding skill, and today’s frontier models still struggle, achieving below 50% success on end-to-end bug discovery and localization.
The vibe coders can pass the standard tests. They cannot yet reliably pass the vibe check.
FURTHER READING
VIBEPASS paper: https://arxiv.org/abs/2603.15921
Dataset: huggingface.co/datasets/Salesforce/vibepass
Evaluation code: github.com/SalesforceAIResearch/vibepass
The benchmark draws problems from LiveCodeBench and is designed to update regularly to resist contamination.

