Key Highlights
- EVMbench uses 120 real vulnerabilities from 40 audits to score agents across detect, patch, and exploit modes.
- In exploit mode, GPT-5.3-Codex scored 72.2%, versus 31.9% for GPT-5, underscoring rapid capability gains.
- OpenAI says it will release tasks and tooling and commit $10M in API credits to accelerate defensive security research.
OpenAI and crypto venture firm Paradigm have introduced EVMbench, a benchmark designed to measure how capable AI “agents” are at finding, fixing, and exploiting high-severity vulnerabilities in Ethereum Virtual Machine (EVM) smart contracts.
The benchmark is built around 120 curated vulnerabilities pulled from 40 audit repositories, with most cases sourced from competitive audit settings, including Code4rena-style contests.
According to the official announcement, EVMbench also includes scenarios inspired by the security audit process for Tempo, a payments-focused Layer 1 aimed at stablecoin transfers, an area OpenAI expects to become more relevant as “agentic” payments grow.
What EVMbench actually tests
EVMbench scores models in three modes that mirror real security workflows:
- Detect: Agents audit a contract repository and are graded on how many known, ground-truth issues they identify, with scoring tied to the audit’s reward structure.
- Patch: Agents modify vulnerable code and must remove exploitability while preserving intended functionality, verified through automated tests and exploit checks.
- Exploit: Agents attempt an end-to-end “fund-draining” attack against deployed contracts in a sandboxed chain environment, with grading based on transaction replay and on-chain verification.
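The detect and patch criteria above could be expressed roughly as follows. This is a minimal sketch, not EVMbench's actual grader: the `Finding` structure, the reward-weighted recall formula, and all function names are assumptions made for illustration, since the announcement only says detection scoring is tied to the audit's reward structure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """A ground-truth vulnerability from an audit (hypothetical structure)."""
    issue_id: str
    reward: float  # share of the audit payout assigned to this issue


def detect_score(ground_truth: list[Finding], reported_ids: set[str]) -> float:
    """Assumed metric: fraction of total audit reward covered by reported issues.

    Reported IDs that match no ground-truth issue are simply ignored here.
    """
    total = sum(f.reward for f in ground_truth)
    if total == 0:
        return 0.0
    found = sum(f.reward for f in ground_truth if f.issue_id in reported_ids)
    return found / total


def patch_passes(functional_tests_pass: bool, exploit_still_works: bool) -> bool:
    """A patch counts only if behavior is preserved AND the exploit now fails."""
    return functional_tests_pass and not exploit_still_works


# Illustrative data only; these issue IDs and rewards are made up.
ground_truth = [
    Finding("H-01", 600.0),
    Finding("H-02", 300.0),
    Finding("H-03", 100.0),
]
reported = {"H-01", "H-03", "X-99"}  # "X-99" is not a ground-truth issue

score = detect_score(ground_truth, reported)
print(f"detect score: {score:.2f}")
print("patch accepted:", patch_passes(True, False))
```

Under this assumed weighting, an agent that stops after finding one high-reward bug still scores well below one that covers the full set, which is the behavior the detect mode is meant to expose.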
To keep results reproducible and to reduce “grader gaming,” OpenAI says it built a Rust-based harness that deterministically replays transactions and restricts unsafe RPC methods. Exploit tasks run on a local Anvil instance rather than live networks and focus on historical, publicly documented vulnerabilities.
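To illustrate why restricting RPC methods matters: a local Anvil node exposes state-manipulation methods (for example `anvil_setBalance` or `anvil_impersonateAccount`) that would let an agent fake a successful drain without a real exploit. The sketch below shows one way a harness might filter such calls. It is written in Python for brevity rather than Rust, and the blocked prefixes, function names, and return shape are assumptions, not the actual EVMbench harness.

```python
import json

# Assumed blocklist: JSON-RPC namespaces that allow direct state manipulation
# or debugging on a local dev node. The real harness's list is not published here.
BLOCKED_PREFIXES = ("anvil_", "hardhat_", "evm_", "debug_")


def is_allowed(method: str) -> bool:
    """Return True if the method is outside every blocked namespace."""
    return not method.startswith(BLOCKED_PREFIXES)


def filter_request(raw: str) -> dict:
    """Decide whether to forward a JSON-RPC request or reject it.

    Returns a dict with "forward": True and the parsed request, or
    "forward": False and a JSON-RPC 2.0 error response for the caller.
    """
    request = json.loads(raw)
    method = request.get("method", "")
    if is_allowed(method):
        return {"forward": True, "request": request}
    return {
        "forward": False,
        "response": {
            "jsonrpc": "2.0",
            "id": request.get("id"),
            # -32601 is the standard JSON-RPC "Method not found" code.
            "error": {"code": -32601, "message": f"method {method} is not allowed"},
        },
    }


# A normal transaction submission is forwarded; a balance cheat is rejected.
ok = filter_request(json.dumps(
    {"jsonrpc": "2.0", "id": 1, "method": "eth_sendRawTransaction", "params": ["0x"]}
))
blocked = filter_request(json.dumps(
    {"jsonrpc": "2.0", "id": 2, "method": "anvil_setBalance", "params": []}
))
print(ok["forward"], blocked["forward"])
```

A production harness would more likely use an explicit allowlist of permitted methods, since a prefix blocklist can miss newly added cheat methods.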
Early results: Exploit is where agents look strongest
OpenAI reports that agents currently perform best in the exploit setting, where the objective is explicit: keep iterating until funds are drained. In that mode, GPT-5.3-Codex via Codex CLI scored 72.2%, a sharp jump from GPT-5 at 31.9%.
Detection and patching remain meaningfully weaker: OpenAI notes agents sometimes stop after finding a single bug in detection mode, and patching is hard because it requires eliminating subtle vulnerabilities without breaking functionality.
Why this matters for crypto security teams
For security researchers and DeFi builders, EVMbench lands amid a growing worry: as AI agents get better at code execution and iterative planning, they could compress the time from “bug discovery” to “on-chain exploit.”
OpenAI’s paper frames this as a capability that can benefit defenders but also increases offensive risk in a domain where exploits can be fast and irreversible.
OpenAI positions EVMbench as both a measurement tool and a push toward defense. The company says it is expanding safeguards for dual-use cybersecurity capabilities, including monitoring and “trusted access” controls. It is also investing in ecosystem programs such as the private beta of its security research agent “Aardvark” and committing $10 million in API credits through its Cybersecurity Grant Program to support defensive work.
If EVMbench becomes widely adopted, it could also shape how auditors and protocols evaluate AI-assisted tooling, separating models that can merely describe vulnerabilities from those that can reliably prove exploitability, ship safe patches, and avoid false positives in production-grade codebases.
