EDIT-Bench Logo

EDIT-Bench: Evaluating LLM Abilities to Perform Real-World Instructed Code Edits

*Equal contribution
¹Carnegie Mellon University    ²UC Berkeley

TL;DR

EDIT-Bench (Evaluation of Developer Instructed Tasks) is the first benchmark for evaluating LLM code-editing capabilities built on real-world edit contexts and instructions collected in the wild. We gathered data from nearly 500 developers via a VS Code extension, yielding 540 problems spanning 5 natural languages and 2 programming languages. EDIT-Bench tests models on diverse, context-dependent problems that require understanding the user instruction, the surrounding code context, the highlighted code, and the cursor position, reflecting how developers actually use AI coding assistants.
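
For concreteness, here is a minimal Python sketch of what one such problem could look like as data. The field names (instruction, code, highlighted, cursor, ...) and the render_prompt helper are hypothetical illustrations of the context signals listed above, not EDIT-Bench's actual schema or harness.

from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class EditProblem:
    """Hypothetical representation of one instructed code-edit problem."""
    instruction: str                        # natural-language edit request from the developer
    code: str                               # file contents at the time of the request
    highlighted: Optional[Tuple[int, int]]  # (start, end) offsets of the selection, if any
    cursor: int                             # character offset of the cursor
    natural_language: str                   # e.g. "English"
    programming_language: str               # e.g. "Python"

def render_prompt(p: EditProblem) -> str:
    """Assemble a simple edit prompt from the problem's context signals."""
    selection = p.code[p.highlighted[0]:p.highlighted[1]] if p.highlighted else ""
    return (
        f"File ({p.programming_language}):\n{p.code}\n\n"
        f"Highlighted code:\n{selection or '(none)'}\n"
        f"Cursor offset: {p.cursor}\n\n"
        f"Instruction: {p.instruction}\n"
        "Rewrite the code to satisfy the instruction."
    )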

EDIT-Bench Leaderboard

Pass rates for LLMs on EDIT-Bench. Core: 108 problems (subset) | Complete: 540 problems (full set). Higher is better.

Rank | Model                          | Core (%) | Complete (%) | Type
   1 | claude-sonnet-4                |    66.67 |        64.81 | Closed
   2 | claude-sonnet-4.5              |    60.19 |        59.81 | Closed
   3 | claude-3.7-sonnet              |    62.04 |        59.26 | Closed
   4 | claude-3.5-sonnet              |    63.89 |        59.07 | Closed
   5 | kimi-k2-0905                   |    58.33 |        56.48 | Open
   6 | glm-4.6                        |    55.56 |        56.48 | Open
   7 | gpt-o3-mini                    |    62.96 |        56.30 | Closed
   8 | deepseek-chat-v3.1             |    59.26 |        54.26 | Open
   9 | gpt-5-mini                     |    52.78 |        54.07 | Closed
  10 | qwen3-coder                    |    55.56 |        53.89 | Open
  11 | gpt-o4-mini (high)             |    59.26 |        53.70 | Closed
  12 | gpt-4o                         |    53.70 |        53.33 | Closed
  13 | gpt-o3-mini (high)             |    53.70 |        52.78 | Closed
  14 | gpt-5 (high)                   |    56.48 |        52.78 | Closed
  15 | gpt-o4-mini                    |    57.41 |        52.78 | Closed
  16 | grok-4-fast                    |    52.78 |        52.04 | Closed
  17 | gemini-2.5-flash               |    51.85 |        51.85 | Closed
  18 | gemini-2.5-pro                 |    54.63 |        51.30 | Closed
  19 | grok-code-fast-1               |    53.70 |        50.93 | Closed
  20 | qwen3-coder-flash              |    51.85 |        50.74 | Closed
  21 | llama-3.3-70b-instruct         |    51.85 |        49.63 | Open
  22 | llama-4-maverick               |    50.93 |        49.44 | Open
  23 | gpt-5                          |    51.85 |        49.26 | Closed
  24 | llama-3.1-405b-instruct        |    48.15 |        48.70 | Open
  25 | gpt-oss-20b                    |    50.00 |        48.15 | Open
  26 | gpt-4o-mini                    |    50.00 |        47.78 | Closed
  27 | mistral-small-3.2-24b-instruct |    43.52 |        46.30 | Open
  28 | qwen3-14b                      |    47.22 |        45.93 | Open
  29 | gpt-5-nano                     |    47.22 |        45.74 | Closed
  30 | qwen-2.5-72b-instruct          |    53.70 |        45.19 | Open
  31 | mistralai-codestral-2508       |    43.52 |        44.81 | Closed
  32 | deepseek-r1-0528               |    41.67 |        44.44 | Open
  33 | llama-4-scout                  |    45.37 |        43.33 | Open
  34 | qwen3-30b-a3b                  |    43.52 |        43.15 | Open
  35 | gpt-oss-120b                   |    44.44 |        41.30 | Open
  36 | devstral-medium                |    50.00 |        41.11 | Closed
  37 | qwen-2.5-coder-32b-instruct    |    53.70 |        40.00 | Open
  38 | gemma-3-27b-it                 |    29.63 |        37.04 | Open
  39 | devstral-small                 |    48.15 |        36.67 | Open
  40 | llama-3.1-8b-instruct          |    37.96 |        34.07 | Open
  41 | kimi-dev-72b                   |    33.33 |        31.67 | Open
  42 | gemma-3-12b-it                 |    23.15 |        30.00 | Open
  43 | gemma-3n-e4b-it                |    31.48 |        29.26 | Open
  44 | glm-4.5                        |    29.63 |        29.07 | Open

Full results: Core (108) | Complete (540)
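
The leaderboard numbers are pass rates, i.e. the percentage of problems for which a model's edit passes evaluation. A minimal sketch of that aggregation, assuming a hypothetical list of per-problem pass/fail results (not the actual EDIT-Bench harness):

def pass_rate(results: list[bool]) -> float:
    """Percentage of problems whose generated edit passed evaluation."""
    return 100.0 * sum(results) / len(results) if results else 0.0

# Arithmetic check: passing 350 of the 540 Complete problems gives ~64.81%.
print(round(pass_rate([True] * 350 + [False] * 190), 2))  # 64.81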

Citation

@misc{chi2025editbenchevaluatingllmabilities,
      title={EDIT-Bench: Evaluating LLM Abilities to Perform Real-World Instructed Code Edits},
      author={Wayne Chi and Valerie Chen and Ryan Shar and Aditya Mittal and Jenny Liang and Wei-Lin Chiang and Anastasios Nikolas Angelopoulos and Ion Stoica and Graham Neubig and Ameet Talwalkar and Chris Donahue},
      year={2025},
      eprint={2511.04486},
      archivePrefix={arXiv},
      primaryClass={cs.SE},
      url={https://arxiv.org/abs/2511.04486},
}