# EDIT-Bench Leaderboard
Pass rates for LLMs on EDIT-Bench. Core is a 108-problem subset; Complete is the full 540-problem set. Higher is better; models are ranked by their Complete pass rate.
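Each score is a pass rate: the percentage of benchmark problems for which the model's edit passes the tests. A minimal sketch of that computation, assuming a hypothetical per-problem results file (the JSON schema and file name are illustrative, not EDIT-Bench's actual output format):

```python
import json

def pass_rate(path: str) -> float:
    """Percentage of problems whose generated edit passed all tests.

    Assumes a JSON list of {"problem_id": ..., "passed": bool} records;
    this schema is an assumption for illustration, not EDIT-Bench's format.
    """
    with open(path) as f:
        results = json.load(f)
    return 100.0 * sum(r["passed"] for r in results) / len(results)

# e.g. 72/108 passing problems gives the 66.67% Core score at rank 1,
# and 350/540 gives the corresponding 64.81% Complete score.
```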
| Rank | Model | Core (%) | Complete (%) | Type |
|---|---|---|---|---|
| 1 | claude-sonnet-4 | 66.67 | 64.81 | Closed |
| 2 | claude-sonnet-4.5 | 60.19 | 59.81 | Closed |
| 3 | claude-3.7-sonnet | 62.04 | 59.26 | Closed |
| 4 | claude-3.5-sonnet | 63.89 | 59.07 | Closed |
| 5 | kimi-k2-0905 | 58.33 | 56.48 | Open |
| 6 | glm-4.6 | 55.56 | 56.48 | Open |
| 7 | gpt-o3-mini | 62.96 | 56.30 | Closed |
| 8 | deepseek-chat-v3.1 | 59.26 | 54.26 | Open |
| 9 | gpt-5-mini | 52.78 | 54.07 | Closed |
| 10 | qwen3-coder | 55.56 | 53.89 | Open |
| 11 | gpt-o4-mini (high) | 59.26 | 53.70 | Closed |
| 12 | gpt-4o | 53.70 | 53.33 | Closed |
| 13 | gpt-o3-mini (high) | 53.70 | 52.78 | Closed |
| 14 | gpt-5 (high) | 56.48 | 52.78 | Closed |
| 15 | gpt-o4-mini | 57.41 | 52.78 | Closed |
| 16 | grok-4-fast | 52.78 | 52.04 | Closed |
| 17 | gemini-2.5-flash | 51.85 | 51.85 | Closed |
| 18 | gemini-2.5-pro | 54.63 | 51.30 | Closed |
| 19 | grok-code-fast-1 | 53.70 | 50.93 | Closed |
| 20 | qwen3-coder-flash | 51.85 | 50.74 | Closed |
| 21 | llama-3.3-70b-instruct | 51.85 | 49.63 | Open |
| 22 | llama-4-maverick | 50.93 | 49.44 | Open |
| 23 | gpt-5 | 51.85 | 49.26 | Closed |
| 24 | llama-3.1-405b-instruct | 48.15 | 48.70 | Open |
| 25 | gpt-oss-20b | 50.00 | 48.15 | Open |
| 26 | gpt-4o-mini | 50.00 | 47.78 | Closed |
| 27 | mistral-small-3.2-24b-instruct | 43.52 | 46.30 | Open |
| 28 | qwen3-14b | 47.22 | 45.93 | Open |
| 29 | gpt-5-nano | 47.22 | 45.74 | Closed |
| 30 | qwen-2.5-72b-instruct | 53.70 | 45.19 | Open |
| 31 | mistralai-codestral-2508 | 43.52 | 44.81 | Closed |
| 32 | deepseek-r1-0528 | 41.67 | 44.44 | Open |
| 33 | llama-4-scout | 45.37 | 43.33 | Open |
| 34 | qwen3-30b-a3b | 43.52 | 43.15 | Open |
| 35 | gpt-oss-120b | 44.44 | 41.30 | Open |
| 36 | devstral-medium | 50.00 | 41.11 | Closed |
| 37 | qwen-2.5-coder-32b-instruct | 53.70 | 40.00 | Open |
| 38 | gemma-3-27b-it | 29.63 | 37.04 | Open |
| 39 | devstral-small | 48.15 | 36.67 | Open |
| 40 | llama-3.1-8b-instruct | 37.96 | 34.07 | Open |
| 41 | kimi-dev-72b | 33.33 | 31.67 | Open |
| 42 | gemma-3-12b-it | 23.15 | 30.00 | Open |
| 43 | gemma-3n-e4b-it | 31.48 | 29.26 | Open |
| 44 | glm-4.5 | 29.63 | 29.07 | Open |
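To slice the leaderboard yourself, for example to compare open-weight and closed models, a short pandas sketch works (only a few rows are reproduced here; extend with the full table as needed):

```python
import pandas as pd

# A handful of rows copied from the table above.
rows = [
    ("claude-sonnet-4", 66.67, 64.81, "Closed"),
    ("kimi-k2-0905", 58.33, 56.48, "Open"),
    ("glm-4.6", 55.56, 56.48, "Open"),
    ("gpt-4o", 53.70, 53.33, "Closed"),
]
df = pd.DataFrame(rows, columns=["Model", "Core (%)", "Complete (%)", "Type"])

# Mean Complete pass rate per model type, on this subset of rows.
print(df.groupby("Type")["Complete (%)"].mean())
```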
## Citation
```bibtex
@misc{chi2025editbenchevaluatingllmabilities,
      title={EDIT-Bench: Evaluating LLM Abilities to Perform Real-World Instructed Code Edits},
      author={Wayne Chi and Valerie Chen and Ryan Shar and Aditya Mittal and Jenny Liang and Wei-Lin Chiang and Anastasios Nikolas Angelopoulos and Ion Stoica and Graham Neubig and Ameet Talwalkar and Chris Donahue},
      year={2025},
      eprint={2511.04486},
      archivePrefix={arXiv},
      primaryClass={cs.SE},
      url={https://arxiv.org/abs/2511.04486},
}
```