GameDevBench

Evaluating Agentic Capabilities Through Game Development

ICML 2026

Wayne Chi¹, Yixiong Fang¹, Arnav Yayavaram¹, Siddharth Yayavaram¹, Seth Karten², Qiuhong Anna Wei¹, Runkun Chen¹, Alexander Wang¹, Valerie Chen¹, Ameet Talwalkar¹, Chris Donahue¹
¹Carnegie Mellon University    ²Princeton University

Leaderboard

gemini-3-pro-preview [Gemini CLI]: 53.8% ±5.4
gpt-5.4 [Codex]: 52.0% ±5.4
gemini-3-flash-preview [Gemini CLI]: 46.8% ±5.4
gpt-5.4-mini [Codex]: 43.2% ±5.3
claude-sonnet-4-5 [Claude Code]: 34.8% ±5.1
kimi-k2.5 [OpenHands]: 20.7% ±4.4
claude-haiku-4-5 [Claude Code]: 18.6% ±4.2
qwen3.5-397b [OpenHands]: 5.4% ±2.4

* pass@1 (%) on all 333 tasks, using each model's best multimodal feedback configuration in its best harness (ICML 2026 camera-ready results). Error bars are 95% confidence intervals.
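The page does not say how the 95% confidence intervals are computed. As a minimal sketch, assuming a normal-approximation (Wald) binomial interval over the 333 tasks, the margin for a given pass@1 can be estimated as below. The function name `wald_margin` and the choice of interval are assumptions for illustration, not the authors' stated method; under that assumption the computed margins match the reported ones after rounding to one decimal place.

```python
import math

N_TASKS = 333  # number of benchmark tasks, from the page

# Reported pass@1 (%) and 95% CI half-width (%) from the leaderboard.
REPORTED = {
    "gemini-3-pro-preview": (53.8, 5.4),
    "gpt-5.4": (52.0, 5.4),
    "gemini-3-flash-preview": (46.8, 5.4),
    "gpt-5.4-mini": (43.2, 5.3),
    "claude-sonnet-4-5": (34.8, 5.1),
    "kimi-k2.5": (20.7, 4.4),
    "claude-haiku-4-5": (18.6, 4.2),
    "qwen3.5-397b": (5.4, 2.4),
}


def wald_margin(pass_rate_pct: float, n_tasks: int = N_TASKS, z: float = 1.96) -> float:
    """Half-width (in percentage points) of a Wald interval for a pass rate.

    Assumption: each task is an independent pass/fail trial, so the pass rate
    is a binomial proportion with standard error sqrt(p * (1 - p) / n).
    """
    p = pass_rate_pct / 100.0
    standard_error = math.sqrt(p * (1.0 - p) / n_tasks)
    return 100.0 * z * standard_error


if __name__ == "__main__":
    for model, (score, reported_margin) in REPORTED.items():
        estimate = round(wald_margin(score), 1)
        print(f"{model}: {score}% ±{estimate} (reported ±{reported_margin})")
```

The Wald interval is the simplest choice and is reasonable here because the sample is fairly large; for scores close to 0% or 100%, a Wilson interval would usually be a safer estimate.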

333 Tasks
88 Tutorials
4 Skill Categories
53.8% Best Agent Score

TL;DR

The first game-dev benchmark for agents
2D Graphics 33%, 3D Graphics 27%, UI 20%, Gameplay 20%

333 real tasks in the Godot engine: shaders, sprites, animations, and scenes, not just code.

Agents struggle
53.8% best agent score

Even the strongest agent fails nearly half the benchmark.

Multimodality is the bottleneck
Gameplay: 51.4%
3D Graphics: 38.4%
2D Graphics: 33.0%
UI: 32.0%

The more visual understanding a task demands, the more agents fail.

Visual feedback works
41.1% → 52.0% (+10.9)

Letting agents see screenshots and gameplay video consistently boosts performance (GPT-5.4 shown).
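The gain quoted for GPT-5.4 can be expressed either as an absolute difference in percentage points or as a relative improvement over the starting score. The sketch below computes both, assuming that 41.1% is the score without visual feedback and 52.0% is the score with it (the page does not label the two figures explicitly). The helper name `improvement` is hypothetical.

```python
def improvement(before_pct: float, after_pct: float) -> tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = after_pct - before_pct
    relative = absolute / before_pct * 100.0
    return round(absolute, 1), round(relative, 1)


if __name__ == "__main__":
    # Assumed reading: 41.1% without visual feedback, 52.0% with it.
    absolute_gain, relative_gain = improvement(41.1, 52.0)
    print(f"absolute: +{absolute_gain} points, relative: +{relative_gain}%")
```

Under that reading, the +10.9 on the page is the absolute gain; the relative gain is larger because the starting score is well below 100%.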

Example Task

In this example, the goal is to populate an empty 3D scene with a water depth visualization, including environment lighting, a shader-driven water plane, background spheres, and a camera. This is a 3D graphics and animations task that focuses on the scene editor. The figure shows both the editor-based and code-based solution approaches.

GameDevBench 3D example: water depth visualization task showing editor and code solutions

Citation

@inproceedings{chi2026gamedevbenchevaluatingagenticcapabilities,
      title={GameDevBench: Evaluating Agentic Capabilities Through Game Development},
      author={Wayne Chi and Yixiong Fang and Arnav Yayavaram and Siddharth Yayavaram and Seth Karten and Qiuhong Anna Wei and Runkun Chen and Alexander Wang and Valerie Chen and Ameet Talwalkar and Chris Donahue},
      booktitle={International Conference on Machine Learning (ICML)},
      year={2026},
      eprint={2602.11103},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2602.11103},
}