GameDevBench

Evaluating Agentic Capabilities Through Game Development

ICML 2026

Wayne Chi¹, Yixiong Fang¹, Arnav Yayavaram¹, Siddharth Yayavaram¹, Seth Karten², Qiuhong Anna Wei¹, Runkun Chen¹, Alexander Wang¹, Valerie Chen¹, Ameet Talwalkar¹, Chris Donahue¹

¹Carnegie Mellon University ²Princeton University


Leaderboard

gemini-3-pro-preview [Gemini CLI]

53.8% ±5.4

gpt-5.4 [Codex]

52.0% ±5.4

gemini-3-flash-preview [Gemini CLI]

46.8% ±5.4

gpt-5.4-mini [Codex]

43.2% ±5.3

claude-sonnet-4-5 [Claude Code]

34.8% ±5.1

kimi-k2.5 [OpenHands]

20.7% ±4.4

claude-haiku-4-5 [Claude Code]

18.6% ±4.2

qwen3.5-397b [OpenHands]

5.4% ±2.4


* pass@1 (%) on all 333 tasks, using each model's best multimodal feedback configuration in its best harness (ICML 2026 camera-ready results). Error bars are 95% confidence intervals.
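The page does not say how the 95% confidence intervals are computed. As a rough sketch, assuming a normal-approximation (Wald) interval for a binomial pass rate over the 333 tasks, the margins can be checked as below. The function name `wald_margin` and the choice of z = 1.96 are illustrative assumptions, not taken from the paper.

```python
import math

N_TASKS = 333  # benchmark size stated on the leaderboard


def wald_margin(pass_rate: float, n_tasks: int, z: float = 1.96) -> float:
    """Half-width of a normal-approximation interval for a pass rate.

    Assumption: tasks are treated as independent Bernoulli trials.
    pass_rate is a fraction in [0, 1]; the result is also a fraction.
    """
    return z * math.sqrt(pass_rate * (1.0 - pass_rate) / n_tasks)


# pass@1 values (in percent) copied from the leaderboard above.
leaderboard = {
    "gemini-3-pro-preview [Gemini CLI]": 53.8,
    "gpt-5.4 [Codex]": 52.0,
    "gemini-3-flash-preview [Gemini CLI]": 46.8,
    "gpt-5.4-mini [Codex]": 43.2,
    "claude-sonnet-4-5 [Claude Code]": 34.8,
    "kimi-k2.5 [OpenHands]": 20.7,
    "claude-haiku-4-5 [Claude Code]": 18.6,
    "qwen3.5-397b [OpenHands]": 5.4,
}

for name, pct in leaderboard.items():
    margin_pct = 100.0 * wald_margin(pct / 100.0, N_TASKS)
    print(f"{name}: {pct:.1f}% ±{margin_pct:.1f}")
```

Under this assumption the computed margins round to the values shown on the leaderboard. For pass rates close to 0% or 100%, a Wilson interval is generally a safer choice than the Wald approximation.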

333

Tasks

Tutorials

Skill Categories

53.8%

Best Agent Score

TL;DR

The first game-dev benchmark for agents

2D Graphics 33% 3D Graphics 27% UI 20% Gameplay 20%

333 real tasks in the Godot engine — shaders, sprites, animations, and scenes, not just code.

Agents struggle

53.8% best agent score

Even the strongest agent fails nearly half the benchmark.

Multimodality is the bottleneck

Gameplay

51.4%

3D Graphics

38.4%

2D Graphics

33.0%

32.0%

The more visual understanding a task demands, the more agents fail.

Visual feedback works

41.1% → 52.0% (+10.9 percentage points)

Letting agents see screenshots and gameplay video consistently boosts performance (GPT-5.4 shown).

Example Task

In this example, the goal is to populate an empty 3D scene with a water depth visualization, including environment lighting, a shader-driven water plane, background spheres, and a camera. This is a 3D graphics and animations task that focuses on the scene editor. The figure shows both the editor-based and the code-based solution approaches.

GameDevBench 3D example: water depth visualization task showing editor and code solutions

Citation

@inproceedings{chi2026gamedevbenchevaluatingagenticcapabilities,
      title={GameDevBench: Evaluating Agentic Capabilities Through Game Development},
      author={Wayne Chi and Yixiong Fang and Arnav Yayavaram and Siddharth Yayavaram and Seth Karten and Qiuhong Anna Wei and Runkun Chen and Alexander Wang and Valerie Chen and Ameet Talwalkar and Chris Donahue},
      booktitle={International Conference on Machine Learning (ICML)},
      year={2026},
      eprint={2602.11103},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2602.11103},
}