GameDevBench
Evaluating Agentic Capabilities Through Game Development
ICML 2026
Leaderboard
* pass@1 (%) on all 333 tasks — best multimodal feedback configuration per model, in its best harness (ICML 2026 camera-ready results). Error bars are 95% confidence intervals.
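As a rough illustration of how a pass@1 percentage and a 95% confidence interval can be derived from per-task outcomes, here is a minimal sketch. It assumes each task is scored as a single pass/fail attempt and uses a Wilson score interval; the page does not say which interval method the authors used, so treat that choice as an assumption. The function name `pass_at_1_with_ci` and the counts in the usage line are hypothetical, not results from the benchmark.

```python
import math


def pass_at_1_with_ci(passed: int, total: int, z: float = 1.96):
    """Return (rate, lower, upper) in percent.

    Assumption: one attempt per task, so pass@1 is simply passed / total.
    The interval is a Wilson score interval; z = 1.96 corresponds to ~95%.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= passed <= total:
        raise ValueError("passed must be between 0 and total")

    p = passed / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom

    lower = max(0.0, center - half)
    upper = min(1.0, center + half)
    return 100 * p, 100 * lower, 100 * upper


# Hypothetical counts for illustration only (not a reported result).
rate, lower, upper = pass_at_1_with_ci(passed=180, total=333)
print(f"pass@1 = {rate:.1f}% (95% CI {lower:.1f} to {upper:.1f})")
```

The Wilson interval is chosen here because it stays within 0 to 100% even when a model passes very few or nearly all tasks, which a plain normal approximation does not guarantee.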
TL;DR
333 real tasks in the Godot engine — shaders, sprites, animations, and scenes, not just code.
Even the strongest agent fails nearly half the benchmark.
The more visual understanding a task demands, the more agents fail.
Letting agents see screenshots and gameplay video consistently boosts performance (GPT-5.4 shown).
Example Task
In this example, the goal is to populate an empty 3D scene with a water depth visualization, including environment lighting, a shader-driven water plane, background spheres, and a camera. This is a 3D graphics and animations task that focuses on the scene editor. The figure shows both the editor-based and code-based solution approaches.
Citation
@inproceedings{chi2026gamedevbenchevaluatingagenticcapabilities,
title={GameDevBench: Evaluating Agentic Capabilities Through Game Development},
author={Wayne Chi and Yixiong Fang and Arnav Yayavaram and Siddharth Yayavaram and Seth Karten and Qiuhong Anna Wei and Runkun Chen and Alexander Wang and Valerie Chen and Ameet Talwalkar and Chris Donahue},
booktitle={International Conference on Machine Learning (ICML)},
year={2026},
eprint={2602.11103},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2602.11103},
}