GameDevBench

Evaluating Agentic Capabilities Through Game Development

Wayne Chi¹, Yixiong Fang¹, Arnav Yayavaram¹, Siddharth Yayavaram¹, Seth Karten², Qiuhong Anna Wei¹, Runkun Chen¹, Alexander Wang¹, Valerie Chen¹, Ameet Talwalkar¹, Chris Donahue¹

¹Carnegie Mellon University ²Princeton University

Paper Code

GameDevBench task examples across 3D Graphics, 2D Graphics, Gameplay, and UI categories

Example tasks from GameDevBench spanning 3D Graphics, 2D Graphics, Gameplay, and UI categories in the Godot engine.

TL;DR

GameDevBench is the first benchmark for evaluating LM agents on game development tasks.
GameDevBench features 132 tasks set in the Godot engine, collected from web and video tutorials across four skill categories: gameplay logic (collision detectors, character controllers), 3D graphics and animation (material tuning, skeletal animation), 2D graphics and animation (sprite animation, TileMap setup), and user interface (HUD layout, menu navigation).
Tasks require agents to navigate large codebases and manipulate multimodal assets such as shaders, sprites, and animations.
GameDevBench shows that even simple multimodal feedback can drastically improve performance.

Example Task

In this example, the goal is to populate an empty 3D scene with a water depth visualization, including environment lighting, shader-driven water plane, background spheres, and a camera. This is a 3D graphics and animations task that focuses on the scene editor. The figure shows both the editor-based and code-based solution approaches.

GameDevBench 3D example: water depth visualization task showing editor and code solutions

Leaderboard

Rank	Model	Org	Framework	Score
1	gpt-5.3-codex-high	OpenAI	Codex	59.1
2	gemini-3-pro-preview	Google	Gemini CLI	54.5
3	gemini-3-flash-preview	Google	Gemini CLI	52.3
4	claude-sonnet-4-5-20250929	Anthropic	OpenHands	51.1
5	claude-opus-4-5-20251101	Anthropic	Claude Code	50.0
6	gpt-5.1-codex-max	OpenAI	OpenHands	49.6
7	claude-opus-4-6	Anthropic	Claude Code	49.2
8	claude-sonnet-4-5-20250929	Anthropic	Claude Code	47.7
9	gpt-5.1-codex-max	OpenAI	Codex	41.7
10	gemini-3-flash-preview	Google	OpenHands	40.5
11	kimi-k2.5	Moonshot AI	OpenHands	38.9
12	claude-haiku-4-5-20251001	Anthropic	OpenHands	31.3
13	claude-haiku-4-5-20251001	Anthropic	Claude Code	25.8
14	qwen3-vl-235b-a22b-instruct	Alibaba	OpenHands	8.3

* Best multimodal support for each configuration.

Citation

@misc{chi2026gamedevbenchevaluatingagenticcapabilities,
      title={GameDevBench: Evaluating Agentic Capabilities Through Game Development},
      author={Wayne Chi and Yixiong Fang and Arnav Yayavaram and Siddharth Yayavaram and Seth Karten and Qiuhong Anna Wei and Runkun Chen and Alexander Wang and Valerie Chen and Ameet Talwalkar and Chris Donahue},
      year={2026},
      eprint={2602.11103},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2602.11103},
}