GameDevBench

Evaluating Agentic Capabilities Through Game Development

GameDevBench

Evaluating Agentic Capabilities Through Game Development

Wayne Chi1, Yixiong Fang1, Arnav Yayavaram1, Siddharth Yayavaram1, Seth Karten2, Qiuhong Anna Wei1, Runkun Chen1, Alexander Wang1, Valerie Chen1, Ameet Talwalkar1, Chris Donahue1
1Carnegie Mellon University    2Princeton University
GameDevBench task examples across 3D Graphics, 2D Graphics, Gameplay, and UI categories

Example tasks from GameDevBench spanning 3D Graphics, 2D Graphics, Gameplay, and UI categories in the Godot engine.

TL;DR

  • GameDevBench is the first benchmark for evaluating LM agents on game development tasks.
  • GameDevBench features 132 tasks set in the Godot engine, collected from web and video tutorials across four skill categories: gameplay logic (collision detectors, character controllers), 3D graphics and animation (material tuning, skeletal animation), 2D graphics and animation (sprite animation, TileMap setup), and user interface (HUD layout, menu navigation).
  • Tasks require agents to navigate large codebases and manipulate multimodal assets such as shaders, sprites, and animations.
  • GameDevBench shows that even simple multimodal feedback can drastically improve performance.

Example Task

In this example, the goal is to populate an empty 3D scene with a water depth visualization, including environment lighting, shader-driven water plane, background spheres, and a camera. This is a 3D graphics and animations task that focuses on the scene editor. The figure shows both the editor-based and code-based solution approaches.

GameDevBench 3D example: water depth visualization task showing editor and code solutions

Leaderboard

Rank Model Org Framework Score
1 gpt-5.3-codex-high OpenAI Codex 59.1
2 gemini-3-pro-preview Google Gemini CLI 54.5
3 gemini-3-flash-preview Google Gemini CLI 52.3
4 claude-sonnet-4-5-20250929 Anthropic OpenHands 51.1
5 claude-opus-4-5-20251101 Anthropic Claude Code 50.0
6 gpt-5.1-codex-max OpenAI OpenHands 49.6
7 claude-opus-4-6 Anthropic Claude Code 49.2
8 claude-sonnet-4-5-20250929 Anthropic Claude Code 47.7
9 gpt-5.1-codex-max OpenAI Codex 41.7
10 gemini-3-flash-preview Google OpenHands 40.5
11 kimi-k2.5 Moonshot AI OpenHands 38.9
12 claude-haiku-4-5-20251001 Anthropic OpenHands 31.3
13 claude-haiku-4-5-20251001 Anthropic Claude Code 25.8
14 qwen3-vl-235b-a22b-instruct Alibaba OpenHands 8.3

* Best multimodal support for each configuration.

Citation

@misc{chi2026gamedevbenchevaluatingagenticcapabilities,
      title={GameDevBench: Evaluating Agentic Capabilities Through Game Development},
      author={Wayne Chi and Yixiong Fang and Arnav Yayavaram and Siddharth Yayavaram and Seth Karten and Qiuhong Anna Wei and Runkun Chen and Alexander Wang and Valerie Chen and Ameet Talwalkar and Chris Donahue},
      year={2026},
      eprint={2602.11103},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2602.11103},
}