Spatial Benchmark

Spatial Benchmark is a benchmark for measuring how well language models reason about 3D space.

It focuses on the kinds of spatial tasks that show up in games, XR, simulation, scene generation, UI placement, and agent tooling: coordinates, transforms, hierarchy effects, visibility, collision-style constraints, and vector math. The benchmark is built on Inspect AI and packaged as a single Inspect task with a custom typed scorer.

This repository is intended to be publishable and reusable. It includes:

A benchmark dataset with 80 questions across 7 categories
A custom evaluator that supports multiple spatial answer formats
Context-backed questions that inject real scene JSON into the prompt
An executable JSON DSL for vector-method tasks
A browser-based editor for maintaining the dataset
Small helper scripts for running evaluations and browsing results

Why this benchmark exists

Many model benchmarks test language, coding, or generic reasoning, but spatial reasoning is usually underrepresented or mixed into broader suites. This benchmark is designed to isolate that skill and make it inspectable.

The goal is not to test whether a model "sounds spatial". The goal is to test whether it can return correct, machine-checkable answers for concrete spatial problems.

At a glance

80 benchmark questions in v1.json
7 top-level categories
10 questions that use bundled scene JSON as runtime context
11 questions that require returning an executable JSON DSL program instead of a computed answer
Built as an Inspect AI task in spatial_eval.py
Editable through a local browser UI served by editor.py

What the benchmark covers

Category	Questions	What it covers
Coordinate & Geometric Math	10	Scene placement, interpolation, scale, proportion, compositional placement
Transformations & Perspective	13	Local vs. world space, coordinate systems, egocentric placement, Euler reasoning
Constraint Satisfaction	10	Packing, layout constraints, non-overlap, geometric feasibility
Spatial Arrangement & Layout	13	World-space arrangement and world-space UI layout
Relational & Topological	11	Occlusion, frustum culling, directional relations
Hierarchical Structuring	11	Parent-child semantics, destruction effects, pivot placement
Linear Algebra & Vector Methods	12	Dot/cross products, reflection, projection, steering, signed angles

Some questions are pure numeric geometry. Others use exported scene graphs from sample-scene.json and sampe-ui-scene.json. The benchmark also includes tasks where the model must synthesize a small JSON program that is executed and tested by the scorer.

Example questions

Coordinate geometry

Write the coordinates for 12 objects evenly spaced on the upper semicircle of radius 10 centered at the origin on the XZ plane. Use Y=0 for every object, include both endpoints at (-10, 0, 0) and (10, 0, 0), and list the coordinates in order from left endpoint to right endpoint.

ANSWER: [[-10.0, 0, 0], [-9.595, 0, 2.817], ..., [10.0, 0, 0]]
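The reference answer above can be reproduced with a few lines of Python. This is only an illustrative sketch: the function name `semicircle_points` is hypothetical, and it assumes "upper" means the half of the circle with non-negative Z, which matches the second coordinate in the reference answer.

```python
import math

def semicircle_points(count: int, radius: float) -> list[list[float]]:
    """Evenly spaced points on the z >= 0 half of a circle in the XZ plane, left to right."""
    points = []
    for i in range(count):
        # Sweep from 180 degrees (left endpoint) down to 0 degrees (right endpoint).
        angle = math.pi * (1 - i / (count - 1))
        x = radius * math.cos(angle)
        z = radius * math.sin(angle)
        # Round to 3 decimals; adding 0.0 turns any -0.0 into 0.0.
        points.append([round(x, 3) + 0.0, 0.0, round(z, 3) + 0.0])
    return points

points = semicircle_points(12, 10.0)
print(points[0], points[1], points[-1])
```

Including both endpoints means the 12 points are separated by 180/11 degrees, not 180/12.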

Context-backed scene question

Where would be the best place to put a chair for someone using the Desk? Give your answer as [x, y, z].

This question is paired with sample-scene.json, which is injected into the prompt at evaluation time.

ANSWER: [0, 0, -100]

Layout validation

Please arrange the UI elements so that every movable element is fully visible and no two movable elements overlap. You may change only the local positions of Searchbar, Tile1, Tile2, and Tile3. Do not move or rotate Backplate. Return only the objects that need to change.

ANSWER: {
  "Searchbar": { "position": { "x": 0, "y": 22.091691970825195, "z": 1 } },
  "Tile3": { "position": { "x": 25, "y": -5.365461349487305, "z": 3 } }
}

Executable vector-method task

An AI guard in a stealth game should spot the player only if the player is inside the guard's vision cone. Inputs: eye_pos, guard_forward, player_pos, half_fov_deg. Return a valid JSON DSL program, not a computed answer.

ANSWER: {
  "return": "visible",
  "steps": [
    {"out":"to_player","op":"sub","args":["player_pos","eye_pos"]},
    {"out":"dir_n","op":"normalize","args":["to_player"]},
    {"out":"guard_forward_n","op":"normalize","args":["guard_forward"]},
    {"out":"score","op":"dot","args":["guard_forward_n","dir_n"]},
    {"out":"limit","op":"cos_deg","args":["half_fov_deg"]},
    {"out":"visible","op":"gte","args":["score","limit"]}
  ]
}
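For readers less familiar with the DSL, the program above corresponds step by step to the plain Python below. This is a reading aid, not part of the benchmark; the function names are hypothetical, and the sketch assumes the player is never exactly at `eye_pos` (which would make the direction vector zero-length).

```python
import math

def normalize(v):
    """Scale a vector to unit length (assumes a non-zero vector)."""
    length = math.sqrt(sum(c * c for c in v))
    return [c / length for c in v]

def guard_sees_player(eye_pos, guard_forward, player_pos, half_fov_deg):
    to_player = [p - e for p, e in zip(player_pos, eye_pos)]    # op: sub
    dir_n = normalize(to_player)                                # op: normalize
    guard_forward_n = normalize(guard_forward)                  # op: normalize
    score = sum(a * b for a, b in zip(guard_forward_n, dir_n))  # op: dot
    limit = math.cos(math.radians(half_fov_deg))                # op: cos_deg
    return score >= limit                                       # op: gte

# Player directly in front of the guard, inside a 30 degree half-angle cone.
print(guard_sees_player([0, 0, 0], [0, 0, -1], [0, 0, -5], 30))
```

The comparison works on cosines rather than angles: a larger dot product means a smaller angle, so "inside the cone" becomes `score >= limit`.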

Repository layout

spatial-benchmark/
├── spatial_eval.py          # Inspect AI task, loader, scorer, answer validators
├── v1.json                  # Benchmark dataset
├── editor.py                # Local dataset editor server
├── editor.html              # Browser UI for editing questions
├── index.html               # Static viewer for browsing questions and answers
├── sample-scene.json        # Bundled scene export used by context-backed questions
├── sampe-ui-scene.json      # Bundled UI scene export used by context-backed questions
├── run_benchmark.sh         # Canonical benchmark runner with logs + run metadata
├── results.md               # Historical notes from earlier runs
└── requirements.txt         # Core Python dependencies

Quickstart

1. Create an environment

git clone https://github.com/specs-devs/spatial-benchmark.git
cd spatial-benchmark

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

If you want to use the local editor UI, also install:

pip install fastapi uvicorn

2. Set provider credentials

Set the API keys for whichever model providers you want to use. For example:

export ANTHROPIC_API_KEY=...
export OPENAI_API_KEY=...
export GOOGLE_API_KEY=...

Depending on the model strings you use, you may also need credentials for other providers referenced in the helper scripts.

3. Run the benchmark

inspect eval spatial_eval.py@spatial_benchmark --model openai/gpt-5.4

You can also run multiple models in one command:

inspect eval spatial_eval.py@spatial_benchmark \
  --model anthropic/claude-sonnet-4-6 \
          openai/gpt-5.4 \
          google/gemini-3-flash-preview

Then open the Inspect results UI:

inspect view

Helper scripts

One shell helper is included:

run_benchmark.sh is the canonical runner. It loads .env, runs models one at a time, stores raw stdout logs plus Inspect logs under results/<timestamp>/, and writes run metadata and a per-model status summary for later analysis.

run_benchmark.sh is a convenience wrapper, not part of the benchmark definition itself. Feel free to edit it to match your own provider setup and preferred model list.
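If you prefer to drive runs from Python, the one-model-at-a-time pattern can be sketched as below. This is not a port of run_benchmark.sh: the function `plan_runs`, the per-model directory naming, and the timestamp format are assumptions for illustration. The sketch only builds the commands and does not execute them.

```python
from datetime import datetime, timezone
from pathlib import Path

def plan_runs(models, results_root="results", timestamp=None):
    """Return one (model, log_dir, command) tuple per model without running anything."""
    stamp = timestamp or datetime.now(timezone.utc).strftime("%Y%m%d-%H%M%S")
    plans = []
    for model in models:
        # Hypothetical layout: one Inspect log directory per model under results/<timestamp>/.
        log_dir = Path(results_root) / stamp / model.replace("/", "_")
        command = [
            "inspect", "eval", "spatial_eval.py@spatial_benchmark",
            "--model", model,
            "--log-dir", log_dir.as_posix(),
        ]
        plans.append((model, log_dir, command))
    return plans

plans = plan_runs(["openai/gpt-5.4"], timestamp="20250101-000000")
print(" ".join(plans[0][2]))
```

Each command list can be passed to `subprocess.run` if you want to execute it.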

Dataset format

The dataset lives in v1.json. Each entry is a JSON object with these fields:

Field	Required	Description
`input`	Yes	The benchmark question shown to the model
`target`	Yes	The canonical typed target or validator payload
`category`	Yes	Top-level category
`subcategory`	Yes	More specific grouping
`context_file`	No	Relative path to a bundled JSON file to inject into the prompt
`context_label`	No	Human-readable label for the injected context
`validation`	No	Optional alternate validator target used instead of raw `target`

Example question records:

{
  "input": "Where would be the best place to put a chair for someone using the `Desk`? Give your answer as [x, y, z].",
  "target": [0, 0, -100],
  "validation": {
    "__type__": "vector3_constraints",
    "x": {"value": 0},
    "y": {"value": 0},
    "z": {"min": -130, "max": -70}
  },
  "category": "Coordinate & Geometric Math",
  "subcategory": "Scene Placement",
  "context_file": "sample-scene.json",
  "context_label": "`sample-scene` sample scene"
}

{
  "input": "An AI guard in a stealth game should spot the player only if the player is inside the guard's vision cone. Inputs: eye_pos, guard_forward, player_pos, half_fov_deg. Return a valid JSON DSL program, not a computed answer.",
  "target": {
    "__type__": "dsl",
    "required_ops": ["sub", "normalize", "dot", "cos_deg", "gte"],
    "tests": [
      {
        "inputs": {
          "eye_pos": [0, 0, 0],
          "guard_forward": [0, 0, -1],
          "player_pos": [0, 0, -5],
          "half_fov_deg": 30
        },
        "expected": true
      }
    ]
  },
  "category": "Linear Algebra & Vector Methods",
  "subcategory": "Dot Product & Angles"
}
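A small check like the following can catch malformed records before a run. It is a sketch based only on the field table above; `missing_fields` and `scoring_payload` are hypothetical helpers, not functions from spatial_eval.py.

```python
REQUIRED_FIELDS = ("input", "target", "category", "subcategory")

def missing_fields(record: dict) -> list[str]:
    """List the required fields that a dataset record does not define."""
    return [name for name in REQUIRED_FIELDS if name not in record]

def scoring_payload(record: dict):
    """Pick the payload to score against: `validation` when present, else `target`."""
    return record["validation"] if "validation" in record else record["target"]

record = {
    "input": "Where would be the best place to put a chair for someone using the `Desk`?",
    "target": [0, 0, -100],
    "validation": {
        "__type__": "vector3_constraints",
        "x": {"value": 0},
        "y": {"value": 0},
        "z": {"min": -130, "max": -70},
    },
    "category": "Coordinate & Geometric Math",
    "subcategory": "Scene Placement",
}

print(missing_fields(record), scoring_payload(record)["__type__"])
```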

Prompt format and answer protocol

Every sample is evaluated with a shared system prompt defined in spatial_eval.py. Models are instructed to reason normally but end with a final line of the form:

ANSWER: <value>

The scorer extracts the last ANSWER: block from the completion and evaluates only that answer payload.
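The extraction step can be illustrated with a short sketch. The actual parsing logic lives in spatial_eval.py and may differ; `extract_answer` is a hypothetical name, and falling back to a raw string when the payload is not valid JSON is an assumption.

```python
import json

def extract_answer(completion: str):
    """Return the payload after the last 'ANSWER:' marker, or None if there is none."""
    marker = "ANSWER:"
    index = completion.rfind(marker)
    if index == -1:
        return None
    payload = completion[index + len(marker):].strip()
    try:
        # Numbers, booleans, lists, and objects parse as JSON.
        return json.loads(payload)
    except json.JSONDecodeError:
        # Assumption: anything else is treated as a plain string answer.
        return payload

completion = "First guess.\nANSWER: [1, 2, 3]\nOn reflection:\nANSWER: [0, 0, -100]"
print(extract_answer(completion))
```

Using the last marker lets a model revise an earlier guess, and taking everything after it allows multi-line JSON answers.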

Supported answer forms include:

Numbers
Booleans
Strings
3D coordinates
Lists of coordinates
JSON objects
Executable JSON DSL programs

Scoring and answer types

The benchmark uses a typed scorer rather than plain string matching. Floating-point comparisons use an absolute tolerance of 0.01, so tiny engine float residue like -1.19209e-07 is treated as zero.

Basic answer types

Type	Matching rule
`bool`	Accepts boolean JSON or equivalent `true` / `false` string answers
`number`	Absolute numeric tolerance
`string`	Case-insensitive match after normalization
`list[float]`	Element-wise numeric comparison
`list[list[float]]`	Nested element-wise numeric comparison
`dict`	Recursive key and value comparison, case-insensitive on keys
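The matching rules in the table can be approximated by one recursive function. This is a simplified sketch, not the scorer's code: `matches` is a hypothetical name, and string normalization is reduced here to trimming and lowercasing.

```python
TOLERANCE = 0.01  # absolute tolerance stated above

def matches(answer, target) -> bool:
    """Compare an extracted answer to a typed target using the rules in the table."""
    if isinstance(target, bool):
        if isinstance(answer, str):
            answer = {"true": True, "false": False}.get(answer.strip().lower())
        return answer is target
    if isinstance(target, (int, float)):
        is_number = isinstance(answer, (int, float)) and not isinstance(answer, bool)
        return is_number and abs(answer - target) <= TOLERANCE
    if isinstance(target, str):
        return isinstance(answer, str) and answer.strip().lower() == target.strip().lower()
    if isinstance(target, list):
        return (
            isinstance(answer, list)
            and len(answer) == len(target)
            and all(matches(a, t) for a, t in zip(answer, target))
        )
    if isinstance(target, dict):
        if not isinstance(answer, dict):
            return False
        # Keys are compared case-insensitively; values recurse.
        lowered = {str(key).lower(): value for key, value in answer.items()}
        return len(lowered) == len(target) and all(
            key.lower() in lowered and matches(lowered[key.lower()], value)
            for key, value in target.items()
        )
    return answer == target

print(matches([-1.19209e-07, 0, 1], [0, 0, 1]))
```

The `bool` branch comes first because Python treats `True` and `False` as integers, which would otherwise let `1` match `true`.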

Custom validation types

The scorer also supports spatially aware validators:

Validator	Purpose
`vector3_constraints`	Accept any 3D coordinate satisfying exact-axis or min/max constraints
`degrees_between`	Accept a 3D vector if it falls within an angular threshold of an expected direction
`unordered_coordinate_set`	Accept a set of coordinates regardless of output order
`wall_frame_layout`	Validate frame placements against wall membership, bounds, and non-overlap
`ui_layout_changes`	Validate world-space UI edits by projecting into camera space and checking visibility/non-overlap
`dsl`	Parse, statically inspect, execute, and test a JSON method program

Some questions use validation as an alternate scoring target. In those cases the raw target remains useful as a reference answer, but the actual score is decided by the validator payload.
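To make the first two validators concrete, here is a rough sketch of how such checks might work. The `vector3_constraints` payload shape follows the dataset example in this README; the function names, the `degrees_between` signature, and the way the tolerance is applied to `min`/`max` bounds are assumptions.

```python
import math

TOLERANCE = 0.01

def check_vector3_constraints(answer, constraints) -> bool:
    """Accept [x, y, z] if every axis meets its exact-value or min/max rule."""
    if not isinstance(answer, (list, tuple)) or len(answer) != 3:
        return False
    for axis, value in zip("xyz", answer):
        rule = constraints.get(axis, {})
        if "value" in rule and abs(value - rule["value"]) > TOLERANCE:
            return False
        if "min" in rule and value < rule["min"] - TOLERANCE:
            return False
        if "max" in rule and value > rule["max"] + TOLERANCE:
            return False
    return True

def angle_between_deg(a, b) -> float:
    """Angle in degrees between two non-zero 3D vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    # Clamp to guard against floating-point drift just outside [-1, 1].
    cosine = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    return math.degrees(math.acos(cosine))

def check_degrees_between(answer, expected, max_degrees) -> bool:
    """Accept a direction within max_degrees of the expected direction."""
    return angle_between_deg(answer, expected) <= max_degrees

chair = {"x": {"value": 0}, "y": {"value": 0}, "z": {"min": -130, "max": -70}}
print(check_vector3_constraints([0, 0, -100], chair))
print(check_degrees_between([1, 0, 0], [1, 1, 0], 50))
```

With the chair constraints, any answer on the line x = 0, y = 0 with z between -130 and -70 passes, not only the reference target.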

The JSON DSL

The DSL questions in this benchmark are designed to distinguish between:

Models that can compute one-off numeric answers
Models that can express reusable spatial procedures

Each DSL answer must be a single JSON object with:

steps: an array of step objects
return: the final variable name, literal, or nested expression to return

Each step has the form:

{"out":"tmp_name","op":"op_name","args":[...]}

The evaluator currently supports these operations:

add, sub, mul, div, dot, cross, norm, length, normalize,
sin_deg, cos_deg, acos_deg, atan2_deg, abs, vec3,
gte, lte, gt, lt, eq

Scoring for DSL questions has two parts:

The returned program must contain the required operations for that task.
The program must execute correctly against the bundled test cases.
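The two-part scoring can be illustrated with a toy interpreter. This sketch is not the benchmark's evaluator: it implements only a subset of the operations, supports only a variable name or literal in `return` (no nested expressions), and compares results with plain equality. The names `run_program` and `score_program` are hypothetical.

```python
import math

def _normalize(v):
    length = math.sqrt(sum(c * c for c in v))
    return [c / length for c in v]

# Subset of the operation list, enough for the vision-cone task.
OPS = {
    "add": lambda a, b: [x + y for x, y in zip(a, b)],
    "sub": lambda a, b: [x - y for x, y in zip(a, b)],
    "dot": lambda a, b: sum(x * y for x, y in zip(a, b)),
    "length": lambda a: math.sqrt(sum(c * c for c in a)),
    "normalize": _normalize,
    "cos_deg": lambda d: math.cos(math.radians(d)),
    "gte": lambda a, b: a >= b,
    "lte": lambda a, b: a <= b,
}

def resolve(arg, env):
    # Strings name inputs or earlier step outputs; anything else is a literal.
    return env[arg] if isinstance(arg, str) else arg

def run_program(program, inputs):
    env = dict(inputs)
    for step in program["steps"]:
        args = [resolve(a, env) for a in step["args"]]
        env[step["out"]] = OPS[step["op"]](*args)
    return resolve(program["return"], env)

def score_program(program, target) -> bool:
    # Part 1: static check that every required operation appears.
    used = {step["op"] for step in program["steps"]}
    if not set(target["required_ops"]) <= used:
        return False
    # Part 2: execute against every bundled test case.
    try:
        return all(run_program(program, t["inputs"]) == t["expected"]
                   for t in target["tests"])
    except (KeyError, TypeError, ZeroDivisionError):
        return False

program = {
    "return": "visible",
    "steps": [
        {"out": "to_player", "op": "sub", "args": ["player_pos", "eye_pos"]},
        {"out": "dir_n", "op": "normalize", "args": ["to_player"]},
        {"out": "guard_forward_n", "op": "normalize", "args": ["guard_forward"]},
        {"out": "score", "op": "dot", "args": ["guard_forward_n", "dir_n"]},
        {"out": "limit", "op": "cos_deg", "args": ["half_fov_deg"]},
        {"out": "visible", "op": "gte", "args": ["score", "limit"]},
    ],
}
target = {
    "required_ops": ["sub", "normalize", "dot", "cos_deg", "gte"],
    "tests": [{"inputs": {"eye_pos": [0, 0, 0], "guard_forward": [0, 0, -1],
                          "player_pos": [0, 0, -5], "half_fov_deg": 30},
               "expected": True}],
}
print(score_program(program, target))
```

Checking required operations separately from execution means a program that hard-codes the expected outputs fails even if it passes the test cases.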

Context-backed questions

10 questions use bundled scene exports as prompt context. For those questions, the loader reads the referenced JSON file and prepends it to the question text before evaluation.
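The injection step could look roughly like the sketch below. The exact prompt layout used by the loader in spatial_eval.py is not shown in this README, so the `Context (...)` header, the function name `build_prompt`, and the file `tiny-scene.json` are all assumptions for illustration.

```python
import json
import tempfile
from pathlib import Path

def build_prompt(record: dict, base_dir: Path) -> str:
    """Prepend the referenced scene JSON to the question, if the record has one."""
    context_file = record.get("context_file")
    if not context_file:
        return record["input"]
    scene = json.loads((base_dir / context_file).read_text(encoding="utf-8"))
    label = record.get("context_label", context_file)
    # Hypothetical layout: labelled JSON block first, then the question.
    return f"Context ({label}):\n{json.dumps(scene, indent=2)}\n\n{record['input']}"

# Demonstration with a throwaway scene file instead of the bundled exports.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "tiny-scene.json").write_text(
        json.dumps({"name": "Desk", "position": [0, 0, 0]}), encoding="utf-8"
    )
    record = {
        "input": "Where should the chair go?",
        "context_file": "tiny-scene.json",
        "context_label": "tiny scene",
    }
    prompt = build_prompt(record, base)

print(prompt)
```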

This enables tasks such as:

Choosing placements based on an existing scene
Reasoning about parent-child relationships in real scene graphs
Validating world-space UI layouts
Deriving vectors from actual joint or object positions

The scene exports currently included in this repo are sample-scene.json and sampe-ui-scene.json.

Editor

The repo includes a local dataset editor served by editor.py.

Start it with:

source .venv/bin/activate
python editor.py

Then open http://localhost:7878.

Current editor features include:

Search and category filtering
Inline editing of question text, target JSON, category, and subcategory
Add, update, and delete operations against v1.json
Context file selection with live JSON preview
Deep-linking to a question with ?q=<number>
Keyboard save with Cmd+S / Ctrl+S

The editor is meant for maintaining the dataset locally. It is not intended as a production web app.

Static viewer

index.html is a static HTML viewer for browsing benchmark questions and reference answers. It is useful for quick inspection, review, or printing.

Open it directly in a browser, for example on macOS:

open index.html

or just double-click the file in your file browser.

Extending the benchmark

If you add new questions, try to preserve the benchmark's current design principles:

Questions should have concrete, machine-checkable answers
Spatial reasoning should be the core difficulty, not trivia or vague preference
Prompt wording should be explicit about coordinate systems, handedness, axes, and conventions
Tasks should stay grounded in realistic agent or scene-manipulation scenarios
Validators should be used when multiple answers are legitimately acceptable

When adding context-backed questions:

Keep context files in-repo
Reference them with context_file
Prefer small, stable exports over huge project folders
Use context_label to make prompt injection readable

When adding DSL questions:

Keep the required operation set minimal but meaningful
Include multiple test cases
Make the task require procedure synthesis, not just arithmetic

Benchmark outputs

The Inspect scorer reports:

Accuracy
Bootstrap standard error
Per-sample explanations that include category and subcategory labels

If you browse the results with inspect view, those labels make it easy to inspect failure modes by question family.
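For intuition about the second metric, a bootstrap standard error can be estimated by resampling the per-question scores with replacement and measuring how much the mean varies. The sketch below is a generic illustration, not Inspect's implementation; the function name, the number of resamples, and the example score split are assumptions.

```python
import random
import statistics

def bootstrap_stderr(scores, num_samples=1000, seed=0):
    """Standard deviation of the mean across bootstrap resamples of the scores."""
    rng = random.Random(seed)
    n = len(scores)
    means = []
    for _ in range(num_samples):
        # Draw n scores with replacement and record the resampled accuracy.
        resample = [scores[rng.randrange(n)] for _ in range(n)]
        means.append(sum(resample) / n)
    return statistics.stdev(means)

# Hypothetical run: 60 of 80 questions correct.
scores = [1.0] * 60 + [0.0] * 20
accuracy = sum(scores) / len(scores)
stderr = bootstrap_stderr(scores)
print(accuracy, round(stderr, 3))
```

With 80 questions, differences of a few percentage points between models can fall within this error, which is worth keeping in mind when comparing results.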

Notes

The benchmark currently ships as v1.json
The scorer is intentionally strict about final answer formatting
Historical run notes live in results.md, but those are not part of the benchmark definition

If you publish results from this repo, it is a good idea to report the exact commit hash, dataset version, model string, and any prompt or system-message modifications you made.
