Spatial Benchmark is a benchmark for measuring how well language models reason about 3D space.
It focuses on the kinds of spatial tasks that show up in games, XR, simulation, scene generation, UI placement, and agent tooling: coordinates, transforms, hierarchy effects, visibility, collision-style constraints, and vector math. The benchmark is built on Inspect AI and packaged as a single Inspect task with a custom typed scorer.
This repository is intended to be publishable and reusable. It includes:
- A benchmark dataset with 80 questions across 7 categories
- A custom evaluator that supports multiple spatial answer formats
- Context-backed questions that inject real scene JSON into the prompt
- An executable JSON DSL for vector-method tasks
- A browser-based editor for maintaining the dataset
- Small helper scripts for running evaluations and browsing results
Many model benchmarks test language, coding, or generic reasoning, but spatial reasoning is usually underrepresented or mixed into broader suites. This benchmark is designed to isolate that skill and make it inspectable.
The goal is not to test whether a model "sounds spatial". The goal is to test whether it can return correct, machine-checkable answers for concrete spatial problems.
- 80 benchmark questions in v1.json
- 7 top-level categories
- 10 questions that use bundled scene JSON as runtime context
- 11 questions that require returning an executable JSON DSL program instead of a computed answer
- Built as an Inspect AI task in spatial_eval.py
- Editable through a local browser UI served by editor.py
| Category | Questions | What it covers |
|---|---|---|
| Coordinate & Geometric Math | 10 | Scene placement, interpolation, scale, proportion, compositional placement |
| Transformations & Perspective | 13 | Local vs. world space, coordinate systems, egocentric placement, Euler reasoning |
| Constraint Satisfaction | 10 | Packing, layout constraints, non-overlap, geometric feasibility |
| Spatial Arrangement & Layout | 13 | World-space arrangement and world-space UI layout |
| Relational & Topological | 11 | Occlusion, frustum culling, directional relations |
| Hierarchical Structuring | 11 | Parent-child semantics, destruction effects, pivot placement |
| Linear Algebra & Vector Methods | 12 | Dot/cross products, reflection, projection, steering, signed angles |
Some questions are pure numeric geometry. Others use exported scene graphs from sample-scene.json and sampe-ui-scene.json. The benchmark also includes tasks where the model must synthesize a small JSON program that is executed and tested by the scorer.
Coordinate geometry
Write the coordinates for 12 objects evenly spaced on the upper semicircle of radius 10 centered at the origin on the XZ plane. Use Y=0 for every object, include both endpoints at (-10, 0, 0) and (10, 0, 0), and list the coordinates in order from left endpoint to right endpoint.
ANSWER: [[-10.0, 0, 0], [-9.595, 0, 2.817], ..., [10.0, 0, 0]]
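The reference coordinates for this question can be reproduced with a few lines of Python. This is a minimal sketch, not code from the repository: the function name `semicircle_points` is hypothetical, and it assumes "upper" means the half of the circle with non-negative Z, which matches the reference answer.

```python
import math

def semicircle_points(count: int, radius: float) -> list[list[float]]:
    """Evenly space `count` points on the upper XZ semicircle, left to right."""
    points = []
    for i in range(count):
        # Sweep the angle from 180 degrees (left endpoint) down to 0 (right endpoint).
        angle = math.pi * (1 - i / (count - 1))
        x = radius * math.cos(angle)
        z = radius * math.sin(angle)
        # Round to 3 decimals; adding 0.0 turns a possible -0.0 into 0.0.
        points.append([round(x, 3) + 0.0, 0.0, round(z, 3) + 0.0])
    return points

points = semicircle_points(12, 10.0)
print(points[0], points[1], points[-1])
```

With 12 points the angular step is 180/11 degrees, so the second point sits at roughly 163.6 degrees, which gives the (-9.595, 0, 2.817) entry in the reference answer.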
Context-backed scene question
Where would be the best place to put a chair for someone using the Desk? Give your answer as [x, y, z].
This question is paired with sample-scene.json, which is injected into the prompt at evaluation time.
ANSWER: [0, 0, -100]
Layout validation
Please arrange the UI elements so that every movable element is fully visible and no two movable elements overlap. You may change only the local positions of Searchbar, Tile1, Tile2, and Tile3. Do not move or rotate Backplate. Return only the objects that need to change.
ANSWER: {
"Searchbar": { "position": { "x": 0, "y": 22.091691970825195, "z": 1 } },
"Tile3": { "position": { "x": 25, "y": -5.365461349487305, "z": 3 } }
}
Executable vector-method task
An AI guard in a stealth game should spot the player only if the player is inside the guard's vision cone. Inputs: eye_pos, guard_forward, player_pos, half_fov_deg. Return a valid JSON DSL program, not a computed answer.
ANSWER: {
"return": "visible",
"steps": [
{"out":"to_player","op":"sub","args":["player_pos","eye_pos"]},
{"out":"dir_n","op":"normalize","args":["to_player"]},
{"out":"guard_forward_n","op":"normalize","args":["guard_forward"]},
{"out":"score","op":"dot","args":["guard_forward_n","dir_n"]},
{"out":"limit","op":"cos_deg","args":["half_fov_deg"]},
{"out":"visible","op":"gte","args":["score","limit"]}
]
}
spatial-benchmark/
├── spatial_eval.py # Inspect AI task, loader, scorer, answer validators
├── v1.json # Benchmark dataset
├── editor.py # Local dataset editor server
├── editor.html # Browser UI for editing questions
├── index.html # Static viewer for browsing questions and answers
├── sample-scene.json # Bundled scene export used by context-backed questions
├── sampe-ui-scene.json # Bundled UI scene export used by context-backed questions
├── run_benchmark.sh # Canonical benchmark runner with logs + run metadata
├── results.md # Historical notes from earlier runs
└── requirements.txt # Core Python dependencies
git clone https://github.com/specs-devs/spatial-benchmark.git
cd spatial-benchmark
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
If you want to use the local editor UI, also install:
pip install fastapi uvicorn
Set the API keys for whichever model providers you want to use. For example:
export ANTHROPIC_API_KEY=...
export OPENAI_API_KEY=...
export GOOGLE_API_KEY=...
Depending on the model strings you use, you may also need credentials for other providers referenced in the helper scripts.
inspect eval spatial_eval.py@spatial_benchmark --model openai/gpt-5.4
You can also run multiple models in one command:
inspect eval spatial_eval.py@spatial_benchmark \
--model anthropic/claude-sonnet-4-6 \
openai/gpt-5.4 \
google/gemini-3-flash-preview
Then open the Inspect results UI:
inspect view
One shell helper is included:
run_benchmark.sh is the canonical runner. It loads .env, runs models one at a time, stores raw stdout logs plus Inspect logs under results/<timestamp>/, and writes run metadata and a per-model status summary for later analysis.
These scripts are convenience wrappers, not part of the benchmark definition itself. Feel free to edit them to match your own provider setup and preferred model list.
The dataset lives in v1.json. Each entry is a JSON object with these fields:
| Field | Required | Description |
|---|---|---|
| input | Yes | The benchmark question shown to the model |
| target | Yes | The canonical typed target or validator payload |
| category | Yes | Top-level category |
| subcategory | Yes | More specific grouping |
| context_file | No | Relative path to a bundled JSON file to inject into the prompt |
| context_label | No | Human-readable label for the injected context |
| validation | No | Optional alternate validator target used instead of raw target |
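The field table above translates directly into a small record check. The sketch below is illustrative only: `check_record` is a hypothetical helper, not a function from spatial_eval.py, and flagging unknown fields is an assumption about what a lint pass might want rather than documented loader behavior.

```python
REQUIRED_FIELDS = ("input", "target", "category", "subcategory")
OPTIONAL_FIELDS = ("context_file", "context_label", "validation")

def check_record(record: dict) -> list[str]:
    """Return human-readable problems; an empty list means the record looks well-formed."""
    problems = [
        f"missing required field: {name}"
        for name in REQUIRED_FIELDS
        if name not in record
    ]
    # Assumption: fields outside the documented table are reported, not rejected.
    known = set(REQUIRED_FIELDS) | set(OPTIONAL_FIELDS)
    problems += [
        f"unknown field: {name}"
        for name in sorted(record)
        if name not in known
    ]
    return problems

record = {"input": "q", "target": 1, "category": "c", "subcategory": "s"}
print(check_record(record))
```

Running a check like this before committing dataset edits catches misspelled keys such as `sub_category` early.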
Example question records:
{
"input": "Where would be the best place to put a chair for someone using the `Desk`? Give your answer as [x, y, z].",
"target": [0, 0, -100],
"validation": {
"__type__": "vector3_constraints",
"x": {"value": 0},
"y": {"value": 0},
"z": {"min": -130, "max": -70}
},
"category": "Coordinate & Geometric Math",
"subcategory": "Scene Placement",
"context_file": "sample-scene.json",
"context_label": "`sample-scene` sample scene"
}
{
"input": "An AI guard in a stealth game should spot the player only if the player is inside the guard's vision cone. Inputs: eye_pos, guard_forward, player_pos, half_fov_deg. Return a valid JSON DSL program, not a computed answer.",
"target": {
"__type__": "dsl",
"required_ops": ["sub", "normalize", "dot", "cos_deg", "gte"],
"tests": [
{
"inputs": {
"eye_pos": [0, 0, 0],
"guard_forward": [0, 0, -1],
"player_pos": [0, 0, -5],
"half_fov_deg": 30
},
"expected": true
}
]
},
"category": "Linear Algebra & Vector Methods",
"subcategory": "Dot Product & Angles"
}
Every sample is evaluated with a shared system prompt defined in spatial_eval.py. Models are instructed to reason normally but end with a final line of the form:
ANSWER: <value>
The scorer extracts the last ANSWER: block from the completion and evaluates only that answer payload.
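Extracting the last answer block can be sketched as follows. This is not the scorer's actual code: `extract_answer` is a hypothetical name, and both the regular expression and the fallback to a plain string when the payload is not valid JSON are assumptions.

```python
import json
import re

# Match an ANSWER: marker at the start of a line.
ANSWER_RE = re.compile(r"^ANSWER:\s*", re.MULTILINE)

def extract_answer(completion: str):
    """Return the payload after the last ANSWER: marker, parsed as JSON when possible."""
    matches = list(ANSWER_RE.finditer(completion))
    if not matches:
        return None
    # Everything after the last marker is the payload; it may span several lines.
    raw = completion[matches[-1].end():].strip()
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return raw  # assumption: non-JSON payloads are treated as plain strings

text = "First guess.\nANSWER: [1, 2, 3]\nOn reflection:\nANSWER: [0, 0, -100]"
print(extract_answer(text))
```

Taking the last marker rather than the first lets a model revise an earlier guess while reasoning, as long as the final line carries the answer it stands by.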
Supported answer forms include:
- Numbers
- Booleans
- Strings
- 3D coordinates
- Lists of coordinates
- JSON objects
- Executable JSON DSL programs
The benchmark uses a typed scorer rather than plain string matching. Floating-point comparisons use an absolute tolerance of 0.01, so tiny engine float residue like -1.19209e-07 is treated as zero.
| Type | Matching rule |
|---|---|
| bool | Accepts boolean JSON or equivalent true / false string answers |
| number | Absolute numeric tolerance |
| string | Case-insensitive match after normalization |
| list[float] | Element-wise numeric comparison |
| list[list[float]] | Nested element-wise numeric comparison |
| dict | Recursive key and value comparison, case-insensitive on keys |
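The matching rules in this table can be approximated with one recursive function. This is a rough sketch under stated assumptions, not the repository's scorer: `matches` is a hypothetical name, and string normalization is assumed to mean trimming whitespace and lowercasing.

```python
TOLERANCE = 0.01  # absolute tolerance used for floating-point comparisons

def matches(answer, target) -> bool:
    """Compare an answer to a typed target using the rules from the table."""
    # Check bool before number, because bool is a subclass of int in Python.
    if isinstance(target, bool):
        if isinstance(answer, str):
            answer = {"true": True, "false": False}.get(answer.strip().lower())
        return answer is target
    if isinstance(target, (int, float)):
        return (
            isinstance(answer, (int, float))
            and not isinstance(answer, bool)
            and abs(answer - target) <= TOLERANCE
        )
    if isinstance(target, str):
        return isinstance(answer, str) and answer.strip().lower() == target.strip().lower()
    if isinstance(target, list):
        return (
            isinstance(answer, list)
            and len(answer) == len(target)
            and all(matches(a, t) for a, t in zip(answer, target))
        )
    if isinstance(target, dict):
        if not isinstance(answer, dict):
            return False
        # Keys are compared case-insensitively; values recurse.
        lowered = {str(k).lower(): v for k, v in answer.items()}
        return len(lowered) == len(target) and all(
            k.lower() in lowered and matches(lowered[k.lower()], v)
            for k, v in target.items()
        )
    return False

print(matches(-1.19209e-07, 0))
```

Recursing on lists and dicts means the same tolerance applies at every depth, so a nested position object is held to the same standard as a bare coordinate.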
The scorer also supports spatially aware validators:
| Validator | Purpose |
|---|---|
| vector3_constraints | Accept any 3D coordinate satisfying exact-axis or min/max constraints |
| degrees_between | Accept a 3D vector if it falls within an angular threshold of an expected direction |
| unordered_coordinate_set | Accept a set of coordinates regardless of output order |
| wall_frame_layout | Validate frame placements against wall membership, bounds, and non-overlap |
| ui_layout_changes | Validate world-space UI edits by projecting into camera space and checking visibility/non-overlap |
| dsl | Parse, statically inspect, execute, and test a JSON method program |
Some questions use validation as an alternate scoring target. In those cases the raw target remains useful as a reference answer, but the actual score is decided by the validator payload.
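To make the validator idea concrete, here is a minimal sketch of two of them. The `vector3_constraints` payload shape (`value`, `min`, `max` per axis) follows the Desk example record in this README. Everything else is an assumption: both function names are hypothetical, the payload fields for `degrees_between` are not documented here, so the expected direction and threshold are passed as plain arguments, and the 0.01 tolerance is assumed to apply only to exact-axis values.

```python
import math

TOLERANCE = 0.01  # assumed to apply to exact-axis values only

def satisfies_vector3_constraints(answer: list[float], payload: dict) -> bool:
    """Check each axis against an exact value or a min/max range."""
    if len(answer) != 3:
        return False
    for axis, value in zip("xyz", answer):
        rule = payload.get(axis, {})
        if "value" in rule and abs(value - rule["value"]) > TOLERANCE:
            return False
        if "min" in rule and value < rule["min"]:
            return False
        if "max" in rule and value > rule["max"]:
            return False
    return True

def within_degrees(answer: list[float], expected: list[float], max_degrees: float) -> bool:
    """Accept a vector whose angle to the expected direction is small enough."""
    dot = sum(a * b for a, b in zip(answer, expected))
    norms = math.hypot(*answer) * math.hypot(*expected)
    if norms == 0:
        return False  # a zero vector has no direction
    # Clamp to guard acos against floating-point overshoot.
    cosine = max(-1.0, min(1.0, dot / norms))
    return math.degrees(math.acos(cosine)) <= max_degrees

desk = {"x": {"value": 0}, "y": {"value": 0}, "z": {"min": -130, "max": -70}}
print(satisfies_vector3_constraints([0, 0, -100], desk))
```

With the Desk payload, any chair position on the x = 0, y = 0 line with z between -130 and -70 passes, which is why the raw target [0, 0, -100] serves only as a reference answer.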
The DSL questions in this benchmark are designed to distinguish between:
- Models that can compute one-off numeric answers
- Models that can express reusable spatial procedures
Each DSL answer must be a single JSON object with:
- steps: an array of step objects
- return: the final variable name, literal, or nested expression to return
Each step has the form:
{"out":"tmp_name","op":"op_name","args":[...]}
The evaluator currently supports these operations:
add, sub, mul, div, dot, cross, norm, length, normalize,
sin_deg, cos_deg, acos_deg, atan2_deg, abs, vec3,
gte, lte, gt, lt, eq
Scoring for DSL questions has two parts:
- The returned program must contain the required operations for that task.
- The program must execute correctly against the bundled test cases.
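The two-part scoring above can be illustrated with a tiny interpreter. This is a sketch under assumptions, not the evaluator in spatial_eval.py: `run_program` and `score_program` are hypothetical names, only the five operations needed by the vision-cone example are implemented, string arguments are treated as variable names, and `return` is limited to a variable name or literal (nested expressions are left out).

```python
import math

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _normalize(v):
    length = math.sqrt(_dot(v, v))
    return [x / length for x in v] if length else list(v)

# Subset of the documented operations, enough for the vision-cone task.
OPS = {
    "sub": lambda a, b: [x - y for x, y in zip(a, b)],
    "dot": _dot,
    "normalize": _normalize,
    "cos_deg": lambda degrees: math.cos(math.radians(degrees)),
    "gte": lambda a, b: a >= b,
}

def run_program(program: dict, inputs: dict):
    """Execute steps in order, storing each result under its `out` name."""
    env = dict(inputs)

    def resolve(arg):
        # Assumption: strings name variables; anything else is a literal.
        return env[arg] if isinstance(arg, str) else arg

    for step in program["steps"]:
        env[step["out"]] = OPS[step["op"]](*(resolve(a) for a in step["args"]))
    return resolve(program["return"])

def score_program(program: dict, target: dict) -> bool:
    """Part 1: required ops are present. Part 2: every bundled test passes."""
    used = {step["op"] for step in program["steps"]}
    if not set(target["required_ops"]) <= used:
        return False
    return all(run_program(program, t["inputs"]) == t["expected"] for t in target["tests"])

program = {
    "return": "visible",
    "steps": [
        {"out": "to_player", "op": "sub", "args": ["player_pos", "eye_pos"]},
        {"out": "dir_n", "op": "normalize", "args": ["to_player"]},
        {"out": "guard_forward_n", "op": "normalize", "args": ["guard_forward"]},
        {"out": "score", "op": "dot", "args": ["guard_forward_n", "dir_n"]},
        {"out": "limit", "op": "cos_deg", "args": ["half_fov_deg"]},
        {"out": "visible", "op": "gte", "args": ["score", "limit"]},
    ],
}
target = {
    "required_ops": ["sub", "normalize", "dot", "cos_deg", "gte"],
    "tests": [
        {
            "inputs": {"eye_pos": [0, 0, 0], "guard_forward": [0, 0, -1],
                       "player_pos": [0, 0, -5], "half_fov_deg": 30},
            "expected": True,
        }
    ],
}
print(score_program(program, target))
```

Checking required operations separately from execution means a program that hard-codes the expected output fails even if it happens to pass the test cases. A fuller scorer would presumably compare numeric results with a tolerance rather than `==`.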
10 questions use bundled scene exports as prompt context. For those questions, the loader reads the referenced JSON file and prepends it to the question text before evaluation.
This enables tasks such as:
- Choosing placements based on an existing scene
- Reasoning about parent-child relationships in real scene graphs
- Validating world-space UI layouts
- Deriving vectors from actual joint or object positions
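Context injection of this kind can be sketched in a few lines. The exact prompt layout used by the loader is not documented here, so the "Context (label):" header and the pretty-printed JSON are assumptions, and `build_prompt` is a hypothetical name.

```python
import json
from pathlib import Path

def build_prompt(record: dict, base_dir: Path) -> str:
    """Prepend the referenced scene JSON to the question text, if any."""
    context_file = record.get("context_file")
    if not context_file:
        return record["input"]
    # context_file is a path relative to the repository root.
    scene = json.loads((base_dir / context_file).read_text(encoding="utf-8"))
    label = record.get("context_label", context_file)
    # Assumed layout: labeled context first, then the question.
    return f"Context ({label}):\n{json.dumps(scene, indent=2)}\n\n{record['input']}"

print(build_prompt({"input": "What is 2 + 2?"}, Path(".")))
```

Parsing and re-serializing the file, rather than pasting raw text, has the side effect of failing early on malformed scene exports.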
The scene exports currently included in this repo are:
- sample-scene.json
- sampe-ui-scene.json
The repo includes a local dataset editor served by editor.py.
Start it with:
source .venv/bin/activate
python editor.py
Then open http://localhost:7878.
Current editor features include:
- Search and category filtering
- Inline editing of question text, target JSON, category, and subcategory
- Add, update, and delete operations against v1.json
- Context file selection with live JSON preview
- Deep-linking to a question with ?q=<number>
- Keyboard save with Cmd+S / Ctrl+S
The editor is meant for maintaining the dataset locally. It is not intended as a production web app.
index.html is a static HTML viewer for browsing benchmark questions and reference answers. It is useful for quick inspection, review, or printing.
Open it directly in a browser:
open index.html
or just double-click the file in your file browser.
If you add new questions, try to preserve the benchmark's current design principles:
- Questions should have concrete, machine-checkable answers
- Spatial reasoning should be the core difficulty, not trivia or vague preference
- Prompt wording should be explicit about coordinate systems, handedness, axes, and conventions
- Tasks should stay grounded in realistic agent or scene-manipulation scenarios
- Validators should be used when multiple answers are legitimately acceptable
When adding context-backed questions:
- Keep context files in-repo
- Reference them with context_file
- Prefer small, stable exports over huge project folders
- Use context_label to make prompt injection readable
When adding DSL questions:
- Keep the required operation set minimal but meaningful
- Include multiple test cases
- Make the task require procedure synthesis, not just arithmetic
The Inspect scorer reports:
- Accuracy
- Bootstrap standard error
- Per-sample explanations that include category and subcategory labels
If you browse the logs with inspect view, those labels make it easy to group failures by question family.
- The benchmark currently ships as v1.json
- The scorer is intentionally strict about final answer formatting
- Historical run notes live in results.md, but those are not part of the benchmark definition
If you publish results from this repo, it is a good idea to report the exact commit hash, dataset version, model string, and any prompt or system-message modifications you made.