Spatial Benchmark

Spatial Benchmark is a benchmark for measuring how well language models reason about 3D space.

It focuses on the kinds of spatial tasks that show up in games, XR, simulation, scene generation, UI placement, and agent tooling: coordinates, transforms, hierarchy effects, visibility, collision-style constraints, and vector math. The benchmark is built on Inspect AI and packaged as a single Inspect task with a custom typed scorer.

This repository is intended to be publishable and reusable. It includes:

  • A benchmark dataset with 80 questions across 7 categories
  • A custom evaluator that supports multiple spatial answer formats
  • Context-backed questions that inject real scene JSON into the prompt
  • An executable JSON DSL for vector-method tasks
  • A browser-based editor for maintaining the dataset
  • Small helper scripts for running evaluations and browsing results

Why this benchmark exists

Many model benchmarks test language, coding, or generic reasoning, but spatial reasoning is usually underrepresented or mixed into broader suites. This benchmark is designed to isolate that skill and make it inspectable.

The goal is not to test whether a model "sounds spatial". The goal is to test whether it can return correct, machine-checkable answers for concrete spatial problems.

At a glance

  • 80 benchmark questions in v1.json
  • 7 top-level categories
  • 10 questions that use bundled scene JSON as runtime context
  • 11 questions that require returning an executable JSON DSL program instead of a computed answer
  • Built as an Inspect AI task in spatial_eval.py
  • Editable through a local browser UI served by editor.py

What the benchmark covers

| Category | Questions | What it covers |
| --- | --- | --- |
| Coordinate & Geometric Math | 10 | Scene placement, interpolation, scale, proportion, compositional placement |
| Transformations & Perspective | 13 | Local vs. world space, coordinate systems, egocentric placement, Euler reasoning |
| Constraint Satisfaction | 10 | Packing, layout constraints, non-overlap, geometric feasibility |
| Spatial Arrangement & Layout | 13 | World-space arrangement and world-space UI layout |
| Relational & Topological | 11 | Occlusion, frustum culling, directional relations |
| Hierarchical Structuring | 11 | Parent-child semantics, destruction effects, pivot placement |
| Linear Algebra & Vector Methods | 12 | Dot/cross products, reflection, projection, steering, signed angles |

Some questions are pure numeric geometry. Others use exported scene graphs from sample-scene.json and sampe-ui-scene.json. The benchmark also includes tasks where the model must synthesize a small JSON program that is executed and tested by the scorer.

Example questions

Coordinate geometry

Write the coordinates for 12 objects evenly spaced on the upper semicircle of radius 10 centered at the origin on the XZ plane. Use Y=0 for every object, include both endpoints at (-10, 0, 0) and (10, 0, 0), and list the coordinates in order from left endpoint to right endpoint.

ANSWER: [[-10.0, 0, 0], [-9.595, 0, 2.817], ..., [10.0, 0, 0]]
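
The reference answer above can be reproduced with a few lines of Python. This is a minimal sketch, not code from the repository: the function name semicircle_points is hypothetical, and it assumes "upper" means the +Z half of the XZ plane, which is consistent with the second reference point [-9.595, 0, 2.817].

```python
import math

def semicircle_points(count=12, radius=10.0):
    """Evenly spaced points on the +Z semicircle of the XZ plane, left to right."""
    points = []
    for i in range(count):
        # Sweep from 180 degrees (left endpoint) down to 0 degrees (right endpoint).
        theta = math.pi * (1 - i / (count - 1))
        x = radius * math.cos(theta)
        z = radius * math.sin(theta)
        # Adding 0.0 turns a rounded -0.0 into 0.0.
        points.append([round(x, 3) + 0.0, 0, round(z, 3) + 0.0])
    return points

points = semicircle_points()
print(points[0], points[1], points[-1])
```

Including both endpoints means 12 points span 11 equal arcs, so the angular step is 180/11 degrees rather than 180/12.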

Context-backed scene question

Where would be the best place to put a chair for someone using the Desk? Give your answer as [x, y, z].

This question is paired with sample-scene.json, which is injected into the prompt at evaluation time.

ANSWER: [0, 0, -100]

Layout validation

Please arrange the UI elements so that every movable element is fully visible and no two movable elements overlap. You may change only the local positions of Searchbar, Tile1, Tile2, and Tile3. Do not move or rotate Backplate. Return only the objects that need to change.

ANSWER: {
  "Searchbar": { "position": { "x": 0, "y": 22.091691970825195, "z": 1 } },
  "Tile3": { "position": { "x": 25, "y": -5.365461349487305, "z": 3 } }
}

Executable vector-method task

An AI guard in a stealth game should spot the player only if the player is inside the guard's vision cone. Inputs: eye_pos, guard_forward, player_pos, half_fov_deg. Return a valid JSON DSL program, not a computed answer.

ANSWER: {
  "return": "visible",
  "steps": [
    {"out":"to_player","op":"sub","args":["player_pos","eye_pos"]},
    {"out":"dir_n","op":"normalize","args":["to_player"]},
    {"out":"guard_forward_n","op":"normalize","args":["guard_forward"]},
    {"out":"score","op":"dot","args":["guard_forward_n","dir_n"]},
    {"out":"limit","op":"cos_deg","args":["half_fov_deg"]},
    {"out":"visible","op":"gte","args":["score","limit"]}
  ]
}
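
For comparison, the same check can be written directly in Python. This is only an illustrative sketch: the function names normalize and in_vision_cone are hypothetical and are not taken from spatial_eval.py. Each line mirrors one step of the DSL program above.

```python
import math

def normalize(v):
    """Scale a vector to unit length (raises ZeroDivisionError for a zero vector)."""
    length = math.sqrt(sum(c * c for c in v))
    return [c / length for c in v]

def in_vision_cone(eye_pos, guard_forward, player_pos, half_fov_deg):
    to_player = [p - e for p, e in zip(player_pos, eye_pos)]   # sub
    dir_n = normalize(to_player)                               # normalize
    forward_n = normalize(guard_forward)                       # normalize
    score = sum(a * b for a, b in zip(forward_n, dir_n))       # dot
    limit = math.cos(math.radians(half_fov_deg))               # cos_deg
    return score >= limit                                      # gte

print(in_vision_cone([0, 0, 0], [0, 0, -1], [0, 0, -5], 30))
```

Comparing the dot product against the cosine of the half angle avoids an explicit acos call: the player is inside the cone exactly when the cosine of the angle to the player is at least the cosine of the half field of view.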

Repository layout

spatial-benchmark/
├── spatial_eval.py          # Inspect AI task, loader, scorer, answer validators
├── v1.json                  # Benchmark dataset
├── editor.py                # Local dataset editor server
├── editor.html              # Browser UI for editing questions
├── index.html               # Static viewer for browsing questions and answers
├── sample-scene.json        # Bundled scene export used by context-backed questions
├── sampe-ui-scene.json      # Bundled UI scene export used by context-backed questions
├── run_benchmark.sh         # Canonical benchmark runner with logs + run metadata
├── results.md               # Historical notes from earlier runs
└── requirements.txt         # Core Python dependencies

Quickstart

1. Create an environment

git clone https://github.com/specs-devs/spatial-benchmark.git
cd spatial-benchmark

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

If you want to use the local editor UI, also install:

pip install fastapi uvicorn

2. Set provider credentials

Set the API keys for whichever model providers you want to use. For example:

export ANTHROPIC_API_KEY=...
export OPENAI_API_KEY=...
export GOOGLE_API_KEY=...

Depending on the model strings you use, you may also need credentials for other providers referenced in the helper scripts.

3. Run the benchmark

inspect eval spatial_eval.py@spatial_benchmark --model openai/gpt-5.4

You can also run multiple models in one command:

inspect eval spatial_eval.py@spatial_benchmark \
  --model anthropic/claude-sonnet-4-6 \
          openai/gpt-5.4 \
          google/gemini-3-flash-preview

Then open the Inspect results UI:

inspect view

Helper scripts

One shell helper is included:

  • run_benchmark.sh is the canonical runner. It loads .env, runs models one at a time, stores raw stdout logs plus Inspect logs under results/<timestamp>/, and writes run metadata and a per-model status summary for later analysis.

This script is a convenience wrapper, not part of the benchmark definition itself. Feel free to edit it to match your own provider setup and preferred model list.

Dataset format

The dataset lives in v1.json. Each entry is a JSON object with these fields:

| Field | Required | Description |
| --- | --- | --- |
| input | Yes | The benchmark question shown to the model |
| target | Yes | The canonical typed target or validator payload |
| category | Yes | Top-level category |
| subcategory | Yes | More specific grouping |
| context_file | No | Relative path to a bundled JSON file to inject into the prompt |
| context_label | No | Human-readable label for the injected context |
| validation | No | Optional alternate validator target used instead of raw target |

Example question records:

{
  "input": "Where would be the best place to put a chair for someone using the `Desk`? Give your answer as [x, y, z].",
  "target": [0, 0, -100],
  "validation": {
    "__type__": "vector3_constraints",
    "x": {"value": 0},
    "y": {"value": 0},
    "z": {"min": -130, "max": -70}
  },
  "category": "Coordinate & Geometric Math",
  "subcategory": "Scene Placement",
  "context_file": "sample-scene.json",
  "context_label": "`sample-scene` sample scene"
}
{
  "input": "An AI guard in a stealth game should spot the player only if the player is inside the guard's vision cone. Inputs: eye_pos, guard_forward, player_pos, half_fov_deg. Return a valid JSON DSL program, not a computed answer.",
  "target": {
    "__type__": "dsl",
    "required_ops": ["sub", "normalize", "dot", "cos_deg", "gte"],
    "tests": [
      {
        "inputs": {
          "eye_pos": [0, 0, 0],
          "guard_forward": [0, 0, -1],
          "player_pos": [0, 0, -5],
          "half_fov_deg": 30
        },
        "expected": true
      }
    ]
  },
  "category": "Linear Algebra & Vector Methods",
  "subcategory": "Dot Product & Angles"
}
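
When editing the dataset by hand, a small lint pass over each record can catch missing fields early. The sketch below is an assumption-laden helper, not part of the repository: check_record is a hypothetical name, and flagging unknown fields is a stricter policy than the real loader may apply.

```python
REQUIRED_FIELDS = ("input", "target", "category", "subcategory")
OPTIONAL_FIELDS = ("context_file", "context_label", "validation")

def check_record(record):
    """Return a list of problems found in one dataset record (empty if none)."""
    problems = [
        f"missing required field: {name}"
        for name in REQUIRED_FIELDS
        if name not in record
    ]
    allowed = set(REQUIRED_FIELDS) | set(OPTIONAL_FIELDS)
    # Assumption: fields outside the documented table are treated as mistakes.
    problems += [
        f"unknown field: {name}" for name in sorted(record) if name not in allowed
    ]
    return problems

record = {
    "input": "Where would be the best place to put a chair?",
    "target": [0, 0, -100],
    "category": "Coordinate & Geometric Math",
    "subcategory": "Scene Placement",
    "context_file": "sample-scene.json",
}
print(check_record(record))
```

To lint the whole dataset, the same function could be applied to every entry loaded from v1.json, assuming the file's top level is a JSON array of records.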

Prompt format and answer protocol

Every sample is evaluated with a shared system prompt defined in spatial_eval.py. Models are instructed to reason normally but end with a final line of the form:

ANSWER: <value>

The scorer extracts the last ANSWER: block from the completion and evaluates only that answer payload.
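
The extraction step can be pictured roughly as follows. This is a sketch under stated assumptions, not the scorer's actual code: extract_answer and parse_answer are hypothetical names, and falling back to a raw string when JSON parsing fails is a guess about how string answers are handled.

```python
import json

MARKER = "ANSWER:"

def extract_answer(completion):
    """Return the text after the last ANSWER: marker, or None if there is none."""
    index = completion.rfind(MARKER)
    if index == -1:
        return None
    return completion[index + len(MARKER):].strip()

def parse_answer(payload):
    """Try to read the payload as JSON; otherwise keep it as a plain string."""
    try:
        return json.loads(payload)
    except json.JSONDecodeError:
        return payload

completion = (
    "First guess.\nANSWER: [1, 2, 3]\n"
    "On reflection the chair goes behind the desk.\nANSWER: [0, 0, -100]"
)
print(parse_answer(extract_answer(completion)))
```

Taking the last marker rather than the first lets a model revise an earlier guess, and everything after that marker is kept so multi-line JSON objects survive intact.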

Supported answer forms include:

  • Numbers
  • Booleans
  • Strings
  • 3D coordinates
  • Lists of coordinates
  • JSON objects
  • Executable JSON DSL programs

Scoring and answer types

The benchmark uses a typed scorer rather than plain string matching. Floating-point comparisons use an absolute tolerance of 0.01, so tiny engine float residue like -1.19209e-07 is treated as zero.

Basic answer types

| Type | Matching rule |
| --- | --- |
| bool | Accepts boolean JSON or equivalent true / false string answers |
| number | Absolute numeric tolerance |
| string | Case-insensitive match after normalization |
| list[float] | Element-wise numeric comparison |
| list[list[float]] | Nested element-wise numeric comparison |
| dict | Recursive key and value comparison, case-insensitive on keys |
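
The matching rules in this table could be sketched as one recursive function. This is an illustration only: matches is a hypothetical name, and the string normalization shown here (trim plus lowercase) is an assumption, since the README does not spell out what normalization the scorer applies.

```python
import math

TOLERANCE = 0.01  # absolute tolerance stated in this README

def matches(answer, target):
    # bool is checked before number because bool is a subclass of int in Python.
    if isinstance(target, bool):
        if isinstance(answer, str):
            answer = {"true": True, "false": False}.get(answer.strip().lower())
        return answer is target
    if isinstance(target, (int, float)):
        if isinstance(answer, bool) or not isinstance(answer, (int, float)):
            return False
        return math.isclose(answer, target, rel_tol=0.0, abs_tol=TOLERANCE)
    if isinstance(target, str):
        return isinstance(answer, str) and answer.strip().lower() == target.strip().lower()
    if isinstance(target, list):
        return (
            isinstance(answer, list)
            and len(answer) == len(target)
            and all(matches(a, t) for a, t in zip(answer, target))
        )
    if isinstance(target, dict):
        if not isinstance(answer, dict):
            return False
        lowered = {str(k).lower(): v for k, v in answer.items()}
        return len(lowered) == len(target) and all(
            k.lower() in lowered and matches(lowered[k.lower()], v)
            for k, v in target.items()
        )
    return False

print(matches(-1.19209e-07, 0), matches({"X": [0, 0, -100.004]}, {"x": [0, 0, -100]}))
```

Setting rel_tol to 0.0 keeps the comparison purely absolute, so large coordinates are not given extra slack.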

Custom validation types

The scorer also supports spatially aware validators:

| Validator | Purpose |
| --- | --- |
| vector3_constraints | Accept any 3D coordinate satisfying exact-axis or min/max constraints |
| degrees_between | Accept a 3D vector if it falls within an angular threshold of an expected direction |
| unordered_coordinate_set | Accept a set of coordinates regardless of output order |
| wall_frame_layout | Validate frame placements against wall membership, bounds, and non-overlap |
| ui_layout_changes | Validate world-space UI edits by projecting into camera space and checking visibility/non-overlap |
| dsl | Parse, statically inspect, execute, and test a JSON method program |
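
As one example of the geometry involved, a degrees_between style check reduces to the angle between two vectors. The sketch below is hypothetical throughout: the function names and the max_degrees parameter are illustrative, and the real validator's payload fields are not documented here.

```python
import math

def angle_between_deg(a, b):
    """Unsigned angle between two 3D vectors, in degrees."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("angle is undefined for a zero-length vector")
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cosine = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    return math.degrees(math.acos(cosine))

def within_degrees(answer, expected, max_degrees):
    return angle_between_deg(answer, expected) <= max_degrees

print(within_degrees([0, 0, -2], [0, 0, -1], 5))
```

Because both vectors are divided by their lengths, an answer that points the right way is accepted whatever its magnitude.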

Some questions use validation as an alternate scoring target. In those cases the raw target remains useful as a reference answer, but the actual score is decided by the validator payload.
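
For instance, the vector3_constraints payload shown in the chair example under Dataset format could be checked along these lines. This is a minimal sketch, assuming each axis entry carries either an exact value or a min/max range; the function name and the reuse of the 0.01 tolerance at the range edges are assumptions.

```python
def satisfies_vector3_constraints(answer, constraints, tolerance=0.01):
    """Check an [x, y, z] answer against per-axis value or min/max rules."""
    if not isinstance(answer, (list, tuple)) or len(answer) != 3:
        return False
    for axis, value in zip("xyz", answer):
        rule = constraints.get(axis)
        if rule is None:
            continue  # an axis without a rule is unconstrained
        if "value" in rule and abs(value - rule["value"]) > tolerance:
            return False
        if "min" in rule and value < rule["min"] - tolerance:
            return False
        if "max" in rule and value > rule["max"] + tolerance:
            return False
    return True

chair = {
    "__type__": "vector3_constraints",
    "x": {"value": 0},
    "y": {"value": 0},
    "z": {"min": -130, "max": -70},
}
print(satisfies_vector3_constraints([0, 0, -85], chair))
```

Here the reference target [0, 0, -100] passes, but so does any point on the same line between z = -130 and z = -70, which is why the validator rather than the raw target decides the score.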

The JSON DSL

The DSL questions in this benchmark are designed to distinguish between:

  • Models that can compute one-off numeric answers
  • Models that can express reusable spatial procedures

Each DSL answer must be a single JSON object with:

  • steps: an array of step objects
  • return: the final variable name, literal, or nested expression to return

Each step has the form:

{"out":"tmp_name","op":"op_name","args":[...]}

The evaluator currently supports these operations:

add, sub, mul, div, dot, cross, norm, length, normalize,
sin_deg, cos_deg, acos_deg, atan2_deg, abs, vec3,
gte, lte, gt, lt, eq

Scoring for DSL questions has two parts:

  1. The returned program must contain the required operations for that task.
  2. The program must execute correctly against the bundled test cases.
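
The two scoring parts can be illustrated with a toy interpreter. This is a sketch, not the evaluator in spatial_eval.py: it implements only the five operations needed by the vision-cone example, resolves string arguments as variable names and everything else as literals, does not handle nested expressions, and compares results with plain equality. The names OPS, run_program, and score_program are hypothetical.

```python
import math

def _norm(v):
    return math.sqrt(sum(c * c for c in v))

# Subset of the documented operations, enough for the vision-cone task.
OPS = {
    "sub": lambda a, b: [x - y for x, y in zip(a, b)],
    "dot": lambda a, b: sum(x * y for x, y in zip(a, b)),
    "normalize": lambda v: [c / _norm(v) for c in v],
    "cos_deg": lambda d: math.cos(math.radians(d)),
    "gte": lambda a, b: a >= b,
}

def resolve(arg, env):
    """Strings are variable lookups; anything else is taken as a literal."""
    return env[arg] if isinstance(arg, str) else arg

def run_program(program, inputs):
    env = dict(inputs)
    for step in program["steps"]:
        args = [resolve(a, env) for a in step["args"]]
        env[step["out"]] = OPS[step["op"]](*args)
    return resolve(program["return"], env)

def score_program(program, target):
    # Part 1: every required operation must appear in the program.
    used = {step["op"] for step in program["steps"]}
    if not set(target["required_ops"]) <= used:
        return False
    # Part 2: every bundled test case must produce the expected value.
    return all(
        run_program(program, case["inputs"]) == case["expected"]
        for case in target["tests"]
    )

program = {
    "return": "visible",
    "steps": [
        {"out": "to_player", "op": "sub", "args": ["player_pos", "eye_pos"]},
        {"out": "dir_n", "op": "normalize", "args": ["to_player"]},
        {"out": "guard_forward_n", "op": "normalize", "args": ["guard_forward"]},
        {"out": "score", "op": "dot", "args": ["guard_forward_n", "dir_n"]},
        {"out": "limit", "op": "cos_deg", "args": ["half_fov_deg"]},
        {"out": "visible", "op": "gte", "args": ["score", "limit"]},
    ],
}
target = {
    "required_ops": ["sub", "normalize", "dot", "cos_deg", "gte"],
    "tests": [{
        "inputs": {"eye_pos": [0, 0, 0], "guard_forward": [0, 0, -1],
                   "player_pos": [0, 0, -5], "half_fov_deg": 30},
        "expected": True,
    }],
}
print(score_program(program, target))
```

Checking the required operations before executing anything means a program that hard-codes the expected output fails even if it happens to pass the test cases.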

Context-backed questions

10 questions use bundled scene exports as prompt context. For those questions, the loader reads the referenced JSON file and prepends it to the question text before evaluation.

This enables tasks such as:

  • Choosing placements based on an existing scene
  • Reasoning about parent-child relationships in real scene graphs
  • Validating world-space UI layouts
  • Deriving vectors from actual joint or object positions
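
The loading step described above might look roughly like this. It is a sketch under assumptions: build_prompt is a hypothetical name, and the exact text layout used to join the label, the scene JSON, and the question is invented for illustration, since the real template lives in spatial_eval.py.

```python
import json
from pathlib import Path

def build_prompt(record, base_dir="."):
    """Prepend the referenced scene JSON to the question, if the record has one."""
    question = record["input"]
    context_file = record.get("context_file")
    if not context_file:
        return question
    path = Path(base_dir) / context_file
    scene = json.loads(path.read_text(encoding="utf-8"))
    label = record.get("context_label", context_file)
    # Assumed layout: labelled JSON block first, then the question.
    return f"Context ({label}):\n{json.dumps(scene, indent=2)}\n\n{question}"

print(build_prompt({"input": "How many axes does a 3D position have?"}))
```

Parsing and re-serializing the file, rather than pasting its raw bytes, doubles as a check that the bundled export is valid JSON.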

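The scene exports currently included in this repo are:

  • sample-scene.json: the scene export used by context-backed questions
  • sampe-ui-scene.json: the UI scene export used by context-backed questions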

Editor

The repo includes a local dataset editor served by editor.py.

Start it with:

source .venv/bin/activate
python editor.py

Then open http://localhost:7878.

Current editor features include:

  • Search and category filtering
  • Inline editing of question text, target JSON, category, and subcategory
  • Add, update, and delete operations against v1.json
  • Context file selection with live JSON preview
  • Deep-linking to a question with ?q=<number>
  • Keyboard save with Cmd+S / Ctrl+S

The editor is meant for maintaining the dataset locally. It is not intended as a production web app.

Static viewer

index.html is a static HTML viewer for browsing benchmark questions and reference answers. It is useful for quick inspection, review, or printing.

Open it directly in a browser. On macOS, for example:

open index.html

or just double-click the file in your file browser.

Extending the benchmark

If you add new questions, try to preserve the benchmark's current design principles:

  • Questions should have concrete, machine-checkable answers
  • Spatial reasoning should be the core difficulty, not trivia or vague preference
  • Prompt wording should be explicit about coordinate systems, handedness, axes, and conventions
  • Tasks should stay grounded in realistic agent or scene-manipulation scenarios
  • Validators should be used when multiple answers are legitimately acceptable

When adding context-backed questions:

  • Keep context files in-repo
  • Reference them with context_file
  • Prefer small, stable exports over huge project folders
  • Use context_label to make prompt injection readable

When adding DSL questions:

  • Keep the required operation set minimal but meaningful
  • Include multiple test cases
  • Make the task require procedure synthesis, not just arithmetic

Benchmark outputs

The Inspect scorer reports:

  • Accuracy
  • Bootstrap standard error
  • Per-sample explanations that include category and subcategory labels

If you browse the results with inspect view, those labels make it easy to inspect failure modes by question family.
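
To give a feel for what a bootstrap standard error on accuracy measures, here is a generic sketch. It is not the repository's or Inspect's implementation; the function name, the 1000 resamples, and the fixed seed are all illustrative assumptions.

```python
import random
import statistics

def bootstrap_stderr(scores, num_samples=1000, seed=0):
    """Standard deviation of the mean across resamples drawn with replacement."""
    rng = random.Random(seed)
    means = [
        statistics.fmean(rng.choices(scores, k=len(scores)))
        for _ in range(num_samples)
    ]
    return statistics.stdev(means)

# Hypothetical run: 60 of 80 questions scored correct.
scores = [1.0] * 60 + [0.0] * 20
print(statistics.fmean(scores), bootstrap_stderr(scores))
```

With only 80 questions the standard error on accuracy is several percentage points, so small gaps between models should be read with care.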

Notes

  • The benchmark currently ships as v1.json
  • The scorer is intentionally strict about final answer formatting
  • Historical run notes live in results.md, but those are not part of the benchmark definition

If you publish results from this repo, it is a good idea to report the exact commit hash, dataset version, model string, and any prompt or system-message modifications you made.
