Rahul Venkatesh*1 · Klemen Kotar*1 · Lilian Naing Chen*1 · Seungwoo Kim1 · Luca Thomas Wheeler1 · Jared Watrous1 · Ashley Xu1 · Gia Ancone1 · Wanhee Lee1 · Honglin Chen2 · Daniel Bear3 · Stefan Stojanov1 · Daniel L. K. Yamins1
1Stanford · 2OpenAI · 3Noetik, Inc.
(* equal contribution)
SpelkeBench is a dataset of 500 images with associated segmentation annotations for evaluating models' ability to extract Spelke segments. Unlike conventional definitions of segmentation, Spelke segments provide a category-agnostic notion of segments based on what moves together in the physical world.
To download the dataset, run the following command from the top level of the repository; it will save the dataset inside the `datasets/` folder:

```bash
sh scripts/download_spelke_bench.sh
```
The dataset is a `.h5` file where each key contains a dictionary with the following entries:

- `image`: the input image
- `segments_gt`: the ground-truth Spelke segment
- `poke_location`: the location of the virtual poke to be used by the model to generate the segment
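For example, a single entry can be read with `h5py`. This is a minimal sketch; it assumes each top-level key is stored as an HDF5 group whose members match the entry names above, so check the file layout if your access pattern differs:

```python
import h5py
import numpy as np

# Minimal sketch of reading one SpelkeBench example with h5py. It assumes each
# top-level key is an HDF5 group whose members match the entry names above.
with h5py.File("datasets/spelke_bench.h5", "r") as f:
    key = sorted(f.keys())[0]                          # e.g. "entityseg_1_image2926"
    image = np.array(f[key]["image"])                  # input image
    segments_gt = np.array(f[key]["segments_gt"])      # ground-truth Spelke segment
    poke_location = np.array(f[key]["poke_location"])  # virtual poke location

print(key, image.shape, segments_gt.shape, poke_location)
```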
Some examples from the dataset, along with comparisons to SAM and Entity Segmentation, are shown below.
To discover Spelke segments, we build SpelkeNet, a model that learns to complete flow fields and implicitly captures how objects move together in the physical world.
```bash
cd SpelkeNet
conda create -n spelkenet python=3.10 -y
conda activate spelkenet
pip install -e .
```
- **Motion Affordance Maps**: estimate the regions likely to respond to an external force, independent of camera motion.
- **Expected Displacement Maps**: flow fields that capture how the scene will respond to an applied virtual poke.
We provide Jupyter notebooks that demonstrate how these maps can be extracted from SpelkeNet:
Using these structure extractions, we first sample a location that is likely to move from the motion affordance map and apply several virtual pokes at that location to identify regions that consistently move together. We then compute the expected motion correlation by averaging, across pokes, the dot product between the poke vector and the expected displacement map. Finally, Otsu thresholding on the averaged dot product yields the desired Spelke segment. A schematic of this procedure is sketched below.
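The sketch below illustrates the loop in NumPy. The calls `model.motion_affordance(...)` and `model.expected_displacement(...)` are hypothetical stand-ins for the SpelkeNet structure extractions (not the repository API), and the thresholding step uses `skimage.filters.threshold_otsu`:

```python
import numpy as np
from skimage.filters import threshold_otsu


def spelke_segment_sketch(image, model, num_pokes=8, seed=0):
    """Schematic of the segment-discovery loop described above.

    `model.motion_affordance(image)` and
    `model.expected_displacement(image, point, poke_vec)` are hypothetical
    stand-ins for the SpelkeNet structure extractions, not the repository API.
    """
    rng = np.random.default_rng(seed)

    # 1. Sample a poke location from the motion affordance map ([H, W], sums to 1).
    affordance = model.motion_affordance(image)
    idx = rng.choice(affordance.size, p=affordance.ravel())
    point = np.unravel_index(idx, affordance.shape)

    # 2. Apply several virtual pokes at that location and average, across pokes,
    #    the dot product between the poke vector and the expected displacement map.
    correlation = np.zeros(affordance.shape)
    for theta in np.linspace(0.0, 2.0 * np.pi, num_pokes, endpoint=False):
        poke_vec = np.array([np.cos(theta), np.sin(theta)])
        flow = model.expected_displacement(image, point, poke_vec)  # [H, W, 2]
        correlation += flow @ poke_vec
    correlation /= num_pokes

    # 3. Otsu thresholding on the averaged dot product yields the Spelke segment.
    return correlation > threshold_otsu(correlation)
```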
We provide a notebook which demonstrates how to extract Spelke segments from SpelkeNet:
We provide scripts to run inference and evaluate segmentation models on the SpelkeBench dataset. To evaluate a model, define a model class that inherits from `spelke_net.inference.spelke_object_discovery.spelke_bench_class.SpelkeBenchModel` and implement the `run_inference` method with the following signature:
```python
class SpelkeBenchModel:
    """
    Base class for SpelkeBench models.
    This class should be inherited by all models that are used in the SpelkeBench framework.
    """

    def __init__(self):
        """args to initialize the model"""
        return

    def run_inference(self, input_image, poke_point):
        '''
        Run inference on the input image and poke point.
        :param input_image: numpy array of shape [H, W, 3] in [0, 1] range
        :param poke_point: (x, y) tuple representing the poke point in the image, x horizontal, y vertical
        :return: [H, W] numpy array representing the segment mask
        '''
```
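For illustration, here is a toy model that satisfies this interface by returning a fixed-radius disk around the poke point. The class name and the 20-pixel radius are purely illustrative, not part of the repository:

```python
import numpy as np

from spelke_net.inference.spelke_object_discovery.spelke_bench_class import SpelkeBenchModel


class DiskBaselineModel(SpelkeBenchModel):
    """Toy example of the interface: predicts a fixed-radius disk around the poke point."""

    def __init__(self, radius=20):  # radius in pixels, purely illustrative
        super().__init__()
        self.radius = radius

    def run_inference(self, input_image, poke_point):
        H, W, _ = input_image.shape
        x, y = poke_point  # x horizontal, y vertical
        yy, xx = np.mgrid[0:H, 0:W]
        return ((xx - x) ** 2 + (yy - y) ** 2 <= self.radius ** 2).astype(np.uint8)
```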
We provide a reference implementation of a SpelkeNet model in `spelke_net.inference.spelke_object_discovery.spelke_bench_class.SpelkeNetModel1B`. This class is initialized with the following default values:
| Parameter | Default Value | Description |
|---|---|---|
| `num_zoom_iters` | `2` | Number of zoom-in refinement iterations. |
| `num_seq_patches` | `256` | Number of sequential flow token generations. |
| `num_seeds` | `3` | Number of rollouts per virtual poke. |
| `num_dirs` | `8` | Number of virtual pokes. |
| `model_name` | `model_1B.pt` | Name of the model checkpoint file to load; refers to the 1-billion-parameter SpelkeNet checkpoint. |
| `num_zoom_dirs` | `5` | Number of virtual pokes applied during the zoom-in stage. |
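For direct use in Python, the model can be instantiated with non-default values. This is a sketch only: it assumes the parameters in the table above are exposed as constructor keyword arguments, so check the class signature before relying on it.

```python
import h5py
import numpy as np

from spelke_net.inference.spelke_object_discovery.spelke_bench_class import SpelkeNetModel1B

# Sketch only: assumes the parameters in the table above are constructor
# keyword arguments; check the class signature before relying on this.
model = SpelkeNetModel1B(num_zoom_iters=1, num_seeds=1, num_dirs=4)

with h5py.File("datasets/spelke_bench.h5", "r") as f:
    example = f["entityseg_1_image2926"]
    image = np.array(example["image"]).astype(np.float32)  # rescale to [0, 1] if stored as uint8
    poke_point = tuple(np.array(example["poke_location"]))

mask = model.run_inference(image, poke_point)  # [H, W] segment mask
```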
Use the following command to run inference on one or more images:

```bash
python spelke_net/inference/spelke_object_discovery/run_inference.py \
--device cuda:<device_id> \
--dataset_path ./datasets/spelke_bench.h5 \
--img_names entityseg_1_image2926 \
--output_dir <out_dir> \
--model_name SpelkeNetModel1B
```
| Flag | Description |
|---|---|
| `--device` | Which GPU to use (e.g., `cuda:0`) |
| `--dataset_path` | Path to the SpelkeBench `.h5` file |
| `--img_names` | Space-separated list of image keys in the `.h5` file |
| `--output_dir` | Directory to save prediction outputs |
| `--model_name` | Model class to use for inference (e.g., `SpelkeNetModel1B`) |
You can pass multiple image keys, for example:

```bash
--img_names entityseg_1_image2926 entityseg_2_image1258 ...
```
To run inference in parallel across multiple nodes and GPUs, we provide a wrapper script. Here's a typical setup assuming:
- 4 nodes are available
- each node has 4 GPUs
- you want to split the workload evenly across nodes
You would launch the following on each node with node-specific values:
```bash
python spelke_net/inference/spelke_object_discovery/run_inference_parallel.py \
--gpus 0 1 2 3 \
--dataset_path ./datasets/spelke_bench.h5 \
--output_dir <out_dir> \
--num_splits 4 \
--split_num <node_id> \
--model_name SpelkeNetModel1B
```
To use the larger 7B model, replace the model name: `--model_name SpelkeNetModel7B`.
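For intuition, the `--num_splits` / `--split_num` flags give each node one roughly equal share of the image keys. The schematic below shows one plausible partitioning; the wrapper's actual scheme may differ:

```python
import numpy as np

# Schematic of what --num_splits / --split_num imply: each node processes one
# roughly equal share of the image keys. The wrapper's actual partitioning
# scheme may differ (e.g. strided rather than contiguous chunks).
def split_keys(all_keys, num_splits, split_num):
    chunks = np.array_split(sorted(all_keys), num_splits)
    return list(chunks[split_num])

print(split_keys([f"img_{i}" for i in range(10)], num_splits=4, split_num=0))
```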
Each prediction is saved as a separate `.h5` file in `<out_dir>`, containing the following keys:

- `segment_pred`: predicted Spelke segment mask
- `segment_gt`: ground-truth Spelke segment mask
- `probe_point`: virtual poke location
- `image`: original input image
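As a quick sanity check, the IoU for a single prediction can be recomputed directly from one of these files. This is a sketch that assumes the masks are stored as binary [H, W] arrays; the evaluation script below remains the authoritative implementation:

```python
import h5py
import numpy as np

# Sketch: recompute IoU for one prediction, assuming segment_pred / segment_gt
# are stored as binary [H, W] arrays.
path = "<out_dir>/<prediction_file>.h5"  # replace with an actual output file
with h5py.File(path, "r") as f:
    pred = np.array(f["segment_pred"]) > 0
    gt = np.array(f["segment_gt"]) > 0

iou = (pred & gt).sum() / max((pred | gt).sum(), 1)
print(f"IoU: {iou:.4f}")
```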
After inference is complete, run the evaluation script:

```bash
python spelke_net/inference/spelke_object_discovery/evaluate_folder.py \
--input_dir <out_dir> \
--output_dir <metrics_out_dir>
```
This will:

- Save visualizations to `<metrics_out_dir>`
- Print the following metrics to the console:
  - Average Recall (AR)
  - Mean Intersection-over-Union (mIoU)
| Metric | SAM2 | DINOv1-B/8 | DINOv2-L/14 | DINOv2-G/14 | CWM | SpelkeNet |
|---|---|---|---|---|---|---|
| AR | 0.4816 | 0.2708 | 0.2524 | 0.2254 | 0.3271 | 0.5411 |
| mIoU | 0.6225 | 0.4990 | 0.4931 | 0.4553 | 0.4807 | 0.6811 |
To evaluate a custom model:

1. Implement a model class following the `SpelkeBenchModel` interface in the file `spelke_net/inference/spelke_object_discovery/spelke_bench_class.py`.
2. Pass its class name to the `--model_name` argument in the above commands.
First, install 3DEditBench in the same `spelkenet` environment:

```bash
git clone https://github.com/neuroailab/3DEditBench.git
cd 3DEditBench
conda activate spelkenet
pip install -e . --no-deps
```

Then, from the SpelkeNet repository, download the 3DEditBench dataset:

```bash
cd SpelkeNet
sh scripts/download_3DEditBench.sh
```

The following command downloads the precomputed segments into `datasets/precomputed_segments/`:

```bash
sh scripts/download_3DEditBench_precomputed_segments.sh
```
To run inference in parallel across multiple nodes and GPUs, 3DEditBench provides a wrapper script similar to the SpelkeBench evaluation. For a cluster of 4 nodes × 4 GPUs, launch the following on each node:
```bash
editbench-launch \
--gpus 0 1 2 3 \
--dataset_path ./datasets/3DEditBench \
--output_dir ./experiments/my_model_run \
--num_splits 4 \
--split_num <node_id> \
--model_class <your_model.YourEditingModel>
```
Replace `<node_id>` with the appropriate node index (`0`, `1`, `2`, or `3`).
You can choose from the following predefined models by setting the `--model_class` flag:

| Model Class | Description |
|---|---|
| `spelke_net.inference.object_manipulation.edit_model.ObjectEditModelSAM` | Uses the Segment Anything Model (SAM) to generate object masks (pre-computed) based on the point prompt. |
| `spelke_net.inference.object_manipulation.edit_model.ObjectEditModelSpelkeNet` | Uses SpelkeNet to infer motion-based object segments (pre-computed) from point prompts. |
| `spelke_net.inference.object_manipulation.edit_model.ObjectEditModelGT` | Uses the ground-truth segmentation masks provided in the 3DEditBench dataset. |
> **Tip:** After all splits finish, you can evaluate the results with the `editbench-evaluate-metrics` utility on the `hdf5_result_files/` directory.
After all splits finish, you can evaluate your model's aggregate performance on 3DEditBench:

```bash
editbench-evaluate-metrics \
--predictions_path ./experiments/my_model_run/hdf5_result_files \
--results_dir <your_results_dir>
```
| Method | Segment | MSE ↓ | PSNR ↑ | LPIPS ↓ | SSIM ↑ | EA ↑ |
|---|---|---|---|---|---|---|
| LRAS-3D | SpelkeNet | 0.009 | 21.64 | 0.213 | 0.698 | 0.776 |
| | SAM | 0.013 | 20.17 | 0.255 | 0.685 | 0.633 |
| LightningDrag | SpelkeNet | 0.017 | 19.16 | 0.195 | 0.672 | 0.679 |
| | SAM | 0.020 | 18.18 | 0.241 | 0.658 | 0.536 |
| Diffusion Handles | SpelkeNet | 0.024 | 17.42 | 0.364 | 0.555 | 0.576 |
| | SAM | 0.031 | 16.15 | 0.419 | 0.526 | 0.495 |
| Diffusion as Shader | SpelkeNet | 0.015 | 19.29 | 0.194 | 0.707 | 0.640 |
| | SAM | 0.019 | 18.20 | 0.253 | 0.682 | 0.503 |
Use the command below to apply custom object edits with either SAM or SpelkeNet segments by setting the `--segment_type` flag. You can control the object's rotation (azimuth, elevation, tilt) and translation (tx, ty, tz) in 3D space. These transformations are specified in the world coordinate system, where azimuth controls rotation around the axis vertical to the ground plane.
```bash
CUDA_VISIBLE_DEVICES=<gpu_id> \
python spelke_net/inference/object_manipulation/custom_object_edits.py \
--hdf5_file ./datasets/3DEditBench/0005.hdf5 \
--segment_hdf5_file ./datasets/precomputed_segments/0005.hdf5 \
--output_dir ./experiments/test_custom \
--segment_type spelkenet \
--azimuth 0.0 --elevation 0.0 --tilt 0.0 \
--tx 0.15 --ty 0.0 --tz 0.0 \
--num_runs 1
```
| Flag | Description |
|---|---|
| `--hdf5_file` | Path to the input scene (`.hdf5`) containing RGB frames and GT 3D edits. |
| `--segment_hdf5_file` | Path to the segmentation masks (`.hdf5`) for the same scene. |
| `--segment_type` | Which masks to use: `sam`, `spelkenet`, or `GT`. |
| `--output_dir` | Root directory for this run's outputs. A `viz/` subfolder is created automatically. |
| `--azimuth`, `--elevation`, `--tilt` | Rotations (degrees): yaw, pitch, and roll, respectively. |
| `--tx`, `--ty`, `--tz` | Translations along the X, Y, Z axes in meters. |
| `--num_runs` | Number of random generations for the same 3D transform. |
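For intuition, the rotation and translation flags together specify a rigid transform in world coordinates. The sketch below composes them into a 4×4 matrix with `scipy.spatial.transform.Rotation`; the axis convention and composition order shown here are assumptions for illustration, not necessarily what `custom_object_edits.py` uses internally.

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

# Sketch of the 4x4 world transform implied by the flags above. The axis
# convention and composition order are assumptions for illustration only.
azimuth, elevation, tilt = 0.0, 0.0, 0.0   # degrees: yaw, pitch, roll
tx, ty, tz = 0.15, 0.0, 0.0                # meters

transform = np.eye(4)
transform[:3, :3] = R.from_euler("ZYX", [azimuth, elevation, tilt], degrees=True).as_matrix()
transform[:3, 3] = [tx, ty, tz]
print(transform)
```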
**Output format:** This command reproduces the result in the teaser image and saves visualizations in `./experiments/test_custom/viz/`.
If you use this code or the dataset in your research, please cite our paper:
```bibtex
@misc{venkatesh2025discoveringusingspelkesegments,
      title={Discovering and using Spelke segments},
      author={Rahul Venkatesh and Klemen Kotar and Lilian Naing Chen and Seungwoo Kim and Luca Thomas Wheeler and Jared Watrous and Ashley Xu and Gia Ancone and Wanhee Lee and Honglin Chen and Daniel Bear and Stefan Stojanov and Daniel Yamins},
      year={2025},
      eprint={2507.16038},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2507.16038},
}
```