Seungwoo Kim*1 · Khai Loong Aw*1 · Klemen Kotar*1
Cristobal Eyzaguirre1 · Wanhee Lee1 · Yunong Liu1 · Jared Watrous1
Stefan Stojanov1 · Juan Carlos Niebles1 · Jiajun Wu1 · Daniel L. K. Yamins1
1Stanford
(* equal contribution)
We introduce KL-tracing, a novel test-time inference procedure that uses the Kullback-Leibler (KL) divergence of prediction logits to extract optical flow zero-shot from a generative video model, without any task-specific fine-tuning. Combined with the Local Random Access Sequence (LRAS) model, KL-tracing achieves state-of-the-art point tracking and optical flow results.
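At a high level (a sketch based on our description above, not the repo's actual API): inject a small perturbation around the query point, re-run the frozen video model, and compute the per-location KL divergence between the perturbed and clean next-frame prediction logits; the KL map peaks where the point lands in the next frame. A minimal, hypothetical PyTorch sketch, where the model interface, tensor shapes, and perturbation scheme are all illustrative assumptions:

```python
# Illustrative sketch of the KL-tracing idea; NOT the repo's API.
# The model interface, shapes, and perturbation are assumptions.
import torch
import torch.nn.functional as F

def kl_trace(model, frames: torch.Tensor, query_xy: tuple[int, int],
             patch: int = 8) -> tuple[int, int]:
    """Locate a frame-0 query point in frame 1 via a logit-space KL map.

    model:  callable mapping a clip (T, C, H, W) to per-location
            next-frame token logits of shape (H, W, V)
    frames: video clip with values in [0, 1]
    """
    clean_logits = model(frames)  # (H, W, V): clean next-frame prediction

    # Perturb a small patch around the query point in frame 0.
    x, y = query_xy
    y0, y1 = max(0, y - patch), y + patch
    x0, x1 = max(0, x - patch), x + patch
    perturbed = frames.clone()
    perturbed[0, :, y0:y1, x0:x1] += \
        0.1 * torch.randn_like(perturbed[0, :, y0:y1, x0:x1])
    pert_logits = model(perturbed)  # (H, W, V): perturbed prediction

    # Per-location KL(perturbed || clean): large wherever the perturbation
    # shows up in the predicted next frame, i.e. where the point moved.
    kl = (F.softmax(pert_logits, -1)
          * (F.log_softmax(pert_logits, -1)
             - F.log_softmax(clean_logits, -1))).sum(-1)  # (H, W)

    flat = kl.flatten().argmax().item()
    return flat % kl.shape[1], flat // kl.shape[1]  # (x, y) of the KL peak
```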
conda create -n kl_tracing python=3.10
conda activate kl_tracing
pip install uv
uv pip install -e .[dev]
# [Optional] for linting
pip install pre-commit
pre-commit install
pre-commit run --all-files
The evaluation script expects a JSON file that contains all evaluation points for TAP-Vid DAVIS.
- Download the TAP-Vid DAVIS pickle file here.
- Run the following:
python preproc_tapvid.py \
--pkl_path <pkl_path> \
--img_root_dir data/davis_frames \
--json_path data/davis_dataset.json
This will save all the frames to `img_root_dir` if they do not already exist (TAP-Vid DAVIS consists of 30 videos, each with a varying number of frames), and create a JSON dataset at `json_path` containing all evaluation points. This JSON dataset can be used to run `eval.py`.
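To sanity-check the output, you can load the JSON and inspect it; the exact schema below is an assumption (adjust once you see what `preproc_tapvid.py` actually writes):

```python
# Peek at the generated dataset; the top-level structure is an
# assumption here, not a documented schema.
import json

with open("data/davis_dataset.json") as f:
    dataset = json.load(f)

print(type(dataset).__name__, len(dataset))
```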
- First, run the step above to get the full dataset.
- Run the following:
python preproc_tapvid.py \
--json_path data/davis_dataset.json \
--sample
This will sample from the full dataset (as saved in `json_path`) and output a smaller eval JSON in `data/mini_davis_dataset.json`. The evaluation points will be chosen based on `data/mini_davis_dataset_template.json` for reproducibility.
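As a quick reproducibility check, you can confirm the sampled set lines up with the template (comparing sizes only; the schema is an assumption):

```python
# Compare the sampled eval set against the shipped template.
import json

with open("data/mini_davis_dataset.json") as f:
    sampled = json.load(f)
with open("data/mini_davis_dataset_template.json") as f:
    template = json.load(f)

print(len(sampled), "sampled entries vs", len(template), "template entries")
```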
Run the following:
DEVICE=0 ./run_tapvid_davis.sh <start_idx> <num_points>
This will generate results in `eval_out/tapvid_davis_results` for the subset `dataset[start_idx:start_idx+num_points]`. Launch the script in parallel on different subsets of the data, e.g. one chunk per GPU as sketched below.
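A minimal parallel-launch sketch, assuming 4 GPUs and 250-point chunks (both numbers are placeholders for your setup; the total point count depends on the dataset):

```bash
# Launch one worker per GPU over disjoint index ranges (sizes are assumptions).
for GPU in 0 1 2 3; do
  DEVICE=$GPU ./run_tapvid_davis.sh $((GPU * 250)) 250 &
done
wait  # block until all workers finish
```

Once all results are generated, you can run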
python offline_eval.py \
--pkl_path <path_to_davis_dataset_pkl> \
--json_path <path_to_davis_dataset_json> \
--root_dirs <path_to_results_dir>
which will aggregate all the results and print out the final TAP-Vid metric.
Alternatively, when using the Mini TAP-Vid DAVIS evaluation, you can repeat the above with the script `./run_sampled_davis.sh`, and then run the following:
python offline_eval.py \
--root_dirs <path_to_results_dir> \
--sampled
Using this script, we achieve an end-point error (EPE) of 1.23 on the Mini TAP-Vid DAVIS dataset.
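For reference, EPE here is the standard end-point error: the mean Euclidean distance between predicted and ground-truth point locations, e.g.:

```python
# End-point error: mean L2 distance between predicted and ground-truth
# point locations. pred and gt are (N, 2) arrays of (x, y) coordinates.
import numpy as np

def epe(pred: np.ndarray, gt: np.ndarray) -> float:
    return float(np.linalg.norm(pred - gt, axis=-1).mean())
```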
If you find this project useful, please consider citing:
@misc{kim2025taminggenerativevideomodels,
title={Taming generative video models for zero-shot optical flow extraction},
author={Seungwoo Kim and Khai Loong Aw and Klemen Kotar and Cristobal Eyzaguirre and Wanhee Lee and Yunong Liu and Jared Watrous and Stefan Stojanov and Juan Carlos Niebles and Jiajun Wu and Daniel L. K. Yamins},
year={2025},
eprint={2507.09082},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2507.09082},
}