We introduce Synthetic Visual Genome 2 (SVG2), a large-scale panoptic video scene graph dataset containing over 636K videos with 6.6M objects, 52.0M attributes, and 6.7M relations, providing an order-of-magnitude increase in scale and diversity over prior spatio-temporal scene graph datasets.
To create SVG2, we design a fully automated pipeline that combines multi-scale panoptic segmentation, online-offline trajectory tracking with automatic new-object discovery, per-trajectory semantic parsing, and GPT-5-based spatio-temporal relation inference.
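The tracking stage can be illustrated with a minimal sketch: per-frame detections are matched to existing trajectories by overlap, and unmatched detections spawn new trajectories (the "new-object discovery" step). All names and the IoU-greedy matching rule are illustrative assumptions, not the pipeline's actual implementation.

```python
# Hedged sketch of trajectory tracking with new-object discovery.
# Boxes are (x1, y1, x2, y2); masks would work the same way with mask IoU.

def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def track(frames, iou_thresh=0.5):
    trajectories = []  # each: {"boxes": {frame_index: box}}
    for t, detections in enumerate(frames):
        unmatched = list(detections)
        for traj in trajectories:
            last = traj["boxes"][max(traj["boxes"])]
            best = max(unmatched, key=lambda d: iou(last, d), default=None)
            if best is not None and iou(last, best) >= iou_thresh:
                traj["boxes"][t] = best  # extend an existing trajectory
                unmatched.remove(best)
        for det in unmatched:  # new-object discovery: start a fresh trajectory
            trajectories.append({"boxes": {t: det}})
    return trajectories
```

The real pipeline combines online tracking with an offline refinement pass over SAM2 masklets; this greedy per-frame matcher only conveys the data flow.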
Building on this resource, we train TRASER, a video scene graph generation model that augments a VLM with a trajectory-aligned token arrangement and two new modules, an object-trajectory resampler and a temporal-window resampler, converting raw videos and panoptic trajectories into compact spatio-temporal scene graphs in a single forward pass.
On PVSG, VIPSeg, VidOR, and SVG2test, TRASER improves relation detection by 15–20% and object prediction by 30–40% over the strongest open-source baselines. When TRASER's generated scene graphs are provided to a VLM for video question answering, they deliver a 1.5–4.6% absolute accuracy gain, demonstrating the utility of explicit spatio-temporal scene graphs.
Our fully automated pipeline integrates SAM2, the Describe Anything Model (DAM), and GPT-5 to produce dense, temporally grounded video scene graphs.
Human verification on 100 sampled videos shows 93.8% accuracy for object labels, 79.0% for attributes, and 85.4% for relations.
SVG2 is the first large-scale video scene graph dataset with dense panoptic annotations. We sample 43K videos from SA-V and 593K videos from PVD, yielding 6.6M object instances, 52.0M attributes, and 6.7M spatio-temporal relations.
Additionally, we construct SVG2test, a human-annotated benchmark of 100 videos with multi-granularity panoptic annotations that follow the hierarchy of human visual perception.
| Dataset | #Videos | Annotator | Type | Frames/Vid | #Obj/#Traj | #Obj Cls | #Relations | #Rel Cls | #Attributes |
|---|---|---|---|---|---|---|---|---|---|
| SA-V | 50.9K | SAM-2 + Human | SegS | 330 | 0.6M | - | - | - | - |
| VIPSeg | 2.8K | Human | SegS | 23 | 38.2K | 124 | - | - | - |
| VidVRD | 0.8K | Human | BoxS | 304 | 2.4K | 35 | 25.9K | 132 | - |
| Action Genome | 7.8K | Human | BoxS | 808 | 26.2K | 35 | 1.4M* | 26 | - |
| VidOR | 7.0K | Human | BoxD | 1K | 34.6K | 80 | 0.3M | 50 | - |
| PVSG | 338 | Human | SegS | 375 | 6.3K | 125 | 3.6K | 62 | - |
| SVG2 (Ours) | 636K | Our Pipeline | SegD | 479 | 6.6M | 54.2K | 6.7M | 35.3K | 52M |
| SVG2test | 100 | Human | SegD | 160 | 3.2K | 749 | 3.3K | 249 | 9.7K |
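An SVG2-style annotation ties object labels, attributes, and temporally windowed relations to trajectory IDs. The record below is an illustrative schema for exposition only; the field names are our assumptions, not the released format.

```python
# Illustrative (assumed) schema for one video's scene graph annotation.
record = {
    "video_id": "sav_000123",
    "objects": [
        {"id": 0, "label": "person", "attributes": ["standing", "wearing red jacket"]},
        {"id": 1, "label": "dog", "attributes": ["brown", "small"]},
    ],
    "relations": [
        # subject/object reference trajectory ids; span is [start_frame, end_frame]
        {"subject": 0, "predicate": "walking beside", "object": 1, "span": [0, 120]},
    ],
}

def validate(rec):
    """Check that every relation points at known objects and has an ordered span."""
    ids = {o["id"] for o in rec["objects"]}
    return all(r["subject"] in ids and r["object"] in ids
               and r["span"][0] <= r["span"][1]
               for r in rec["relations"])
```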
TRASER is a VLM that produces structured video scene graphs in one forward pass from raw videos and panoptic object trajectories.
The dual resampler design balances global object context with local temporal grounding, enabling accurate prediction of both object attributes and temporally-localized relations.
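The intuition behind the two pathways can be sketched with simple pooling: one path aggregates features over an entire object trajectory (global context for attributes), the other aggregates within fixed temporal windows (local grounding for time-stamped relations). TRASER's actual resamplers are learned modules; the mean pooling below is purely illustrative.

```python
# Hedged sketch of the dual-resampler idea (mean pooling stands in for
# learned resamplers; function names are ours, not TRASER's).

def mean(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def object_trajectory_tokens(traj_feats):
    # traj_feats: {obj_id: [per-frame feature vectors]} -> one token per object,
    # summarizing its whole trajectory (global object context).
    return {oid: mean(feats) for oid, feats in traj_feats.items()}

def temporal_window_tokens(frame_feats, window=4):
    # frame_feats: list of per-frame feature vectors -> one token per window,
    # preserving local temporal grounding.
    return [mean(frame_feats[s:s + window])
            for s in range(0, len(frame_feats), window)]
```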
We evaluate TRASER against leading proprietary and open-source VLMs on four benchmarks using panoptic object trajectories as input. TRASER outperforms all open-source baselines and surpasses GPT-5 in both object and attribute prediction.
| Model | Triplet PVSG | Triplet VidOR | Triplet SVG2t | Rel. PVSG | Rel. VidOR | Rel. SVG2t | Obj. VIPSeg | Obj. PVSG | Obj. VidOR | Obj. SVG2t | Attr. SVG2t | Rank |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Proprietary Models (API) | ||||||||||||
| GPT-4.1 | 6.0 | 10.8 | 6.4 | 7.3 | 11.9 | 7.5 | 59.9 | 51.4 | 86.6 | 58.5 | 15.8 | 3 |
| Gemini-2.5 Pro | 7.4 | 9.8 | 8.7 | 8.8 | 11.0 | 9.9 | 56.7 | 31.7 | 82.9 | 49.8 | 13.6 | 4 |
| GPT-5 | 16.6 | 19.7 | 17.9 | 18.3 | 21.7 | 19.4 | 68.1 | 54.2 | 88.5 | 65.5 | 24.1 | 2 |
| Open-Source Models | ||||||||||||
| Qwen2.5-VL-3B | 0.1 | 0.2 | 0.2 | 0.1 | 0.4 | 0.3 | 22.1 | 10.4 | 45.0 | 24.2 | 1.4 | 12 |
| MiniCPM-V 4.5 | 0.1 | 3.0 | 1.1 | 0.2 | 4.0 | 2.4 | 40.0 | 14.3 | 59.1 | 38.5 | 8.4 | 8 |
| InternVL3.5-4B | 0.2 | 0.4 | 0.1 | 0.3 | 0.5 | 0.2 | 33.7 | 20.4 | 66.4 | 35.0 | 7.4 | 11 |
| GLM-4.1-9B-Thinking | 0.3 | 3.9 | 1.8 | 0.5 | 5.0 | 2.9 | 46.5 | 17.8 | 61.1 | 28.5 | 9.1 | 6 |
| Qwen3-VL-4B | 0.1 | 0.7 | 1.4 | 0.1 | 0.8 | 1.6 | 34.1 | 21.8 | 65.8 | 35.6 | 8.3 | 10 |
| Qwen3-VL-4B-Thinking | 0.1 | 2.3 | 3.3 | 0.4 | 3.4 | 3.6 | 35.8 | 18.3 | 67.6 | 37.1 | 8.8 | 7 |
| FT-Qwen2.5-VL-3B (1st Bbox) | 0.1 | 1.6 | 0.1 | 0.9 | 4.5 | 0.5 | 25.5 | 25.9 | 51.7 | 36.7 | 10.4 | 9 |
| FT-Qwen2.5-VL-3B (Bbox Traj.) | 0.5 | 1.8 | 1.4 | 1.6 | 4.2 | 3.0 | 35.1 | 33.6 | 56.9 | 46.1 | 13.4 | 5 |
| TRASER (Ours) | 16.1 | 22.9 | 16.7 | 16.9 | 25.0 | 18.7 | 86.5 | 72.7 | 91.4 | 79.0 | 27.1 | 1 |
We assess the utility of structured video scene graphs for downstream video QA. High-quality scene graphs from TRASER consistently improve GPT-4.1's VQA accuracy over both video-only input and input augmented with Qwen2.5-VL's scene graphs.
| Benchmark | Video Only | Video + Qwen2.5-VL's VSG | Video + TRASER's VSG |
|---|---|---|---|
| AGQA 2.0 | 25.9 | 24.8 | 26.3 |
| Perception-Test | 66.8 | 68.6 | 71.4 |
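A scene graph can be injected into a VQA query by serializing its objects and timestamped relations into plain text. The template below is our assumption for illustration; the paper's exact prompt format may differ.

```python
# Hedged sketch: serialize a generated scene graph into text context for a VLM.
def scene_graph_to_text(objects, relations):
    lines = [f"{o['id']}: {o['label']} ({', '.join(o['attributes'])})"
             for o in objects]
    lines += [f"{r['subject']} {r['predicate']} {r['object']} "
              f"[frames {r['span'][0]}-{r['span'][1]}]"
              for r in relations]
    return "\n".join(lines)

def build_vqa_prompt(sg_text, question):
    # Assumed prompt template: scene-graph context precedes the question.
    return f"Scene graph:\n{sg_text}\n\nQuestion: {question}\nAnswer:"
```

In the experiments above, text like this is passed to GPT-4.1 alongside the video frames, which is what "Video + TRASER's VSG" denotes.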
@article{gao2026svg2,
author = {Gao, Ziqi and Zhang, Jieyu and Ikezogwo, Wisdom Oluchi and Park, Jae Sung and You, Tario G and Ogbu, Daniel and Zheng, Chenhao and Huang, Weikai and Yang, Yinuo and Han, Winson and Kong, Quan and Saini, Rajat and Krishna, Ranjay},
title = {Synthetic Visual Genome 2: Extracting Large-scale Spatio-Temporal Scene Graphs from Videos},
year = {2026},
}