Synthetic Visual Genome 2: Extracting Large-scale Spatio-Temporal Scene Graphs from Videos

¹Allen Institute for AI, ²University of Washington, ³Woven by Toyota

Synthetic Visual Genome 2 (SVG2) provides large-scale video scene graphs with 636K videos, 6.6M objects, 52M attributes, and 6.7M relations for training video understanding models.

Abstract

We introduce Synthetic Visual Genome 2 (SVG2), a large-scale panoptic video scene graph dataset containing over 636K videos with 6.6M objects, 52.0M attributes, and 6.7M relations, providing an order-of-magnitude increase in scale and diversity over prior spatio-temporal scene graph datasets.

To create SVG2, we design a fully automated pipeline that combines multi-scale panoptic segmentation, online-offline trajectory tracking with automatic new-object discovery, per-trajectory semantic parsing, and GPT-5-based spatio-temporal relation inference.

Building on this resource, we train TRASER, a video scene graph generation model that augments a VLM with a trajectory-aligned token arrangement mechanism and two new modules, an object-trajectory resampler and a temporal-window resampler, to convert raw videos and panoptic trajectories into compact spatio-temporal scene graphs in a single forward pass.

On PVSG, VIPSeg, VidOR, and SVG2test, TRASER improves relation detection by +15–20% and object prediction by +30–40% over the strongest open-source baselines. When TRASER's generated scene graphs are provided to a VLM for video question answering, they deliver a +1.5–4.6% absolute accuracy gain, demonstrating the utility of explicit spatio-temporal scene graphs.

Data Generation Pipeline


Our fully automated pipeline integrates SAM2, Describe Anything Model (DAM), and GPT-5 to produce dense, temporally grounded video scene graphs:

  • Phase 1: Panoptic Trajectory Generation - A two-stage online-offline tracking framework achieves dynamic object discovery and global temporal consistency using SAM2 with multi-scale grid prompts.
  • Phase 2: Object Description and Parsing - DAM-3B-Video generates detailed descriptions for each track, then GPT-4-nano extracts object names and attributes. SAM3-based verification filters unreliable labels.
  • Phase 3: Spatiotemporal Relation Extraction - GPT-5 infers inter-object relations including spatial, functional, stateful, motion, social, attentional, and event-level interactions.
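The three phases above can be sketched as a simple driver. This is a minimal illustration of the control flow only: all model calls (SAM2 tracking, DAM captioning, GPT-based parsing and relation inference) are replaced by stand-in functions, and every name here is hypothetical, not the authors' actual API.

```python
def track_objects(frames):
    """Phase 1 stand-in: return trajectories (track id -> per-frame masks)."""
    return {0: [f"mask@{t}" for t in range(len(frames))]}

def describe_and_parse(trajectory):
    """Phase 2 stand-in: caption a track, then parse out name and attributes."""
    return {"name": "object", "attributes": ["generic"]}

def infer_relations(object_ids):
    """Phase 3 stand-in: pairwise relation inference over parsed objects."""
    return [(a, "near", b) for a in object_ids for b in object_ids if a < b]

def build_scene_graph(frames):
    tracks = track_objects(frames)                                   # Phase 1
    objects = {i: describe_and_parse(t) for i, t in tracks.items()}  # Phase 2
    relations = infer_relations(sorted(objects))                     # Phase 3
    return {"objects": objects, "relations": relations}

graph = build_scene_graph(frames=list(range(8)))
```

The real pipeline additionally verifies labels (SAM3-based filtering) and anchors each relation to a temporal span; those steps are omitted here for brevity.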

Human verification on 100 sampled videos shows 93.8% accuracy for object labels, 79.0% for attributes, and 85.4% for relations.

SVG2 Dataset

SVG2 is the first large-scale video scene graph dataset with dense panoptic annotations. We sample 43K videos from SA-V and 593K videos from PVD, resulting in 6.6M object instances, 52M attributes, and 6.7M spatiotemporal relations.

Additionally, we construct SVG2test, a human-annotated benchmark of 100 videos with multi-granularity panoptic annotations following hierarchical human visual perception.

Comparison with Related Benchmarks

| Dataset | #Videos | Annotator | Type | Frames/Vid | #Obj/#Traj | Obj Cls | #Relations | Rel Cls | #Attributes |
|---|---|---|---|---|---|---|---|---|---|
| SA-V | 50.9K | SAM-2 + Human | SegS | 330 | 0.6M | - | - | - | - |
| VIPSeg | 2.8K | Human | SegS | 23 | 38.2K | 124 | - | - | - |
| VidVRD | 0.8K | Human | BoxS | 304 | 2.4K | 35 | 25.9K | 132 | - |
| Action Genome | 7.8K | Human | BoxS | 808 | 26.2K | 35 | 1.4M* | 26 | - |
| VidOR | 7.0K | Human | BoxD | 1K | 34.6K | 80 | 0.3M | 50 | - |
| PVSG | 338 | Human | SegS | 375 | 6.3K | 125 | 3.6K | 62 | - |
| SVG2 (Ours) | 636K | Our Pipeline | SegD | 479 | 6.6M | 54.2K | 6.7M | 35.3K | 52M |
| SVG2test | 100 | Human | SegD | 160 | 3.2K | 749 | 3.3K | 249 | 9.7K |
Subscripts S/D denote sparse (sampled) vs. dense (per-frame) annotations. (*) Action Genome reports total frame-level relation instances, whereas others report unique trajectory-level instances.

Dataset Statistics

(Figures: object statistics, attribute statistics, and relationship overview.)

TRASER Model

(Figure: TRASER model architecture.)

TRASER is a VLM that produces structured video scene graphs in one forward pass from raw videos and panoptic object trajectories.

Key Components:

  • Trajectory-Aligned Token Arrangement: Binds ViT tokens to object trajectories based on segmentation coverage, producing identity-preserving token streams with explicit trajectory boundaries.
  • Object-Trajectory Resampler: Aggregates global semantics over each object's entire temporal span using Perceiver-Resampler with learnable latent queries.
  • Temporal-Window Resampler: Partitions video into temporal windows and resamples each window independently, preserving fine-grained motion and temporal dynamics crucial for relation detection.
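The trajectory-aligned arrangement can be illustrated with a coverage-based assignment: each ViT patch token is bound to the trajectory whose segmentation mask covers the largest fraction of that patch. This is a minimal sketch under assumed inputs (a per-object, per-patch coverage matrix), not the model's actual implementation.

```python
import numpy as np

def bind_tokens_to_trajectories(patch_coverage, coverage_thresh=0.5):
    """Assign each ViT patch token to an object trajectory (sketch).

    patch_coverage: (num_objects, num_patches) array of per-patch coverage
    fractions in [0, 1], one row per object trajectory.
    Returns a length-num_patches array holding the owning trajectory id,
    or -1 for background patches below the coverage threshold.
    """
    best = patch_coverage.argmax(axis=0)      # most-covering object per patch
    best_cov = patch_coverage.max(axis=0)     # that object's coverage fraction
    return np.where(best_cov >= coverage_thresh, best, -1)

# Two trajectories over four patches: patch 0 belongs to object 0,
# patch 1 to object 1, patches 2 and 3 are background.
cov = np.array([[0.9, 0.2, 0.0, 0.1],
                [0.0, 0.7, 0.3, 0.0]])
owners = bind_tokens_to_trajectories(cov)  # -> [0, 1, -1, -1]
```

Grouping tokens by `owners` then yields one identity-preserving token stream per trajectory, with explicit boundaries between objects.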

The dual resampler design balances global object context with local temporal grounding, enabling accurate prediction of both object attributes and temporally-localized relations.
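The dual resampler design can be sketched with a single cross-attention step: a fixed set of latent queries attends over a variable-length token stream and returns a fixed-size summary. The object-trajectory resampler applies this once over a track's full span; the temporal-window resampler applies it per window. This is a bare numpy sketch (no learned projections or feed-forward layers), not the actual Perceiver-Resampler architecture.

```python
import numpy as np

def resample(latents, tokens):
    """One cross-attention step: latent queries pool a token stream
    into a fixed-size summary, whatever the stream's length."""
    scores = latents @ tokens.T / np.sqrt(latents.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over tokens
    return weights @ tokens                          # (num_latents, dim)

rng = np.random.default_rng(0)
latents = rng.normal(size=(8, 16))         # 8 latent queries, dim 16

# Object-trajectory resampler: pool one track's 37 tokens at once.
whole_track = rng.normal(size=(37, 16))
global_summary = resample(latents, whole_track)           # (8, 16)

# Temporal-window resampler: resample each window independently.
windows = np.array_split(whole_track, 4)
local_summaries = [resample(latents, w) for w in windows]  # 4 x (8, 16)
```

The global summary carries object-level semantics; the per-window summaries retain the temporal localization needed for relation detection.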

Results

We evaluate TRASER against leading proprietary and open-source VLMs on four benchmarks using panoptic object trajectories as input. TRASER outperforms all open-source baselines and surpasses GPT-5 in both object and attribute prediction.

Video Scene Graph Generation Results

| Model | Triplet PVSG | Triplet VidOR | Triplet SVG2t | Relation PVSG | Relation VidOR | Relation SVG2t | Object VIPSeg | Object PVSG | Object VidOR | Object SVG2t | Attr SVG2t | Rank |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| *Proprietary Models (API)* | | | | | | | | | | | | |
| GPT-4.1 | 6.0 | 10.8 | 6.4 | 7.3 | 11.9 | 7.5 | 59.9 | 51.4 | 86.6 | 58.5 | 15.8 | 3 |
| Gemini-2.5 Pro | 7.4 | 9.8 | 8.7 | 8.8 | 11.0 | 9.9 | 56.7 | 31.7 | 82.9 | 49.8 | 13.6 | 4 |
| GPT-5 | 16.6 | 19.7 | 17.9 | 18.3 | 21.7 | 19.4 | 68.1 | 54.2 | 88.5 | 65.5 | 24.1 | 2 |
| *Open-Source Models* | | | | | | | | | | | | |
| Qwen2.5-VL-3B | 0.1 | 0.2 | 0.2 | 0.1 | 0.4 | 0.3 | 22.1 | 10.4 | 45.0 | 24.2 | 1.4 | 12 |
| MiniCPM-V 4.5 | 0.1 | 3.0 | 1.1 | 0.2 | 4.0 | 2.4 | 40.0 | 14.3 | 59.1 | 38.5 | 8.4 | 8 |
| InternVL3.5-4B | 0.2 | 0.4 | 0.1 | 0.3 | 0.5 | 0.2 | 33.7 | 20.4 | 66.4 | 35.0 | 7.4 | 11 |
| GLM-4.1-9B-Thinking | 0.3 | 3.9 | 1.8 | 0.5 | 5.0 | 2.9 | 46.5 | 17.8 | 61.1 | 28.5 | 9.1 | 6 |
| Qwen3-VL-4B | 0.1 | 0.7 | 1.4 | 0.1 | 0.8 | 1.6 | 34.1 | 21.8 | 65.8 | 35.6 | 8.3 | 10 |
| Qwen3-VL-4B-Thinking | 0.1 | 2.3 | 3.3 | 0.4 | 3.4 | 3.6 | 35.8 | 18.3 | 67.6 | 37.1 | 8.8 | 7 |
| FT-Qwen2.5-VL-3B (1st Bbox) | 0.1 | 1.6 | 0.1 | 0.9 | 4.5 | 0.5 | 25.5 | 25.9 | 51.7 | 36.7 | 10.4 | 9 |
| FT-Qwen2.5-VL-3B (Bbox Traj.) | 0.5 | 1.8 | 1.4 | 1.6 | 4.2 | 3.0 | 35.1 | 33.6 | 56.9 | 46.1 | 13.4 | 5 |
| TRASER (Ours) | 16.1 | 22.9 | 16.7 | 16.9 | 25.0 | 18.7 | 86.5 | 72.7 | 91.4 | 79.0 | 27.1 | 1 |

For triplet and relation recall, we adopt an IoU threshold of 0.5. SVG2t = SVG2test.
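The IoU-thresholded recall used above can be sketched as follows: a ground-truth triplet counts as recalled if some prediction carries the same relation label and matches both the subject and object masks with IoU at or above the threshold. This is an illustrative greedy matcher on boolean masks, not the benchmark's official evaluation code.

```python
import numpy as np

def iou(a, b):
    """IoU of two boolean masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 0.0

def triplet_recall(gt, pred, thresh=0.5):
    """Fraction of ground-truth (subj_mask, relation, obj_mask) triplets
    matched by some prediction with the same relation label and
    IoU >= thresh on both subject and object masks."""
    hits = 0
    for gs, gr, go in gt:
        if any(pr == gr and iou(gs, ps) >= thresh and iou(go, po) >= thresh
               for ps, pr, po in pred):
            hits += 1
    return hits / len(gt)

m1 = np.array([1, 1, 0, 0], dtype=bool)
m2 = np.array([0, 0, 1, 1], dtype=bool)
gt = [(m1, "left of", m2)]
pred = [(m1, "left of", m2), (m2, "left of", m1)]
print(triplet_recall(gt, pred))  # -> 1.0
```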

Video Question Answering with Scene Graphs

We assess the utility of structured video scene graphs for downstream video QA tasks. High-quality scene graphs from TRASER consistently improve GPT-4.1's VQA accuracy over both video-only inputs and inputs augmented with Qwen2.5-VL's scene graphs.

| Benchmark | Video Only | Video + Qwen2.5-VL's VSG | Video + TRASER's VSG |
|---|---|---|---|
| AGQA 2.0 | 25.9 | 24.8 | 26.3 |
| Perception-Test | 66.8 | 68.6 | 71.4 |
GPT-4.1 VQA accuracy (%) with different inputs. Incorporating TRASER's video scene graphs consistently improves performance, demonstrating the value of structured spatiotemporal representations.
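One plausible way to feed a scene graph to the VQA model is to serialize it as plain text and prepend it to the question. The textual format below is an assumption for illustration; the paper's actual prompt format is not specified here.

```python
def serialize_scene_graph(objects, relations):
    """Render a scene graph as prompt text (hypothetical format).

    objects: id -> (name, [attributes])
    relations: list of (subj_id, relation, obj_id, t_start, t_end)
    """
    lines = ["Objects:"]
    for oid, (name, attrs) in objects.items():
        lines.append(f"  [{oid}] {name} ({', '.join(attrs)})")
    lines.append("Relations:")
    for s, rel, o, t0, t1 in relations:
        lines.append(f"  [{s}] {rel} [{o}] during frames {t0}-{t1}")
    return "\n".join(lines)

objects = {0: ("person", ["standing"]), 1: ("dog", ["brown"])}
relations = [(0, "holding leash of", 1, 0, 120)]
prompt = (serialize_scene_graph(objects, relations)
          + "\n\nQuestion: What is the person doing?")
```

The serialized graph gives the VLM explicit object identities and temporally grounded relations to condition on, which is where the accuracy gains above come from.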

  • TRASER outperforms open-source baselines by +15–20% in relation detection and +30–40% in object prediction.
  • TRASER surpasses GPT-5 on object prediction (+13%) and attribute prediction (+3%).
  • When TRASER's scene graphs are used for video QA, they provide +1.5–4.6% accuracy gains over video-only inputs or videos augmented with Qwen2.5-VL-generated scene graphs.

BibTeX

@article{gao2026svg2,
  author    = {Gao, Ziqi and Zhang, Jieyu and Ikezogwo, Wisdom Oluchi and Park, Jae Sung and You, Tario G and Ogbu, Daniel and Zheng, Chenhao and Huang, Weikai and Yang, Yinuo and Han, Winson and Kong, Quan and Saini, Rajat and Krishna, Ranjay},
  title     = {Synthetic Visual Genome 2: Extracting Large-scale Spatio-Temporal Scene Graphs from Videos},
  year      = {2026},
}