Synthetic Visual Genome 2: Extracting Large-scale Spatio-Temporal Scene Graphs from Videos

¹Allen Institute for AI, ²University of Washington, ³Woven by Toyota

Synthetic Visual Genome 2 (SVG2) provides large-scale video scene graphs with 636K videos, 6.6M objects, 52M attributes, and 6.7M relations for training video understanding models.

Abstract

We introduce Synthetic Visual Genome 2 (SVG2), a large-scale panoptic video scene graph dataset containing over 636K videos with 6.6M objects, 52.0M attributes, and 6.7M relations, providing an order-of-magnitude increase in scale and diversity over prior spatio-temporal scene graph datasets.

To create SVG2, we design a fully automated pipeline that combines multi-scale panoptic segmentation, online-offline trajectory tracking with automatic new-object discovery, per-trajectory semantic parsing, and GPT-5-based spatio-temporal relation inference.

Building on this resource, we train TRASER, a video scene graph generation model that augments a VLM with a trajectory-aligned token arrangement mechanism and two new modules, an object-trajectory resampler and a temporal-window resampler, to convert raw videos and panoptic trajectories into compact spatio-temporal scene graphs in a single forward pass.

On PVSG, VIPSeg, VidOR, and SVG2test, TRASER improves relation detection by +15–20% and object prediction by +30–40% over the strongest open-source baselines. When TRASER's generated scene graphs are provided to a VLM for video question answering, they deliver a +1.5–4.6% absolute accuracy gain, demonstrating the utility of explicit spatio-temporal scene graphs.

Data Generation Pipeline


Our fully automated pipeline integrates SAM2, Describe Anything Model (DAM), and GPT-5 to produce dense, temporally grounded video scene graphs:

  • Phase 1: Panoptic Trajectory Generation - A two-stage online-offline tracking framework achieves dynamic object discovery and global temporal consistency using SAM2 with multi-scale grid prompts.
  • Phase 2: Object Description and Parsing - DAM-3B-Video generates detailed descriptions for each track, then GPT-4-nano extracts object names and attributes. SAM3-based verification filters unreliable labels.
  • Phase 3: Spatiotemporal Relation Extraction - GPT-5 infers inter-object relations including spatial, functional, stateful, motion, social, attentional, and event-level interactions.
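The three phases above can be sketched as a simple driver. This is a minimal illustration of the control flow only: all model calls (SAM2 tracking, DAM captioning, GPT-based parsing and relation inference) are replaced by stand-in functions, and every name here is hypothetical, not the authors' actual API.

```python
def track_objects(frames):
    """Phase 1 stand-in: return trajectories (track id -> per-frame masks)."""
    return {0: [f"mask@{t}" for t in range(len(frames))]}

def describe_and_parse(trajectory):
    """Phase 2 stand-in: caption a track, then parse out name and attributes."""
    return {"name": "object", "attributes": ["generic"]}

def infer_relations(object_ids):
    """Phase 3 stand-in: pairwise relation inference over parsed objects."""
    return [(a, "near", b) for a in object_ids for b in object_ids if a < b]

def build_scene_graph(frames):
    tracks = track_objects(frames)                                   # Phase 1
    objects = {i: describe_and_parse(t) for i, t in tracks.items()}  # Phase 2
    relations = infer_relations(sorted(objects))                     # Phase 3
    return {"objects": objects, "relations": relations}

graph = build_scene_graph(frames=list(range(8)))
```

The real pipeline additionally verifies labels (SAM3-based filtering) and anchors each relation to a temporal span; those steps are omitted here for brevity.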

Human verification on 100 sampled videos shows 93.8% accuracy for object labels, 79.0% for attributes, and 85.4% for relations.

SVG2 Dataset

SVG2 is the first large-scale video scene graph dataset with dense panoptic annotations. We sample 43K videos from SA-V and 593K videos from PVD, resulting in 6.6M object instances, 52M attributes, and 6.7M spatiotemporal relations.

Additionally, we construct SVG2test, a human-annotated benchmark of 100 videos with multi-granularity panoptic annotations following hierarchical human visual perception.

Comparison with Related Benchmarks

| Dataset | #Videos | Annotator | Type | Frames/Vid | #Obj/#Traj | Obj Cls | #Relations | Rel Cls | #Attributes |
|---|---|---|---|---|---|---|---|---|---|
| SA-V | 50.9K | SAM-2 + Human | SegS | 330 | 0.6M | - | - | - | - |
| VIPSeg | 2.8K | Human | SegS | 23 | 38.2K | 124 | - | - | - |
| VidVRD | 0.8K | Human | BoxS | 304 | 2.4K | 35 | 25.9K | 132 | - |
| Action Genome | 7.8K | Human | BoxS | 808 | 26.2K | 35 | 1.4M* | 26 | - |
| VidOR | 7.0K | Human | BoxD | 1K | 34.6K | 80 | 0.3M | 50 | - |
| PVSG | 338 | Human | SegS | 375 | 6.3K | 125 | 3.6K | 62 | - |
| SVG2 (Ours) | 636K | Our Pipeline | SegD | 479 | 6.6M | 54.2K | 6.7M | 35.3K | 52M |
| SVG2test | 100 | Human | SegD | 160 | 3.2K | 749 | 3.3K | 249 | 9.7K |
Subscripts S/D denote sparse (sampled) vs. dense (per-frame) annotations. (*) Action Genome reports total frame-level relation instances, whereas others report unique trajectory-level instances.

Dataset Statistics

(Figures: object statistics, attribute statistics, and relationship overview.)

TRASER Model

(Figure: TRASER model architecture.)

TRASER is a VLM that produces structured video scene graphs in one forward pass from raw videos and panoptic object trajectories.

Key Components:

  • Trajectory-Aligned Token Arrangement: Binds ViT tokens to object trajectories based on segmentation coverage, producing identity-preserving token streams with explicit trajectory boundaries.
  • Object-Trajectory Resampler: Aggregates global semantics over each object's entire temporal span using Perceiver-Resampler with learnable latent queries.
  • Temporal-Window Resampler: Partitions video into temporal windows and resamples each window independently, preserving fine-grained motion and temporal dynamics crucial for relation detection.
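The trajectory-aligned arrangement can be illustrated with a coverage-based assignment: each ViT patch token is bound to the trajectory whose segmentation mask covers the largest fraction of that patch. This is a minimal sketch under assumed inputs (a per-object, per-patch coverage matrix), not the model's actual implementation.

```python
import numpy as np

def bind_tokens_to_trajectories(patch_coverage, coverage_thresh=0.5):
    """Assign each ViT patch token to an object trajectory (sketch).

    patch_coverage: (num_objects, num_patches) array of per-patch coverage
    fractions in [0, 1], one row per object trajectory.
    Returns a length-num_patches array holding the owning trajectory id,
    or -1 for background patches below the coverage threshold.
    """
    best = patch_coverage.argmax(axis=0)      # most-covering object per patch
    best_cov = patch_coverage.max(axis=0)     # that object's coverage fraction
    return np.where(best_cov >= coverage_thresh, best, -1)

# Two trajectories over four patches: patch 0 belongs to object 0,
# patch 1 to object 1, patches 2 and 3 are background.
cov = np.array([[0.9, 0.2, 0.0, 0.1],
                [0.0, 0.7, 0.3, 0.0]])
owners = bind_tokens_to_trajectories(cov)  # -> [0, 1, -1, -1]
```

Grouping tokens by `owners` then yields one identity-preserving token stream per trajectory, with explicit boundaries between objects.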

The dual resampler design balances global object context with local temporal grounding, enabling accurate prediction of both object attributes and temporally-localized relations.
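The dual resampler design can be sketched with a single cross-attention step: a fixed set of latent queries attends over a variable-length token stream and returns a fixed-size summary. The object-trajectory resampler applies this once over a track's full span; the temporal-window resampler applies it per window. This is a bare numpy sketch (no learned projections or feed-forward layers), not the actual Perceiver-Resampler architecture.

```python
import numpy as np

def resample(latents, tokens):
    """One cross-attention step: latent queries pool a token stream
    into a fixed-size summary, whatever the stream's length."""
    scores = latents @ tokens.T / np.sqrt(latents.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over tokens
    return weights @ tokens                          # (num_latents, dim)

rng = np.random.default_rng(0)
latents = rng.normal(size=(8, 16))         # 8 latent queries, dim 16

# Object-trajectory resampler: pool one track's 37 tokens at once.
whole_track = rng.normal(size=(37, 16))
global_summary = resample(latents, whole_track)           # (8, 16)

# Temporal-window resampler: resample each window independently.
windows = np.array_split(whole_track, 4)
local_summaries = [resample(latents, w) for w in windows]  # 4 x (8, 16)
```

The global summary carries object-level semantics; the per-window summaries retain the temporal localization needed for relation detection.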

Results

We evaluate TRASER against leading proprietary and open-source VLMs on four benchmarks using panoptic object trajectories as input. TRASER outperforms all open-source baselines and surpasses GPT-5 in both object and attribute prediction.

Video Scene Graph Generation Results

| Model | Triplet PVSG | Triplet VidOR | Triplet SVG2t | Relation PVSG | Relation VidOR | Relation SVG2t | Object VIPSeg | Object PVSG | Object VidOR | Object SVG2t | Attr SVG2t | Rank |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| *Proprietary Models (API)* | | | | | | | | | | | | |
| GPT-4.1 | 6.0 | 10.8 | 6.4 | 7.3 | 11.9 | 7.5 | 59.9 | 51.4 | 86.6 | 58.5 | 15.8 | 3 |
| Gemini-2.5 Pro | 7.4 | 9.8 | 8.7 | 8.8 | 11.0 | 9.9 | 56.7 | 31.7 | 82.9 | 49.8 | 13.6 | 4 |
| GPT-5 | 16.6 | 19.7 | 17.9 | 18.3 | 21.7 | 19.4 | 68.1 | 54.2 | 88.5 | 65.5 | 24.1 | 2 |
| *Open-Source Models* | | | | | | | | | | | | |
| Qwen2.5-VL-3B | 0.1 | 0.2 | 0.2 | 0.1 | 0.4 | 0.3 | 22.1 | 10.4 | 45.0 | 24.2 | 1.4 | 12 |
| MiniCPM-V 4.5 | 0.1 | 3.0 | 1.1 | 0.2 | 4.0 | 2.4 | 40.0 | 14.3 | 59.1 | 38.5 | 8.4 | 8 |
| InternVL3.5-4B | 0.2 | 0.4 | 0.1 | 0.3 | 0.5 | 0.2 | 33.7 | 20.4 | 66.4 | 35.0 | 7.4 | 11 |
| GLM-4.1-9B-Thinking | 0.3 | 3.9 | 1.8 | 0.5 | 5.0 | 2.9 | 46.5 | 17.8 | 61.1 | 28.5 | 9.1 | 6 |
| Qwen3-VL-4B | 0.1 | 0.7 | 1.4 | 0.1 | 0.8 | 1.6 | 34.1 | 21.8 | 65.8 | 35.6 | 8.3 | 10 |
| Qwen3-VL-4B-Thinking | 0.1 | 2.3 | 3.3 | 0.4 | 3.4 | 3.6 | 35.8 | 18.3 | 67.6 | 37.1 | 8.8 | 7 |
| FT-Qwen2.5-VL-3B (1st Bbox) | 0.1 | 1.6 | 0.1 | 0.9 | 4.5 | 0.5 | 25.5 | 25.9 | 51.7 | 36.7 | 10.4 | 9 |
| FT-Qwen2.5-VL-3B (Bbox Traj.) | 0.5 | 1.8 | 1.4 | 1.6 | 4.2 | 3.0 | 35.1 | 33.6 | 56.9 | 46.1 | 13.4 | 5 |
| TRASER (Ours) | 16.1 | 22.9 | 16.7 | 16.9 | 25.0 | 18.7 | 86.5 | 72.7 | 91.4 | 79.0 | 27.1 | 1 |

For triplet and relation recall, we adopt an IoU threshold of 0.5. SVG2t = SVG2test.
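The IoU-thresholded recall used above can be sketched as follows: a ground-truth triplet counts as recalled if some prediction carries the same relation label and matches both the subject and object masks with IoU at or above the threshold. This is an illustrative greedy matcher on boolean masks, not the benchmark's official evaluation code.

```python
import numpy as np

def iou(a, b):
    """IoU of two boolean masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 0.0

def triplet_recall(gt, pred, thresh=0.5):
    """Fraction of ground-truth (subj_mask, relation, obj_mask) triplets
    matched by some prediction with the same relation label and
    IoU >= thresh on both subject and object masks."""
    hits = 0
    for gs, gr, go in gt:
        if any(pr == gr and iou(gs, ps) >= thresh and iou(go, po) >= thresh
               for ps, pr, po in pred):
            hits += 1
    return hits / len(gt)

m1 = np.array([1, 1, 0, 0], dtype=bool)
m2 = np.array([0, 0, 1, 1], dtype=bool)
gt = [(m1, "left of", m2)]
pred = [(m1, "left of", m2), (m2, "left of", m1)]
print(triplet_recall(gt, pred))  # -> 1.0
```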

Video Question Answering with Scene Graphs

We assess the utility of structured video scene graphs for downstream video QA tasks. High-quality scene graphs from TRASER consistently improve GPT-4.1's VQA accuracy over both video-only inputs and inputs augmented with Qwen2.5-VL's scene graphs.

| Benchmark | Video Only | Video + Qwen2.5-VL's VSG | Video + TRASER's VSG |
|---|---|---|---|
| AGQA 2.0 | 25.9 | 24.8 | 26.3 |
| Perception-Test | 66.8 | 68.6 | 71.4 |
GPT-4.1 VQA accuracy (%) with different inputs. Incorporating TRASER's video scene graphs consistently improves performance, demonstrating the value of structured spatiotemporal representations.
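One plausible way to feed a scene graph to the VQA model is to serialize it as plain text and prepend it to the question. The textual format below is an assumption for illustration; the paper's actual prompt format is not specified here.

```python
def serialize_scene_graph(objects, relations):
    """Render a scene graph as prompt text (hypothetical format).

    objects: id -> (name, [attributes])
    relations: list of (subj_id, relation, obj_id, t_start, t_end)
    """
    lines = ["Objects:"]
    for oid, (name, attrs) in objects.items():
        lines.append(f"  [{oid}] {name} ({', '.join(attrs)})")
    lines.append("Relations:")
    for s, rel, o, t0, t1 in relations:
        lines.append(f"  [{s}] {rel} [{o}] during frames {t0}-{t1}")
    return "\n".join(lines)

objects = {0: ("person", ["standing"]), 1: ("dog", ["brown"])}
relations = [(0, "holding leash of", 1, 0, 120)]
prompt = (serialize_scene_graph(objects, relations)
          + "\n\nQuestion: What is the person doing?")
```

The serialized graph gives the VLM explicit object identities and temporally grounded relations to condition on, which is where the accuracy gains above come from.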

  • TRASER outperforms open-source baselines by +15–20% in relation detection and +30–40% in object prediction.
  • TRASER surpasses GPT-5 on object prediction (+13%) and attribute prediction (+3%).
  • When TRASER's scene graphs are used for video QA, they provide +1.5–4.6% accuracy gains over video-only inputs or videos augmented with Qwen2.5-VL-generated scene graphs.

BibTeX

@article{gao2026svg2,
  author    = {Gao, Ziqi and Zhang, Jieyu and Ikezogwo, Wisdom Oluchi and Park, Jae Sung and You, Tario G and Ogbu, Daniel and Zheng, Chenhao and Huang, Weikai and Yang, Yinuo and Han, Winson and Kong, Quan and Saini, Rajat and Krishna, Ranjay},
  title     = {Synthetic Visual Genome 2: Extracting Large-scale Spatio-Temporal Scene Graphs from Videos},
  year      = {2026},
}