Generate Any Scene

Abstract

Recent advances in text-to-vision generation excel in visual fidelity but struggle with compositional generalization and semantic alignment. Existing datasets are noisy and weakly compositional, limiting models' understanding of complex scenes, while scalable solutions for dense, high-quality annotations remain a challenge.

We introduce Generate Any Scene (GAS), a data engine that systematically enumerates scene graphs representing the combinatorial array of possible visual scenes. GAS dynamically constructs scene graphs of varying complexity from a structured taxonomy of objects, attributes, and relations. Given a sampled scene graph, GAS translates it into a caption for text-to-image or text-to-video generation; it also translates it into a set of visual question answers that allow automatic evaluation and reward modeling of semantic alignment.

Using GAS, we first design a self-improving framework where models iteratively enhance their performance using generated data. SD v1.5 achieves an average 4% improvement over baselines and surpasses fine-tuning on CC3M. Second, we also design a distillation algorithm to transfer specific strengths from proprietary models to their open-source counterparts. Using fewer than 800 synthetic captions, we fine-tune SD v1.5 and achieve a 10% increase in TIFA score for compositional and hard-concept generation. Third, we create a reward model to align model generation with semantic accuracy at a low cost. Using GRPO algorithm, we fine-tune SimpleAR-0.5B-SFT and surpass CLIP-based methods on DPG-Bench. Finally, we apply these ideas to the downstream task of content moderation where we train models to identify challenging cases by learning from synthetic data.

What is Generate Any Scene?

Modern text-to-vision models produce high-fidelity visuals but struggle with compositional generalization and semantic alignment. Existing real-world datasets (like CC3M) are noisy, weakly compositional, and lack dense, high-quality annotations. Constructing a compositional dataset requires that we first define the space of the visual content. Scene graphs are one such representation of the visual space, grounded in cognitive science. A scene graph represents objects in a scene as individual nodes in a graph. Each object is modified by attributes, which describe its properties. Relationships are edges that connect the nodes. Scene graphs provide explicit compositional structure.

We introduce Generate Any Scene (GAS), a controllable scene-graph-driven data engine that produces synthetic captions and QA for training, evaluation, reward modeling, and robustness improvement in text-to-vision systems. GAS systematically enumerates scene graphs representing the combinatorial array of possible visual scenes.

Generating Data with Scene Graphs

To construct a scene graph, we use three main metadata types: objects, attributes, and relations. We further introduce scene attributes that capture global visual contexts, such as art style, to facilitate comprehensive caption synthesis. We build a hierarchical taxonomy that categorizes metadata into distinct levels and types, enabling fine-grained analysis.

1

Scene graph structure enumeration Our engine pre-computes a library of directed scene-graph topologies subject to user-specified structural constraints: complexity (total number of objects, relations, and attributes), average node degree, and number of connected components. We first sample the number of object nodes and then systematically enumerate feasible edge sets and attribute attachments that satisfy these constraints.

2

Populate the scene graph structure with metadata Given a generated scene graph structure, for each object node, attribute node, and relation edge, we sample the corresponding content from our metadata. This process is highly customizable and controllable: users can define the topics and types of metadata to include and easily extend the diversity and richness of scene graphs by adding new metadata.

3

Sample scene attributes We also include scene attributes that describe aspects such as the art style, viewpoint, time span (for video), and 3D attributes (for 3D content). These scene attributes are sampled directly from our metadata, creating a list that provides contextual details to enrich the description of the visual content.

4

Translate scene graph to caption We introduce a deterministic and programmatic algorithm that converts scene graphs with scene attributes into captions. It traverses scene graphs by converting objects/attributes/relations into descriptive text in topological order, while tracking each object's references to ensure coherence.

5

Convert scene graph to question-answer pairs Given a synthetic scene graph, GAS automatically enumerates exhaustive QA pairs using templates that query object attributes (e.g., What color is the sphere?), spatial relations (e.g., What is to the left of the cube?), and other compositional elements. Each answer maps directly to an object, attribute, or edge, ensuring full coverage of the graph at minimal cost.

Taxonomy of Visual Elements

Summary of the quantities and sources of visual elements:

Metadata Type	Count	Source
Objects	28,787	WordNet
Attributes	1,494	Wikipedia, curated lists
Relations	10,492	Synthetic Visual Genome
Scene Attributes	2,193	Places365, curated lists

Applications

1. Self-Improving Models

GAS-generated captions can be used to iteratively fine-tune a model on its own synthetic distribution. Applying this loop to Stable Diffusion v1.5 yields approximately 4% improvement in compositional quality over the pre-trained baseline, and surpasses fine-tuning on the real-image CC3M dataset — without any human-annotated data.

2. Distilling Targeted Capabilities

GAS enables targeted capability transfer from strong models to weaker ones. Using GAS-structured prompts, we generate fewer than 800 synthetic image-caption pairs from DALL-E 3 and fine-tune SD v1.5 on them. This compact distillation achieves a ~10% improvement in TIFA score on compositional generation tasks — demonstrating that structured synthetic data is highly efficient for closing compositional gaps.

3. Reinforcement Learning with Synthetic Reward

Scene-graph QA pairs from GAS serve as verifiable reward signals for reinforcement learning with GRPO. Applied to SimpleAR-0.5B-SFT, this approach achieves a improvement on DPG-Bench compared to CLIP-based reward methods — highlighting that structured, interpretable rewards derived from scene graphs are more effective than perceptual similarity metrics alone.

4. Generated-Content Detection

GAS synthetic data can train detectors that distinguish AI-generated images from real ones. A lightweight ViT-T detector trained on GAS-generated images improves F1 scores across both cross-model and cross-dataset evaluation scenarios, demonstrating that controlled, diverse synthetic data is a reliable resource for content provenance and moderation tasks.

BibTeX

@inproceedings{gao2026generate,
  title={Generate Any Scene: Scene Graph Driven Data Synthesis for Visual Generation Training},
  author={Ziqi Gao and Weikai Huang and Jieyu Zhang and Aniruddha Kembhavi and Ranjay Krishna},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026}
}

Generate Any Scene: Scene Graph Driven Data Synthesis for Visual Generation Training