How to Train your Text‑to‑Image Model

Abstract

Training data is at the core of any successful text‑to‑image model. Yet web‑scraped image captions are notoriously noisy and inconsistent. Recent works therefore replace them with synthetic captions – but optimal design choices remain unclear. We systematically investigate how caption density, quality and diversity affect downstream diffusion models. Dense, high‑quality captions improve prompt alignment but often hurt output aesthetics and variety. Conversely, sampling captions of randomized length yields balanced gains in aesthetics and alignment without sacrificing diversity. Finally, we show that caption distributions strongly influence societal bias in generated images. Our study provides practical guidance for crafting more effective training data for text‑to‑image generation.

Core Insights

We ran dozens of experiments on how different synthetic captioning strategies for text-to-image pretraining influence downstream performance, so you don't have to. Here are the main takeaways.

Key Recommendations:

Use a strong VLM: High-quality base captions from advanced models with carefully crafted prompts significantly improve downstream performance.
Use varied caption lengths: Random sampling of caption lengths (5-50 words) provides the best balance of aesthetics and alignment.
Avoid overly dense captions: While detailed descriptions improve prompt following, they can hurt visual quality and diversity, especially for short prompts.

1. Model Selection & Caption Density vs Quality Trade-offs

Stronger, larger VLMs produce better captions, leading to better downstream performance
Dense, high-quality captions improve prompt alignment
However, they reduce output aesthetics scores
Variety in generated images decreases with overly detailed captions

2. Diversify Training Captions

Diversity in training captions mitigates aesthetic and alignment trade-offs
Randomizing caption lengths provides optimal balance
Improves aesthetics & diversity while maintaining alignment gains

No improvements from varying captions between epochs or using personas in captioning.
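
The length-randomization idea above can be sketched roughly as follows. This is a minimal illustration, not the procedure used in the study: it assumes each image already has several candidate captions of different densities, draws a target length uniformly from the 5-50 word range, and keeps the candidate closest to that target. The function names (`sample_target_length`, `pick_caption`) and the uniform distribution are assumptions made for this example.

```python
import random


def sample_target_length(rng, min_words=5, max_words=50):
    """Draw a target caption length in words (uniform draw is an assumption)."""
    return rng.randint(min_words, max_words)


def pick_caption(captions, rng, min_words=5, max_words=50):
    """Pick the candidate caption whose word count is closest to a random target.

    `captions` is a hypothetical list of captions for one image, written at
    different densities (for example by a VLM prompted for different lengths).
    """
    target = sample_target_length(rng, min_words, max_words)
    return min(captions, key=lambda c: abs(len(c.split()) - target))


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    candidates = [
        "A dog on a beach.",
        "A brown dog running along a sandy beach at sunset.",
        "A brown, short-haired dog running along a wide sandy beach at sunset, "
        "with gentle waves in the background and orange light reflecting off "
        "the wet sand near the shoreline.",
    ]
    for _ in range(3):
        print(pick_caption(candidates, rng))
```

Calling `pick_caption` once per training sample would expose the model to a mix of short and long descriptions over the course of training, which is the effect the recommendation is after.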

3. Bias Amplification & Mitigation

Caption distributions directly influence societal bias in generated images
Gender bias increases with stereotypical caption patterns

Datasets

We provide two datasets to facilitate further research and development in synthetic caption design for text-to-image models.

Synthetic Captions Dataset

We release all synthetic captions generated in our study.

Over 39M synthetic captions for 1M LAION aesthetics images
Sampled from different VLMs
Captions vary in density, captioning setup and sampling strategy

Download Captions on HuggingFace

Controlled Evaluation Prompts

Our density-controlled evaluation prompts are designed to test alignment, aesthetics, and output diversity.

4340 English text-to-image prompts
Organized into 4 parallel sets at different levels of density/descriptiveness (minimal → long).
1085 short prompts sourced from DrawBench and Parti-Prompts.
Rewritten by GPT-4o in different densities while preserving the original meaning and subject.

Evaluation Prompts on HuggingFace

BibTeX

If you find our work useful, or use our datasets, please cite

@article{brack2025howtotrain,
  title={How to Train your Text-to-Image Model: Evaluating Design Choices for Synthetic Training Captions},
  author={Manuel Brack and Sudeep Katakol and Felix Friedrich and Patrick Schramowski and Hareesh Ravi and Kristian Kersting and Ajinkya Kale},
  journal={arXiv preprint arXiv:2506.16679},
  year={2025}
}
