How to Train your Text‑to‑Image Model: Evaluating Design Choices for Synthetic Training Captions

Adobe Applied Research · hessian.AI · TU Darmstadt · DFKI
arXiv preprint 2025

Abstract

Training data is at the core of any successful text‑to‑image model. Yet web‑scraped image captions are notoriously noisy and inconsistent. Recent works therefore replace them with synthetic captions – but optimal design choices remain unclear. We systematically investigate how caption density, quality and diversity affect downstream diffusion models. Dense, high‑quality captions improve prompt alignment but often hurt output aesthetics and variety. Conversely, sampling captions of randomized length yields balanced gains in aesthetics and alignment without sacrificing diversity. Finally, we show that caption distributions strongly influence societal bias in generated images. Our study provides practical guidance for crafting more effective training data for text‑to‑image generation.

Core Insights

We ran dozens of experiments on how different synthetic captioning strategies for text-to-image pretraining influence downstream models, so you don't have to. Here are the main takeaways.

Key Recommendations:

  • Use a strong VLM: High-quality base captions from advanced models with carefully crafted prompts significantly improve downstream performance
  • Use varied caption lengths: Randomly sampling caption lengths (5-50 words) provides the best balance of aesthetics and alignment (see the sketch after this list)
  • Avoid overly dense captions: While detailed descriptions improve prompt following, they can hurt visual quality and diversity, especially for short prompts
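As a rough illustration of the first two recommendations, the sketch below asks a VLM for a caption whose target length is drawn uniformly from 5-50 words. This is a minimal sketch, not the exact setup used in our experiments: the model name, endpoint, and prompt wording are assumptions for illustration.

import random
from openai import OpenAI  # assumption: the captioning VLM is served behind an OpenAI-compatible API

client = OpenAI()  # e.g., OpenAI(base_url=...) for a locally hosted open VLM

def caption_image(image_url: str, rng: random.Random) -> str:
    # Randomize the requested caption density (5-50 words, per the recommendation above).
    target_len = rng.randint(5, 50)
    prompt = (
        f"Describe this image in roughly {target_len} words. "
        "Mention the main subjects, their attributes, and the setting; do not speculate."
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # assumption: any sufficiently strong VLM with a carefully crafted prompt
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    )
    return response.choices[0].message.content.strip()

caption = caption_image("https://example.com/sample.jpg", random.Random(0))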

1. Model Selection & Caption Density vs Quality Trade-offs

  • Stronger, larger VLMs produce better captions, leading to better downstream performance
  • Dense, high-quality captions improve prompt alignment
  • However, they reduce output aesthetics scores
  • Variety in generated images decreases with overly detailed captions

2. Diversify Training Captions

  • Diversity in training captions mitigates aesthetic and alignment trade-offs
  • Randomizing caption lengths provides the best overall balance (see the sketch after this list)
  • Improves aesthetics & diversity while maintaining alignment gains

  • No improvements from varying captions between epochs or from using personas during captioning.
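To make the length randomization concrete, here is a minimal sketch of assigning training captions, assuming each image already has several pre-generated captions of varying density (the pool structure and field names are hypothetical). One caption of randomized length is picked per image and then kept fixed, in line with the finding that re-sampling captions between epochs brings no improvement.

import random

def assign_training_captions(caption_pool: dict[str, list[str]],
                             min_words: int = 5,
                             max_words: int = 50,
                             seed: int = 0) -> dict[str, str]:
    # caption_pool maps an image id to candidate captions of different densities (hypothetical layout).
    rng = random.Random(seed)
    assigned = {}
    for image_id, candidates in caption_pool.items():
        target_len = rng.randint(min_words, max_words)
        # Pick the candidate whose word count is closest to the sampled target length.
        assigned[image_id] = min(candidates, key=lambda c: abs(len(c.split()) - target_len))
    return assigned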

3. Bias Amplification & Mitigation

  • Caption distributions directly influence societal bias in generated images
  • Gender bias increases with stereotypical caption patterns

Datasets

We provide two datasets to facilitate further research and development in synthetic caption design for text-to-image models.

Synthetic Captions Dataset

We release all synthetic captions generated in our study.

  • Over 39M synthetic captions for 1M LAION-Aesthetics images
  • Sampled from multiple VLMs
  • Captions vary in density, captioning setup, and sampling strategy
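The exact release format may differ, but a typical way to consume such a caption dump is sketched below, assuming a JSONL file with hypothetical image_id, caption, and captioning_model fields.

import json
from collections import defaultdict

def load_captions(path: str) -> dict[str, list[dict]]:
    # Group all synthetic captions by image id (hypothetical JSONL schema).
    captions_by_image = defaultdict(list)
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)  # e.g. {"image_id": ..., "caption": ..., "captioning_model": ...}
            captions_by_image[record["image_id"]].append(record)
    return dict(captions_by_image)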

Controlled Evaluation Prompts

Our density-controlled evaluation prompts are designed to test alignment, aesthetics, and output diversity.

  • 4340 English text-to-image prompts
  • Organized into 4 parallel sets at different levels of density/descriptiveness (minimal → long)
  • 1085 short base prompts sourced from DrawBench and Parti-Prompts
  • Rewritten by GPT-4o at different densities while preserving the original meaning and subject
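When evaluating, the four parallel sets can be iterated per base prompt so that alignment, aesthetics, and diversity are compared on matched content at different densities. The sketch below assumes a CSV with one row per base prompt and one column per density level; the column names and file format are assumptions about the release.

import csv

DENSITY_LEVELS = ["minimal", "short", "medium", "long"]  # assumed names; only the minimal/long extremes are stated above

def iter_prompt_sets(path: str):
    # Yield (prompt_id, {density_level: prompt}) for each of the 1085 base prompts.
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            yield row["prompt_id"], {level: row[level] for level in DENSITY_LEVELS}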

BibTeX

If you find our work useful or use our datasets, please cite:

@article{brack2025howtotrain,
  title={How to Train your Text-to-Image Model: Evaluating Design Choices for Synthetic Training Captions},
  author={Manuel Brack and Sudeep Katakol and Felix Friedrich and Patrick Schramowski and Hareesh Ravi and Kristian Kersting and Ajinkya Kale},
  journal={arXiv preprint arXiv:2506.16679},
  year={2025}
}