Key Recommendations:
- Use a strong VLM: High-quality base captions from advanced models with carefully crafted prompts significantly improve downstream performance
- Use varied caption lengths: Random sampling of caption lengths (5-50 words) provides the best balance of aesthetics and alignment
- Avoid overly dense captions: While detailed descriptions improve prompt following, they can hurt visual quality and diversity, especially for short prompts
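
The varied-length recommendation could be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes the target length is drawn uniformly from the 5-50 word range and passed to the captioning VLM through a prompt. The uniform distribution, the prompt wording, and the names `PROMPT_TEMPLATE`, `sample_caption_length`, and `build_caption_prompt` are all hypothetical.

```python
import random

# Word-count range taken from the recommendation above.
MIN_WORDS = 5
MAX_WORDS = 50

# Hypothetical prompt template; the actual wording sent to the VLM is an assumption.
PROMPT_TEMPLATE = "Describe this image in about {n} words."


def sample_caption_length(rng, min_words=MIN_WORDS, max_words=MAX_WORDS):
    """Draw a target caption length (in words) from [min_words, max_words].

    Uniform sampling is an assumption; the source only says "random sampling".
    """
    return rng.randint(min_words, max_words)  # both endpoints are inclusive


def build_caption_prompt(rng):
    """Return the sampled target length and the captioning prompt built from it."""
    n = sample_caption_length(rng)
    return n, PROMPT_TEMPLATE.format(n=n)


if __name__ == "__main__":
    rng = random.Random(0)  # seeded so the run is reproducible
    for _ in range(3):
        n, prompt = build_caption_prompt(rng)
        print(n, prompt)
```

Drawing a fresh length per training example means the model sees both short and long captions for similar content, which is one plausible way to get the balance described above without committing to uniformly dense descriptions.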