Vision-Language Programs pair VLM perception with symbolic program synthesis to produce executable, interpretable rules for few-shot visual reasoning.
Vision-language models (VLMs) achieve strong performance on multimodal tasks but often fail at systematic visual reasoning. We propose Vision-Language Programs (VLP), which combine the perceptual flexibility of VLMs with the structured reasoning of program synthesis. VLP prompts a VLM to produce type-constrained visual descriptors, compiles them into executable programs, and searches for the most accurate, highest-likelihood rule under a probabilistic grammar. The resulting programs execute directly on images, remain consistent with task constraints, and provide human-interpretable explanations that enable shortcut mitigation. Experiments on synthetic and real-world datasets demonstrate that VLPs outperform direct and structured prompting, especially on tasks requiring complex logical reasoning.
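The pipeline can be sketched in a few lines of Python. The names below (propose_descriptors, Descriptor, synthesize) are illustrative placeholders rather than the paper's API, and the toy enumeration of conjunctions stands in for PCFG-guided search over a richer grammar with quantifiers, negation, and disjunction.

from dataclasses import dataclass
from itertools import combinations
from typing import Callable, Dict, List

Image = Dict[str, object]  # toy stand-in: an image reduced to its VLM-grounded facts

@dataclass(frozen=True)
class Descriptor:
    """A type-constrained symbol proposed by the VLM: a name plus an executable check."""
    name: str
    check: Callable[[Image], bool]

def propose_descriptors(support: List[Image]) -> List[Descriptor]:
    # Placeholder for the VLM call; real descriptors are grounded in the
    # support images and constrained by the grammar's types.
    return [
        Descriptor("has_round_object", lambda im: "round" in im.get("properties", [])),
        Descriptor("has_animal", lambda im: "animal" in im.get("categories", [])),
    ]

def synthesize(support_pos: List[Image], support_neg: List[Image]) -> Callable[[Image], bool]:
    """Enumerate small conjunctions of descriptors and return one consistent
    with every positive and negative support example (shorter programs first)."""
    descriptors = propose_descriptors(support_pos + support_neg)
    for size in range(1, len(descriptors) + 1):
        for combo in combinations(descriptors, size):
            program = lambda im, c=combo: all(d.check(im) for d in c)
            if all(program(im) for im in support_pos) and not any(program(im) for im in support_neg):
                return program
    raise ValueError("no consistent program in this toy search space")

Calling synthesize with a handful of labeled support images yields an executable rule that can then be applied directly to unseen query images.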
VLP cleanly splits perception from reasoning: a VLM supplies symbols, program synthesis handles logic, and programs execute directly on images. The probabilistic context-free grammar (PCFG) prior steers search toward concise, high-accuracy rules that remain consistent with every positive and negative example.
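To make the role of the prior concrete, here is a minimal selection sketch: candidates are ranked by accuracy on the support set, with ties broken by the log-likelihood of their derivation under the grammar. The production probabilities and the Candidate structure are assumptions for illustration, not the paper's actual implementation.

import math
from dataclasses import dataclass
from typing import Callable, Dict, List

# Illustrative production probabilities; the real grammar and weights are not specified here.
LOGP = {"exists": math.log(0.4), "and": math.log(0.3), "not": math.log(0.1), "leaf": math.log(0.2)}

@dataclass
class Candidate:
    fn: Callable[[Dict], bool]   # compiled, executable program
    productions: List[str]       # grammar productions used in its derivation

    def log_prior(self) -> float:
        # PCFG likelihood is the product of production probabilities, so longer
        # derivations accumulate lower log-probability and concise rules win ties.
        return sum(LOGP[p] for p in self.productions)

def select(candidates: List[Candidate], pos: List[Dict], neg: List[Dict]) -> Candidate:
    """Rank by support-set accuracy first, then by PCFG log-prior."""
    def accuracy(c: Candidate) -> float:
        hits = sum(c.fn(x) for x in pos) + sum(not c.fn(x) for x in neg)
        return hits / max(1, len(pos) + len(neg))
    return max(candidates, key=lambda c: (accuracy(c), c.log_prior()))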
On Bongard-RWR, direct VLM prompting hallucinated an “abundance” rule. VLP grounded the objects, inferred the property “round”, and synthesized a concise executable program that correctly classified all queries.
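For intuition, the synthesized rule for this task could compile to a check as simple as the following over VLM-grounded objects; the object representation is illustrative.

def rule(objects):
    # objects: VLM-grounded detections, e.g. [{"label": "clock", "properties": ["round"]}]
    return any("round" in o.get("properties", []) for o in objects)

print(rule([{"label": "clock", "properties": ["round"]}]))      # True  (positive side)
print(rule([{"label": "box", "properties": ["rectangular"]}]))  # False (negative side)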
@article{wuest2025vlp,
  title={Synthesizing Visual Concepts as Vision-Language Programs},
  author={W{\"u}st, Antonia and Stammer, Wolfgang and Shindo, Hikaru and Dhami, Devendra Singh and Helff, Lukas and Kersting, Kristian},
  journal={arXiv preprint},
  year={2025}
}