Synthesizing Visual Concepts as Vision-Language Programs

Vision-Language Programs pair VLM perception with symbolic program synthesis to produce executable, interpretable rules for few-shot visual reasoning.

1 AI/ML Lab, TU Darmstadt · 2 Max Planck Institute for Informatics · 3 Hessian Center for AI (hessian.AI) · 4 Uncertainty in AI Group, TU Eindhoven · 5 German Research Center for AI (DFKI)
Motivation figure

Why Vision-Language Programs?

VLMs excel at perception but fall apart on systematic visual reasoning, often hallucinating rules that violate constraints. VLP keeps VLMs for perception while delegating logic to program synthesis, so rules stay executable and auditable.

  • Structured visual descriptions become neurosymbolic programs.
  • Programs run directly on images, respecting positive/negative evidence.
  • Transparent explanations make shortcut mitigation straightforward.

Abstract

Vision-language models (VLMs) achieve strong performance on multimodal tasks but often fail at systematic visual reasoning. We propose Vision-Language Programs (VLP), which combine the perceptual flexibility of VLMs with the structured reasoning of program synthesis. VLP asks a VLM to produce type-constrained visual descriptors, compiles them into executable programs, and searches for the most accurate, highest-likelihood rule under a probabilistic grammar. The resulting programs execute on images, remain consistent with task constraints, and provide human-interpretable explanations that enable shortcut mitigation. Experiments on synthetic and real-world datasets demonstrate that VLPs outperform direct and structured prompting, especially on tasks requiring complex logical reasoning.
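To make "type-constrained visual descriptors" concrete, here is a minimal Python sketch of how a VLM's symbol proposals could be validated against a declared vocabulary; the schema, field names, and vocabulary are illustrative assumptions, not the paper's exact format.

from dataclasses import dataclass
from typing import List

ALLOWED_PROPERTIES = {"round", "red", "large", "metal"}  # hypothetical task vocabulary

@dataclass
class ObjectDescriptor:
    name: str              # e.g. "ball"
    properties: List[str]  # each entry must come from the allowed vocabulary

def validate(descriptors: List[ObjectDescriptor]) -> List[ObjectDescriptor]:
    """Reject VLM outputs that fall outside the declared types and vocabulary."""
    for d in descriptors:
        bad = [p for p in d.properties if p not in ALLOWED_PROPERTIES]
        if bad:
            raise ValueError(f"untyped properties {bad} for object '{d.name}'")
    return descriptors

# A made-up, already-parsed VLM response for one image:
image_symbols = validate([ObjectDescriptor("ball", ["round", "red"]),
                          ObjectDescriptor("cube", ["large"])])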

Method at a glance

  • Symbol grounding: VLM proposes objects, properties, and actions for the task.
  • Vision-language DSL: Neural perception functions and symbolic operators form a probabilistic grammar.
  • Program search: Heap search finds the most accurate, most probable executable rule (see the toy sketch below).
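As a toy illustration of the last two steps, the sketch below enumerates candidate rules from a hand-written probabilistic grammar in best-first (heap) order and keeps the most accurate one, breaking ties by prior probability. The primitives and weights are invented for illustration; the paper's DSL and search are far richer.

import heapq
import math

# Hypothetical primitives: each maps a scene (list of per-object property sets) to True/False.
PRIMITIVES = {
    "exists_round": (0.5, lambda scene: any("round" in obj for obj in scene)),
    "exists_red":   (0.3, lambda scene: any("red" in obj for obj in scene)),
    "all_large":    (0.2, lambda scene: all("large" in obj for obj in scene)),
}

def heap_search(examples):
    """examples: list of (scene, label); returns (rule_name, accuracy)."""
    # Highest-prior candidates are popped first (best-first enumeration).
    heap = [(-math.log(p), name) for name, (p, _) in PRIMITIVES.items()]
    heapq.heapify(heap)
    best = None  # (accuracy, log_prior, rule_name)
    while heap:
        neg_logp, name = heapq.heappop(heap)
        rule = PRIMITIVES[name][1]
        acc = sum(rule(scene) == label for scene, label in examples) / len(examples)
        candidate = (acc, -neg_logp, name)
        if best is None or candidate > best:
            best = candidate
    return best[2], best[0]

# Two tiny labeled scenes:
examples = [([{"round", "red"}], True), ([{"large"}], False)]
print(heap_search(examples))  # ('exists_round', 1.0): accurate and highest-prior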

Key findings

  • Up to +13.5% balanced accuracy over direct VLM prompting.
  • Small VLMs benefit the most, retaining performance on out-of-distribution synthetic scenes.
  • Structured programs stay consistent even when dataset labels are noisy.
  • DSL edits enable knowledge insertion and shortcut mitigation.

Vision-Language Programs Pipeline

VLP cleanly splits perception from reasoning: a VLM supplies symbols, program synthesis handles logic, and programs execute directly on images. The probabilistic context-free grammar (PCFG) prior steers the search toward concise, high-accuracy rules that honor every positive and negative example.

VLP pipeline overview
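A minimal sketch of that perception/reasoning split, under assumed function names and a stubbed perception call: the symbolic rule is kept only if it holds on every positive example and on no negative one.

from typing import Callable, List, Set

def perceive(image_id: str) -> List[Set[str]]:
    """Stand-in for the VLM perception step: per-object property sets for an image."""
    fake_vlm_output = {
        "pos_1": [{"round", "red"}], "pos_2": [{"round"}],
        "neg_1": [{"square"}],       "neg_2": [{"large", "square"}],
    }
    return fake_vlm_output[image_id]

def some_object_is_round(scene: List[Set[str]]) -> bool:
    """An example synthesized rule, executed on the grounded scene."""
    return any("round" in obj for obj in scene)

def consistent(rule: Callable, positives: List[str], negatives: List[str]) -> bool:
    """Keep a program only if it honors every positive and negative example."""
    return (all(rule(perceive(i)) for i in positives)
            and not any(rule(perceive(i)) for i in negatives))

print(consistent(some_object_is_round, ["pos_1", "pos_2"], ["neg_1", "neg_2"]))  # True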

Qualitative reasoning

On Bongard-RWR, direct VLM prompting hallucinated an “abundance” rule. VLP grounded the objects, inferred the property “round”, and synthesized a concise executable program that correctly classified all queries.

VLP qualitative example on Bongard-RWR
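For intuition, a program like the one described above might print as a small nested expression and be interpreted directly on a grounded scene; the Lisp-like surface syntax and the tiny interpreter below are our own sketch, not the paper's concrete DSL.

PROGRAM = ("exists", ("has_property", "round"))  # "some object is round"

def evaluate(expr, context):
    """Interpret the nested rule; `context` is a scene (list of objects) or one object."""
    op = expr[0]
    if op == "exists":        # context: a scene, i.e. a list of per-object property sets
        return any(evaluate(expr[1], obj) for obj in context)
    if op == "has_property":  # context: a single object's property set
        return expr[1] in context
    raise ValueError(f"unknown operator {op!r}")

query_scene = [{"round", "small"}, {"square"}]  # made-up grounded query image
print(PROGRAM, "->", evaluate(PROGRAM, query_scene))  # -> True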

BibTeX

@article{wuest2025vlp,
  title={Synthesizing Visual Concepts as Vision-Language Programs},
  author={W{\"u}st, Antonia and Stammer, Wolfgang and Shindo, Hikaru and Dhami, Devendra Singh and Helff, Lukas and Kersting, Kristian},
  journal={arXiv preprint},
  year={2025}
}