Synthesizing Visual Concepts as Vision-Language Programs

Vision-Language Programs pair VLM perception with symbolic program synthesis to produce executable, interpretable rules for few-shot visual reasoning.

1 AI/ML Lab, TU Darmstadt · 2 Max Planck Institute for Informatics · 3 Hessian Center for AI (hessian.AI) · 4 Uncertainty in AI Group, TU Eindhoven · 5 German Research Center for AI (DFKI)
Motivation figure

Why Vision-Language Programs?

VLMs excel at perception but fall apart on systematic visual reasoning, often hallucinating rules that violate constraints. VLP keeps VLMs for perception while delegating logic to program synthesis, so rules stay executable and auditable.

  • Structured visual descriptions become neurosymbolic programs.
  • Programs run directly on images, respecting positive/negative evidence.
  • Transparent explanations make shortcut mitigation straightforward.

Abstract

Vision-language models (VLMs) achieve strong performance on multimodal tasks but often fail at systematic visual reasoning. We propose Vision-Language Programs (VLP), which combine the perceptual flexibility of VLMs with the structured reasoning of program synthesis. VLP asks a VLM to produce type-constrained visual descriptors, compiles them into executable programs, and searches for the most accurate, highest-likelihood rule under a probabilistic grammar. The resulting programs execute on images, remain consistent with task constraints, and provide human-interpretable explanations that enable shortcut mitigation. Experiments on synthetic and real-world datasets demonstrate that VLPs outperform direct and structured prompting, especially on tasks requiring complex logical reasoning.
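To make "type-constrained visual descriptors" concrete, here is a minimal Python sketch of how a VLM's symbol proposals could be validated against a declared vocabulary; the schema, field names, and vocabulary are illustrative assumptions, not the paper's exact format.

from dataclasses import dataclass
from typing import List

ALLOWED_PROPERTIES = {"round", "red", "large", "metal"}  # hypothetical task vocabulary

@dataclass
class ObjectDescriptor:
    name: str              # e.g. "ball"
    properties: List[str]  # each entry must come from the allowed vocabulary

def validate(descriptors: List[ObjectDescriptor]) -> List[ObjectDescriptor]:
    """Reject VLM outputs that fall outside the declared types and vocabulary."""
    for d in descriptors:
        bad = [p for p in d.properties if p not in ALLOWED_PROPERTIES]
        if bad:
            raise ValueError(f"untyped properties {bad} for object '{d.name}'")
    return descriptors

# A made-up, already-parsed VLM response for one image:
image_symbols = validate([ObjectDescriptor("ball", ["round", "red"]),
                          ObjectDescriptor("cube", ["large"])])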

Method at a glance

  • Symbol grounding: VLM proposes objects, properties, and actions for the task.
  • Vision-language DSL: Neural perception functions and symbolic operators form a probabilistic grammar.
  • Program search: Heap search finds the most accurate, most probable executable rule (see the toy sketch below).
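As a toy illustration of the last two steps, the sketch below enumerates candidate rules from a hand-written probabilistic grammar in best-first (heap) order and keeps the most accurate one, breaking ties by prior probability. The primitives and weights are invented for illustration; the paper's DSL and search are far richer.

import heapq
import math

# Hypothetical primitives: each maps a scene (list of per-object property sets) to True/False.
PRIMITIVES = {
    "exists_round": (0.5, lambda scene: any("round" in obj for obj in scene)),
    "exists_red":   (0.3, lambda scene: any("red" in obj for obj in scene)),
    "all_large":    (0.2, lambda scene: all("large" in obj for obj in scene)),
}

def heap_search(examples):
    """examples: list of (scene, label); returns (rule_name, accuracy)."""
    # Highest-prior candidates are popped first (best-first enumeration).
    heap = [(-math.log(p), name) for name, (p, _) in PRIMITIVES.items()]
    heapq.heapify(heap)
    best = None  # (accuracy, log_prior, rule_name)
    while heap:
        neg_logp, name = heapq.heappop(heap)
        rule = PRIMITIVES[name][1]
        acc = sum(rule(scene) == label for scene, label in examples) / len(examples)
        candidate = (acc, -neg_logp, name)
        if best is None or candidate > best:
            best = candidate
    return best[2], best[0]

# Two tiny labeled scenes:
examples = [([{"round", "red"}], True), ([{"large"}], False)]
print(heap_search(examples))  # ('exists_round', 1.0): accurate and highest-prior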

Key findings

  • Up to +13.5% balanced accuracy over direct VLM prompting.
  • Small VLMs benefit the most, retaining performance on out-of-distribution synthetic scenes.
  • Structured programs stay consistent even when dataset labels are noisy.
  • DSL edits enable knowledge insertion and shortcut mitigation.

Vision-Language Programs Pipeline

VLP cleanly splits perception from reasoning: a VLM supplies symbols, program synthesis handles logic, and programs execute directly on images. The probabilistic context-free grammar (PCFG) prior steers the search toward concise, high-accuracy rules that honor every positive and negative example.

VLP pipeline overview
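A minimal sketch of that perception/reasoning split, under assumed function names and a stubbed perception call: the symbolic rule is kept only if it holds on every positive example and on no negative one.

from typing import Callable, List, Set

def perceive(image_id: str) -> List[Set[str]]:
    """Stand-in for the VLM perception step: per-object property sets for an image."""
    fake_vlm_output = {
        "pos_1": [{"round", "red"}], "pos_2": [{"round"}],
        "neg_1": [{"square"}],       "neg_2": [{"large", "square"}],
    }
    return fake_vlm_output[image_id]

def some_object_is_round(scene: List[Set[str]]) -> bool:
    """An example synthesized rule, executed on the grounded scene."""
    return any("round" in obj for obj in scene)

def consistent(rule: Callable, positives: List[str], negatives: List[str]) -> bool:
    """Keep a program only if it honors every positive and negative example."""
    return (all(rule(perceive(i)) for i in positives)
            and not any(rule(perceive(i)) for i in negatives))

print(consistent(some_object_is_round, ["pos_1", "pos_2"], ["neg_1", "neg_2"]))  # True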

Qualitative reasoning

On Bongard-RWR, direct VLM prompting hallucinated an “abundance” rule. VLP grounded the objects, inferred the property “round”, and synthesized a concise executable program that correctly classified all queries.

VLP qualitative example on Bongard-RWR
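For intuition, a program like the one described above might print as a small nested expression and be interpreted directly on a grounded scene; the Lisp-like surface syntax and the tiny interpreter below are our own sketch, not the paper's concrete DSL.

PROGRAM = ("exists", ("has_property", "round"))  # "some object is round"

def evaluate(expr, context):
    """Interpret the nested rule; `context` is a scene (list of objects) or one object."""
    op = expr[0]
    if op == "exists":        # context: a scene, i.e. a list of per-object property sets
        return any(evaluate(expr[1], obj) for obj in context)
    if op == "has_property":  # context: a single object's property set
        return expr[1] in context
    raise ValueError(f"unknown operator {op!r}")

query_scene = [{"round", "small"}, {"square"}]  # made-up grounded query image
print(PROGRAM, "->", evaluate(PROGRAM, query_scene))  # -> True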

BibTeX

@article{wuest2025vlp,
  title={Synthesizing Visual Concepts as Vision-Language Programs},
  author={W{\"u}st, Antonia and Stammer, Wolfgang and Shindo, Hikaru and Dhami, Devendra Singh and Helff, Lukas and Kersting, Kristian},
  journal={arXiv preprint},
  year={2025}
}