Synthesizing Visual Concepts as Vision-Language Programs

Vision-Language Programs pair VLM perception with symbolic program synthesis to produce executable, interpretable rules for few-shot visual reasoning.

¹AI/ML Lab, TU Darmstadt ²Max Planck Institute for Informatics ³Hessian Center for AI (hessian.AI) ⁴Uncertainty in AI Group, TU Eindhoven ⁵German Research Center for AI (DFKI)

Why Vision-Language Programs?

VLMs excel at perception but fall apart on systematic visual reasoning, often hallucinating rules that violate constraints. VLP keeps VLMs for perception while delegating logic to program synthesis, so rules stay executable and auditable.

Structured visual descriptions become neurosymbolic programs.
Programs run directly on images, respecting positive/negative evidence.
Transparent explanations make shortcut mitigation straightforward.

Abstract

Vision-language models (VLMs) achieve strong performance on multimodal tasks but often fail at systematic visual reasoning. We propose Vision-Language Programs (VLP), which combine the perceptual flexibility of VLMs with the structured reasoning of program synthesis. VLP asks a VLM to produce type-constrained visual descriptors, compiles them into executable programs, and searches for the most accurate, highest-likelihood rule under a probabilistic grammar. The resulting programs execute on images, remain consistent with task constraints, and provide human-interpretable explanations that enable shortcut mitigation. Experiments on synthetic and real-world datasets demonstrate that VLPs outperform direct and structured prompting, especially on tasks requiring complex logical reasoning.

Method Walkthrough

This walkthrough follows the full VLP pipeline on a birthday recognition task. First, a VLM is prompted with the task images to identify the relevant symbols (here objects, properties, and actions). These symbols are compiled into a typed DSL, and a search over the programs of that DSL then finds the rule that best fits the task.

Method at a glance

Symbol grounding: VLM proposes objects, properties, and actions for the task.

Vision-language DSL: Neural perception functions and symbolic operators form a probabilistic grammar.

Program search: Heap search finds the most accurate, most probable executable rule.
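The three steps above can be illustrated with a toy sketch. It is not the authors' implementation: the symbol sets stand in for VLM output, the `atom_prob` weights are a made-up stand-in for the probabilistic grammar, and programs are restricted to conjunctions of `exists_object`-style checks. For brevity, all candidates are pushed onto the heap up front, whereas a real heap search would generate programs lazily from the grammar.

```python
import heapq
import math
from itertools import combinations

# Toy stand-in for VLM symbol grounding: each image is a set of detected symbols.
positives = [{"cake", "candle", "person"}, {"cake", "balloon", "candle"}]
negatives = [{"cake", "person"}, {"balloon", "dog"}]

# Hypothetical grammar weights: probability of the atom "symbol exists in image".
atom_prob = {"cake": 0.4, "candle": 0.3, "balloon": 0.2, "person": 0.1}


def run(program, image):
    """A program is a tuple of symbols joined by logical AND."""
    return all(symbol in image for symbol in program)


def accuracy(program):
    """Fraction of few-shot examples the program labels correctly."""
    hits = sum(run(program, img) for img in positives)
    hits += sum(not run(program, img) for img in negatives)
    return hits / (len(positives) + len(negatives))


def search(max_atoms=2):
    # Heap entries are (negative log-probability, program), so the most
    # probable program is always popped first.
    heap = []
    for k in range(1, max_atoms + 1):
        for program in combinations(sorted(atom_prob), k):
            cost = -sum(math.log(atom_prob[s]) for s in program)
            heapq.heappush(heap, (cost, program))

    best, best_acc = None, -1.0
    while heap:
        _, program = heapq.heappop(heap)
        acc = accuracy(program)
        # Strict comparison: on ties, the earlier (more probable) program wins.
        if acc > best_acc:
            best, best_acc = program, acc
    return best, best_acc


best_program, best_accuracy = search()
print(best_program, best_accuracy)
```

In this toy setup both `("candle",)` and `("cake", "candle")` separate the positives from the negatives perfectly; the search keeps the single-atom rule because it is the more probable of the two under the assumed weights.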

Key findings

Up to +13.5% balanced accuracy over raw VLM prompting.

Small VLMs benefit the most, retaining performance on out-of-distribution synthetic scenes.

Structured programs stay consistent even when dataset labels are noisy.

DSL edits enable knowledge insertion and shortcut mitigation.
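For reference, the balanced accuracy quoted in the findings above is commonly defined, for a binary task, as the mean of the true-positive rate and the true-negative rate. The helper below is a minimal sketch of that standard definition; it assumes both classes are present and is not taken from the authors' evaluation code.

```python
def balanced_accuracy(labels, predictions):
    """Mean of per-class recall for binary labels (truthy = positive)."""
    pos = [p for y, p in zip(labels, predictions) if y]
    neg = [p for y, p in zip(labels, predictions) if not y]
    true_positive_rate = sum(bool(p) for p in pos) / len(pos)
    true_negative_rate = sum(not p for p in neg) / len(neg)
    return (true_positive_rate + true_negative_rate) / 2


# Four positives (three caught) and two negatives (one caught).
score = balanced_accuracy([1, 1, 1, 1, 0, 0], [1, 1, 1, 0, 0, 1])
print(score)
```

Unlike plain accuracy, this measure does not reward a rule for simply predicting the majority class.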

Program Execution

Once the program p* is found, it is executed on each query image. The VLM perception function get_objects grounds objects in the image; the symbolic functions exists_object and exists_obj_with_property check for typed symbols; the logical and combines the results into a final label.
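The execution step can be sketched as follows. The function names `get_objects`, `exists_object`, and `exists_obj_with_property` come from the description above, but their signatures, the `Obj` record, and the canned detections are assumptions made for illustration; in the actual system `get_objects` would query a VLM rather than a lookup table.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Obj:
    """Hypothetical typed symbol: an object name plus its properties."""
    name: str
    properties: frozenset = field(default_factory=frozenset)


# Stand-in for VLM perception: canned detections keyed by image id.
FAKE_DETECTIONS = {
    "img_a": [Obj("cake", frozenset({"lit"})), Obj("person")],
    "img_b": [Obj("cake"), Obj("person")],
}


def get_objects(image):
    """Perception function (assumed signature): image -> list of objects."""
    return FAKE_DETECTIONS[image]


def exists_object(objects, name):
    """Symbolic check: is there any object with this name?"""
    return any(o.name == name for o in objects)


def exists_obj_with_property(objects, name, prop):
    """Symbolic check: does some object with this name carry the property?"""
    return any(o.name == name and prop in o.properties for o in objects)


def program(image):
    """Example rule: a person is present AND a lit cake is present."""
    objects = get_objects(image)
    return exists_object(objects, "person") and exists_obj_with_property(
        objects, "cake", "lit"
    )


for image_id in ("img_a", "img_b"):
    print(image_id, program(image_id))
```

Because the rule is ordinary code, each intermediate result (the detected objects and the outcome of each check) can be inspected when a label looks wrong.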

BibTeX

@inproceedings{wuest2026vlp,
  title={Synthesizing Visual Concepts as Vision-Language Programs},
  author={W{\"u}st, Antonia and Stammer, Wolfgang and Shindo, Hikaru and Dhami, Devendra Singh and Helff, Lukas and Kersting, Kristian},
  booktitle={IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2026}
}