
VaPR: Vision-language Preference alignment for Reasoning

Accepted at COLM 2025 🎉

1Department of Computer Science, University of California Los Angeles  
2Amazon.com Inc.  
*Corresponding Author  
arXiv · Code · 🤗 Dataset · 🤖 Models

Abstract

Preference finetuning methods like Direct Preference Optimization (DPO) with AI-generated feedback have shown promise in aligning Large Vision-Language Models (LVLMs) with human preferences. However, existing techniques overlook the prevalence of noise in synthetic preference annotations in the form of stylistic and length biases. To this end, we introduce a hard-negative response generation framework based on LLM-guided response editing, that produces rejected responses with targeted errors, maintaining stylistic and length similarity to the chosen ones. Using this framework, we develop the VaPR dataset, comprising 30K high-quality samples, to finetune three LVLM families: LLaVA-V1.5, Qwen2VL & Qwen2.5VL (2B-13B sizes). Our VaPR models deliver significant performance improvements across ten benchmarks, achieving average gains of 6.5% (LLaVA), 4.0% (Qwen2VL), and 1.5% (Qwen2.5VL), with notable improvements on reasoning tasks. A scaling analysis shows that performance consistently improves with data size, with LLaVA models benefiting even at smaller scales. Moreover, VaPR reduces the tendency to answer "Yes" in binary questions - addressing a common failure mode in LVLMs like LLaVA. Lastly, we show that the framework generalizes to open-source LLMs as editors, with models trained on VaPR-OS achieving ~99% of the performance of models trained on VaPR, which is synthesized using GPT-4o.

VaPR Framework


Motivation

Existing preference optimization datasets for Large Vision-Language Models (LVLMs) have been shown to contain noise in the form of stylistic and length biases. To address this, we introduce a novel hard-negative response generation framework that uses LLM-guided response editing to create rejected responses with targeted errors while maintaining stylistic and length similarity to the chosen ones. Leveraging this framework, we develop the VaPR dataset, which consists of 30K high-quality preference samples in which chosen responses are paired with hard-negative rejected responses.


Data Generation and Statistics

A three-stage pipeline generates 30K hard-negative preference pairs from the LLaVA-v1.5-665K dataset. Stage 1: Filter out irrelevant samples (e.g., MCQs). Stage 2: Categorize the remaining samples by task. Stage 3: Use task-specific prompts (with optional penalty lists) to produce negative responses that are stylistically and length-wise similar to the chosen ones but distinct in content.
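The sketch below illustrates Stage 3 under simple assumptions: an OpenAI-style chat API with GPT-4o as the editor and an illustrative editing prompt. The prompt wording, the error taxonomy, and the `make_hard_negative` helper are ours for illustration only; they are not the exact task-specific prompts or penalty lists used to build VaPR.

```python
# Minimal sketch of Stage 3 (LLM-guided response editing).
# Assumes an OpenAI-style chat API; prompt and helper names are illustrative.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

EDIT_PROMPT = """You are given a question about an image and a correct answer.
Rewrite the answer so that it contains a targeted {error_type} error, while
keeping the length, tone, and sentence structure as close as possible to the
original. Return only the edited answer.

Question: {question}
Correct answer: {chosen}"""

def make_hard_negative(question: str, chosen: str, error_type: str = "counting") -> str:
    """Produce a rejected response that is stylistically similar to `chosen`
    but factually wrong in a targeted way."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user",
                   "content": EDIT_PROMPT.format(error_type=error_type,
                                                 question=question,
                                                 chosen=chosen)}],
        temperature=0.7,
    )
    return resp.choices[0].message.content.strip()

# Hypothetical example:
# make_hard_negative("How many dogs are in the image?",
#                    "There are three dogs playing in the park.")
# -> e.g., "There are four dogs playing in the park."
```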

Data Generation Pipeline

Task distribution of the VaPR dataset

Data Distribution

Comparison of VaPR vs. HA-DPO, POVID, RLAIF-V, and CSR on stylistic similarity (measured via linguistic similarity using LD, i.e., Levenshtein distance) and length balance (token-length deltas). We report the % of cases where the chosen response is longer (chosen > rejected) or shorter (rejected > chosen) across all samples, as well as over samples with short (<100 tokens) and long responses. We also report average token-length deltas for (chosen > rejected) and (rejected > chosen). A lower Levenshtein distance and smaller token deltas indicate higher stylistic and length similarity, respectively. CSR contains only long responses.
VaPR has the lowest Levenshtein distance and token length difference, reinforcing the hard-negative nature of its rejections.
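For concreteness, here is a minimal sketch of the two statistics reported above, assuming word-level tokenization via a simple whitespace split; the paper's exact tokenization and any normalization of the Levenshtein distance may differ.

```python
# Minimal sketch of the stylistic-similarity and length-balance statistics.
# Assumes whitespace tokenization; a model tokenizer could be used instead.

def levenshtein(a: str, b: str) -> int:
    """Word-level edit distance between two responses (standard DP)."""
    a_toks, b_toks = a.split(), b.split()
    prev = list(range(len(b_toks) + 1))
    for i, x in enumerate(a_toks, 1):
        curr = [i]
        for j, y in enumerate(b_toks, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]

def length_delta(chosen: str, rejected: str) -> int:
    """Signed token-length difference (positive => chosen is longer)."""
    return len(chosen.split()) - len(rejected.split())

pair = {"chosen": "There are three dogs playing in the park.",
        "rejected": "There are four dogs playing in the park."}
print(levenshtein(pair["chosen"], pair["rejected"]))   # 1
print(length_delta(pair["chosen"], pair["rejected"]))  # 0
```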

Preference Dataset Comparison

In the illustration below, we show how VaPR-generated hard-negative responses reduce the stylistic and length biases present in existing rejected responses, with examples from fine-grained perception and reasoning tasks such as counting and spatial reasoning. We also include additional examples covering perturbations in other categories.


How Preference Data Affects Direct Preference Optimization (DPO)

To understand how different preference datasets affect DPO training, we analyze two key metrics: reference model log-probabilities and reward accuracy during training. The quality and characteristics of preference pairs significantly impact the stability and effectiveness of the optimization process.
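For reference, both metrics are defined by the standard DPO formulation: the implicit reward of a response is the β-scaled log-probability ratio between the policy and the frozen reference model, and reward accuracy is the fraction of pairs whose chosen response earns a higher implicit reward than its rejected counterpart. Nothing in the equations below is specific to VaPR.

```latex
% Standard DPO objective and implicit reward (general formulation, for reference).
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\!\left[
      \log\sigma\!\left(
        \beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)}
        - \beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}
      \right)\right],
\qquad
\hat{r}_\theta(x,y) = \beta\log\frac{\pi_\theta(y\mid x)}{\pi_{\mathrm{ref}}(y\mid x)}
```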

Reference Log-Probabilities

DPO Reward Accuracy

Reference Log-Probabilities Analysis: The corpus-level gap in reference log-probabilities, Δ = log p_ref(chosen) − log p_ref(rejected), reveals distinct patterns across datasets. SIMA shows nearly identical pairs with Δ ≈ 0, indicating minimal difference between chosen and rejected responses. POVID exhibits very large differences (large |Δ|), suggesting overly dissimilar preference pairs. In contrast, VaPR demonstrates moderate separation, indicative of well-crafted hard negatives that provide meaningful learning signals without being trivially different.

DPO Reward Accuracy During Training: The training dynamics further validate these observations. Models trained on SIMA achieve only ~50% reward accuracy, indicating DPO collapse where the model cannot distinguish between chosen and rejected responses. POVID leads to ~100% accuracy, suggesting reward hacking where the model exploits superficial differences rather than learning meaningful preferences. VaPR shows gradual improvement to ~90% without saturating, indicating better generalization from exposure to more challenging preference pairs.
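Below is a minimal sketch of how reward accuracy can be computed during training, assuming per-sequence log-probabilities have already been summed over tokens for each pair; the field names and β value are illustrative.

```python
# Minimal sketch of the "reward accuracy" metric during DPO training.
beta = 0.1  # KL penalty coefficient (illustrative value)

def reward_accuracy(pairs):
    """pairs: list of dicts with policy/reference log-probs for chosen/rejected.
    Returns the fraction of pairs whose implicit chosen reward exceeds the
    implicit rejected reward."""
    correct = 0
    for p in pairs:
        r_chosen = beta * (p["policy_chosen"] - p["ref_chosen"])
        r_rejected = beta * (p["policy_rejected"] - p["ref_rejected"])
        correct += r_chosen > r_rejected
    return correct / len(pairs)

# Hypothetical log-prob values for two preference pairs:
batch = [
    {"policy_chosen": -42.0, "ref_chosen": -45.0,
     "policy_rejected": -60.0, "ref_rejected": -55.0},
    {"policy_chosen": -30.0, "ref_chosen": -31.0,
     "policy_rejected": -33.0, "ref_rejected": -34.0},
]
print(reward_accuracy(batch))  # 0.5 in this toy example
```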

Key Insight: Very similar preference pairs lead to DPO collapse, while very dissimilar pairs result in reward hacking. VaPR's hard negatives avoid both failure modes, enabling stable and effective preference optimization.

Main Results

Performance comparison of LLaVA-v1.5-Instruct, Qwen2VL-Instruct, Qwen2.5VL-Instruct, and SFT and DPO models finetuned on VaPR and other preference datasets, across 2B, 3B, 7B, and 13B parameter sizes on ten benchmarks. Higher scores indicate better performance on all benchmarks, with the highest score for each benchmark highlighted in bold. All models are evaluated using publicly available checkpoints, following the evaluation parameters prescribed by each benchmark. Marked results show a statistically significant improvement under bootstrap resampling (95% CI).

Main Results

Key Takeaways:

  • VaPR models beat baselines on most benchmarks, with gains of 7% (LLaVA-7B), 6% (LLaVA-13B), 5% (Qwen2VL-2B), 3% (Qwen2VL-7B), 2% (Qwen2.5VL-3B), and 1% (Qwen2.5VL-7B), especially in vision-centric and adversarial tasks, highlighting stronger visio-linguistic compositionality.
  • Improvements extend beyond perception and reasoning to textual and mathematical benchmarks, despite no explicit training, suggesting enhanced fine-grained perception and spatial reasoning.
  • Preference optimization outperforms supervised finetuning (SFT): while SFT can degrade performance due to overfitting on small datasets, DPO with carefully constructed preference pairs yields robust and generalizable improvements.
  • Overall, the gains arise from the preference signal rather than dataset size alone, showing that preference finetuning is more effective than standard SFT for alignment and reasoning.
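As a concrete illustration of the significance test mentioned in the caption above, here is a minimal paired-bootstrap sketch over per-sample correctness scores; the paper's exact resampling procedure may differ.

```python
# Minimal sketch of bootstrap resampling over per-sample scores (assumed 0/1
# correctness), used to check whether an accuracy gain is significant at 95% CI.
import random

def bootstrap_delta_ci(scores_a, scores_b, n_boot=10_000, seed=0):
    """95% percentile CI for mean(scores_b) - mean(scores_a) under paired
    resampling of evaluation samples."""
    rng = random.Random(seed)
    n = len(scores_a)
    deltas = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        deltas.append(sum(scores_b[i] - scores_a[i] for i in idx) / n)
    deltas.sort()
    return deltas[int(0.025 * n_boot)], deltas[int(0.975 * n_boot)]

# The improvement is treated as significant if the CI excludes zero:
# lo, hi = bootstrap_delta_ci(baseline_scores, vapr_scores)
# significant = lo > 0
```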

Data Scaling Analysis

We analyze the effect of dataset size on performance using 3K, 10K, and 30K samples from VaPR. As illustrated in the figure below (the X-axis spacing between 3K, 10K, and 30K is not uniform), all models improve with more data. LLaVA-VaPR shows large gains even at 3K, but with diminishing returns at larger scales. Qwen2VL and Qwen2.5VL improve less at 3K yet benefit more from larger datasets, reflecting their stronger pretrained priors compared to LLaVA-v1.5.

Data Scaling Results

VaPR models do not overconfidently say "Yes" to "Yes/No" questions

Large vision-language models (LVLMs) tend to favor "Yes" over "No" in binary questions. On NaturalBench, an adversarial benchmark, base SFT models like LLaVA-v1.5 and Qwen variants often answer "Yes" even when the correct answer is "No." VaPR models reduce this bias, with LLaVA-VaPR 13B showing the strongest shift toward accurate "No" responses. This improvement reflects enhanced perception, reasoning, and visio-linguistic compositionality. The following figure captures this phenomenon, where dotted rectangles highlight incorrect "Yes" (red) and "No" (yellow) predictions, illustrating reduced "Yes" bias after preference finetuning.
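Below is a minimal sketch of how such a Yes-bias can be quantified, assuming each evaluation record holds a model prediction and a gold answer; the field names are illustrative and not NaturalBench's actual schema.

```python
# Minimal sketch of measuring the "Yes" bias on binary (Yes/No) questions.
# Field names are illustrative.

def yes_bias_stats(records):
    """Fraction of 'Yes' predictions overall, and accuracy on questions whose
    gold answer is 'No' (low accuracy here indicates an over-eager 'Yes')."""
    yes_preds = sum(r["prediction"].strip().lower().startswith("yes") for r in records)
    no_gold = [r for r in records if r["answer"].strip().lower() == "no"]
    no_acc = (sum(r["prediction"].strip().lower().startswith("no") for r in no_gold)
              / max(len(no_gold), 1))
    return yes_preds / len(records), no_acc

records = [  # hypothetical predictions
    {"prediction": "Yes", "answer": "No"},
    {"prediction": "Yes", "answer": "Yes"},
    {"prediction": "No", "answer": "No"},
]
print(yes_bias_stats(records))  # (0.666..., 0.5)
```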

Yes/No Bias Analysis on NaturalBench


VaPR-OS: Ablation Study Using an Open-Source LLM Editor

We conduct an ablation with Qwen3-32B as an open-source editor, creating VaPR-OS from the same 10K subset. VaPR-OS shows hard-negative characteristics comparable to VaPR (token-length gap 6 vs. 3; Levenshtein distance 10 vs. 6), and models trained on it reach ~99% of the performance of their GPT-4o-based counterparts while still outperforming the base instruct models. These results demonstrate that VaPR generalizes effectively to open-source editors, making it accessible for broader research and applications. The following table summarizes these findings for three models: LLaVA-v1.5-7B, Qwen2VL-2B, and Qwen2.5VL-3B.

VaPR-OS Ablation Results

We refer the readers to our paper for more details on experiments, analyses, and ablations.

BibTeX

@inproceedings{wadhawan2025vapr,
  title={Va{PR} - Vision-language Preference alignment for Reasoning},
  author={Rohan Wadhawan and Fabrice Y Harel-Canada and Zi-Yi Dou and Suhaila Shakiah and Robinson Piramuthu and Nanyun Peng},
  booktitle={Second Conference on Language Modeling},
  year={2025},
  url={https://openreview.net/forum?id=uBAubFwymy}
}