Can unified multimodal models align understanding and generation, without *any* captions?

Reconstruction alignment improves unified multimodal models

aimodels-fyi ∙ Sep 24, 2025

Unified multimodal models (UMMs) represent an ambitious goal in AI: creating single architectures that can both understand and generate visual content, much like how large language models have revolutionized text processing. These models aim to inherit the reasoning capabilities of language models while extending them to handle images.

However, UMMs face a fundamental limitation. Current training approaches rely on image-text pairs where captions provide supervision, but these captions are inherently sparse and miss critical visual details. Even captions spanning hundreds of words fail to capture essential elements like spatial layout, geometry, textures, and fine-grained attributes.

*UMMs can often correctly recognize an uncommon concept (yellow broccoli) but fail to generate it, revealing misalignment between understanding and generation.*

This creates a systematic bias. Since captions rarely describe broccoli’s color, models overfit to the rule “broccoli → green” and often fail on prompts like “a yellow broccoli.” The model can recognize yellow broccoli when it sees one but cannot generate it when asked, revealing a misalignment between its understanding and generation capabilities.

Introducing Reconstruction Alignment (RecA)

The researchers introduce Reconstruction Alignment (RecA), a resource-efficient post-training method that addresses the sparse supervision problem through dense visual embeddings. Instead of relying on incomplete text descriptions, RecA leverages embeddings from visual understanding encoders like CLIP and SigLIP, which map pixels into a language-aligned semantic space.

*Dense supervision from visual embeddings. Typical image generation models trained on sparse text captions miss crucial visual details that understanding encoders like CLIP can preserve.*
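To make “dense” concrete, here is a minimal sketch of extracting patch-level features from a CLIP vision encoder with Hugging Face `transformers`. The checkpoint name is a real public one, but the input file and printed shapes are illustrative; this is not code from the paper.

```python
import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModel

# Public CLIP ViT-B/32 checkpoint; any CLIP/SigLIP vision tower works similarly.
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch32")
encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("broccoli.jpg")  # hypothetical input image
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = encoder(**inputs)

# One embedding per image patch plus a [CLS] token: far denser than any caption.
patch_embeddings = outputs.last_hidden_state
print(patch_embeddings.shape)  # (1, 50, 768) for ViT-B/32 at 224x224
```

Where a caption compresses an image into a few dozen tokens of text, this encoder yields a language-aligned embedding for every patch, which is the kind of supervision RecA taps into.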

The key insight is that visual understanding encoders capture semantic structure far more effectively than generation encoders. These semantic embeddings provide dense, semantically grounded supervision without requiring paired captions, raising the central question: can we improve generation capabilities by training models with semantic embeddings as maximally informative “text prompts”?
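The post does not show the training loop before the paywall, but the idea behind that question can be sketched: replace the caption with the image’s own understanding embedding and train the generator to reconstruct the input. Everything below, including the generator’s call signature, the `projector` module, and the plain pixel-space loss, is an assumption for illustration, not RecA’s actual objective.

```python
import torch
import torch.nn.functional as F

def reconstruction_alignment_step(generator, vision_encoder, projector,
                                  images, optimizer):
    """One hypothetical post-training step: the image's dense semantic
    embedding stands in for a maximally informative 'text prompt'."""
    with torch.no_grad():
        cond = vision_encoder(images)       # (B, num_patches, d_vis), frozen encoder
    cond = projector(cond)                  # map into the prompt-embedding space
    recon = generator(condition=cond)       # (B, C, H, W); assumed interface
    loss = F.mse_loss(recon, images)        # simple reconstruction loss for illustration
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

A real system would likely use a stronger generative objective than pixel MSE, but the sketch captures the substitution the method’s name points at: reconstruction, not captions, supplies the alignment signal.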
