SmolDocling: An Ultra-Compact VLM for Document Understanding

Featuring DocTags for document markup!

Mar 25, 2025

SmolDocling looks like a significant advance in compact document understanding models. This 256M-parameter vision-language model is designed for efficient document processing while maintaining high performance across a range of document understanding tasks. Developed by researchers from IBM Research and Hugging Face, it bridges the gap between large, resource-intensive models and more specialized ensemble approaches. I also like the name because it sounds like “Smol Duckling,” and it’s nice to get a cute model name every once in a while.

Ahem…

Architecture and Design

Figure 1: SmolDocling/SmolVLM architecture. SmolDocling converts images of document pages to DocTags sequences. First, input images are encoded using a vision encoder and reshaped via projection and pooling. Then, the projected embeddings are concatenated with the text embeddings of the user prompt, possibly with interleaving. Finally, the sequence is used by an LLM to autoregressively predict the DocTags sequence.
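To make the data flow in the Figure 1 caption concrete, here is a minimal Python sketch of its three stages: counting visual tokens after pooling, concatenating them with the prompt, and decoding autoregressively. The 512-pixel input and 16-pixel patch sizes match the SigLIP patch-16/512 backbone this post names. Everything else is an assumption for illustration: the pooling factor of 4, the function names, the prompt text, the stub model, and the placeholder tokens, which are not the real DocTags vocabulary or SmolDocling's actual code.

```python
from typing import Callable, List


def visual_token_count(image_size: int = 512, patch_size: int = 16,
                       pool_factor: int = 4) -> int:
    """Tokens left after patching the image and pooling each spatial axis.

    pool_factor is an assumed value, not a figure taken from the post.
    """
    patches_per_side = image_size // patch_size        # 512 // 16 = 32
    pooled_per_side = patches_per_side // pool_factor  # 32 // 4 = 8
    return pooled_per_side ** 2


def build_input(visual: List[str], prompt: List[str]) -> List[str]:
    """Concatenate visual tokens with prompt tokens (no interleaving here)."""
    return visual + prompt


def greedy_decode(prefix: List[str],
                  next_token: Callable[[List[str]], str],
                  eos: str = "<eos>",
                  max_new_tokens: int = 32) -> List[str]:
    """Generate one token at a time, feeding each output back as input."""
    generated: List[str] = []
    for _ in range(max_new_tokens):
        token = next_token(prefix + generated)
        if token == eos:
            break
        generated.append(token)
    return generated


def make_stub_model(prefix_len: int,
                    script: List[str]) -> Callable[[List[str]], str]:
    """Hypothetical stand-in for the LLM: replays a fixed token script."""
    def next_token(sequence: List[str]) -> str:
        return script[len(sequence) - prefix_len]
    return next_token


n_visual = visual_token_count()
visual = [f"<img_{i}>" for i in range(n_visual)]   # placeholder visual tokens
prompt = ["Convert", "this", "page"]               # hypothetical prompt
prefix = build_input(visual, prompt)

# Placeholder output tokens, not real DocTags.
model = make_stub_model(len(prefix),
                        ["<tag_open>", "hello", "<tag_close>", "<eos>"])
output = greedy_decode(prefix, model)
print(n_visual, len(prefix), output)
```

In the real model the "tokens" are embedding vectors and the next-token function is the language backbone; the sketch only shows the order of operations the caption describes.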

SmolDocling is built on the SmolVLM architecture, specifically the SmolVLM-256M variant. It pairs a SigLIP base patch-16/512 visual backbone (93M parameters) with a lightweight language backbone from the SmolLM-2 family (135M parameters). This makes it 5 to 10 times smaller in parameter count than comparable vision-language models, and up to 27 times smaller than some of the models it outperforms.
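A quick back-of-the-envelope check shows what those ratios imply. The short Python sketch below uses only the rounded figures quoted above (93M, 135M, 256M, and the 5x, 10x, and 27x ratios); the variable and function names are my own, and the post does not say how the parameters outside the two backbones are allocated, so the remainder is left unattributed.

```python
# All figures are the rounded values quoted in the text, in millions.
VISION_PARAMS_M = 93      # SigLIP visual backbone
LANGUAGE_PARAMS_M = 135   # SmolLM-2 language backbone
TOTAL_PARAMS_M = 256      # headline model size

backbone_params_m = VISION_PARAMS_M + LANGUAGE_PARAMS_M
# Parameters not accounted for by the two quoted backbone sizes.
# The text does not break this remainder down, so neither does this sketch.
remainder_m = TOTAL_PARAMS_M - backbone_params_m


def implied_size_m(ratio: int) -> int:
    """Size (in millions) of a model that is `ratio` times larger."""
    return TOTAL_PARAMS_M * ratio


print(f"backbones: {backbone_params_m}M, remainder: {remainder_m}M")
for ratio in (5, 10, 27):
    print(f"{ratio}x larger -> about {implied_size_m(ratio) / 1000:.1f}B parameters")
```

So "5 to 10 times smaller" puts the comparable models at roughly 1.3B to 2.6B parameters, and "27 times smaller" points to a model of roughly 7B parameters.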
