Molmo - Leading Multimodal AI
Molmo is a groundbreaking family of open, state-of-the-art multimodal AI models. Our top model rivals proprietary systems across both academic benchmarks and human evaluations, while our smaller models outperform competitors up to 10 times their size.
Although current multimodal models can interpret diverse data and express it in natural language, Molmo goes further. By learning to point at objects it perceives, Molmo facilitates richer interactions with the physical and virtual world, unlocking the potential for next-gen applications that can both act and interact within their environments.
Open, Cutting-Edge, and Actionable
Most advanced multimodal models today are proprietary, and efforts to create open models have lagged behind. Often, the best-performing open models rely heavily on synthetic data derived from proprietary systems. Consequently, the AI community lacks critical foundational knowledge to develop high-performing vision-language models (VLMs) from scratch.
Molmo fills this gap. Our VLM pipeline—spanning weights, code, data, and evaluations—is entirely open, starting with a pre-trained vision encoder (CLIP) and language-only LLMs, free from reliance on proprietary models. A core innovation is our highly detailed image caption dataset, built entirely from speech-based human descriptions, and a diverse fine-tuning dataset that includes 2D pointing data, enabling Molmo to respond not only with language but also with gestures. This opens up new opportunities for VLMs to interact with both virtual and physical worlds.
The best model in the Molmo family outshines other open models and compares favorably with leading proprietary systems like GPT-4V, Claude 3.5, and Gemini 1.5. Starting today, we are releasing select model weights, inference code, and a public demo of Molmo-7B-D, with full model weights and data coming soon.
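For readers who want to try the released checkpoint programmatically, the sketch below shows one plausible way to load Molmo-7B-D with Hugging Face transformers. The repository id and loading options are assumptions for illustration; check the official release for the exact identifiers and usage.

```python
# Sketch: loading the Molmo-7B-D checkpoint with Hugging Face transformers.
# The repository id below is an assumption; consult the official release for
# the exact model id and processing API.
from transformers import AutoModelForCausalLM, AutoProcessor

MODEL_ID = "allenai/Molmo-7B-D-0924"  # assumed repo id

# The release ships custom modeling code, so trust_remote_code is required.
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, trust_remote_code=True, device_map="auto"
)
```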
PixMo: Quality Over Quantity in Data
While many VLMs are trained on billions of noisy web-sourced image-text pairs, leading to hallucinations in model output, Molmo takes a different path by emphasizing data quality over quantity. Our models are trained on fewer than 1 million high-quality image-text pairs, allowing them to perform better with significantly less data.
The cornerstone of Molmo's success is its training data, PixMo, which consists of two main types: (1) dense captioning data for multimodal pre-training and (2) supervised fine-tuning data for tasks like question answering, document reading, and pointing. Unlike other approaches that distill existing VLMs, we focus on building from scratch, collecting dense captions from human annotators using speech, a method that yields richer and more detailed descriptions than written annotations. This process generated detailed audio descriptions for 712,000 images across 50 key topics.
Our fine-tuning data includes both academic datasets and newly collected data, designed to enhance capabilities like answering open-ended questions, improving OCR tasks, and pointing to specific elements in images. This pointing ability is crucial for future interactions between VLMs and agents, such as robots identifying objects or web agents locating UI elements.
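To make the pointing behavior concrete, a model trained on 2D pointing data can emit coordinates inline with its text, which an agent can then parse and act on. The XML-like `<point>` markup and the `parse_points` helper below are illustrative assumptions, not the exact output format.

```python
# Sketch: extracting 2D points from a pointing-style response.
# The <point x="..." y="..."> markup is an assumed format for illustration;
# the model's actual output syntax may differ.
import re

def parse_points(text: str) -> list[tuple[float, float]]:
    """Return (x, y) pairs found in markup like <point x="12.3" y="45.6">mug</point>."""
    pattern = r'<point\s+x="([\d.]+)"\s+y="([\d.]+)"'
    return [(float(x), float(y)) for x, y in re.findall(pattern, text)]

response = 'The mug is on the shelf: <point x="12.3" y="45.6">mug</point>'
print(parse_points(response))  # [(12.3, 45.6)]
```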
Benchmarking and Human Evaluations
Vision-language model evaluation is evolving, and academic benchmarks only provide part of the picture. To complement these, we also conduct large-scale human preference evaluations. Our results draw from 325,231 pairwise comparisons across 27 models, the largest human preference study for multimodal models to date.
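As a rough illustration of how pairwise preferences can be aggregated into a model ranking, the Elo-style update below is one common approach; it is only a sketch and not necessarily the scoring methodology used in this study.

```python
# Sketch: turning pairwise human preferences into a ranking with Elo-style updates.
# A generic illustration of aggregating comparisons, not the study's exact method.
from collections import defaultdict

def elo_ratings(comparisons, k=32, base=1000.0):
    """comparisons: iterable of (winner, loser) model-name pairs."""
    ratings = defaultdict(lambda: base)
    for winner, loser in comparisons:
        expected_win = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400))
        ratings[winner] += k * (1.0 - expected_win)
        ratings[loser] -= k * (1.0 - expected_win)
    return dict(ratings)

prefs = [("Molmo-72B", "ModelA"), ("ModelB", "ModelA"), ("Molmo-72B", "ModelB")]
print(sorted(elo_ratings(prefs).items(), key=lambda kv: -kv[1]))
```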
Our findings show that Molmo performs competitively across both academic benchmarks and human evaluations. In particular:
- MolmoE-1B, a highly efficient model, nearly matches GPT-4V on both benchmarks and user preferences.
- Molmo-7B comfortably sits between GPT-4V and GPT-4o, outperforming Pixtral 12B.
- Our top model, Molmo-72B, achieves the highest academic scores and ranks second in human evaluations, just behind GPT-4o.
Model Architecture
Molmo’s architecture follows a simple yet powerful framework: a pre-trained vision encoder combined with a language model. It consists of four parts:
- Pre-processor: Converts input images into multiscale, multi-crop versions.
- ViT Image Encoder: Maps the images into vision tokens.
- Connector: Projects vision tokens into the language model’s input dimension.
- Decoder-only Transformer LLM: Generates the final output.
Our models use OpenAI's ViT-L/14 CLIP model as the vision encoder and various LLMs at different scales, such as OLMo-7B-1024 and OLMoE-1B-7B-0924, depending on the model size. The training process involves two stages: multimodal pre-training for caption generation and supervised fine-tuning, with all parameters updated throughout.
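The minimal sketch below mirrors the four-part layout described above: an image encoder, a linear connector into the language model's embedding space, and a decoder that generates text conditioned on the vision tokens. Dimensions, pooling, and module choices are illustrative assumptions, not the actual Molmo implementation.

```python
# Minimal sketch of the four-part layout described above.
# Dimensions and module choices are illustrative assumptions, not the actual
# implementation (which handles multi-crop inputs, a CLIP ViT-L/14 encoder,
# and OLMo/OLMoE language models).
import torch
import torch.nn as nn

class VLMSketch(nn.Module):
    def __init__(self, vit_dim=1024, llm_dim=2048, vocab_size=32000):
        super().__init__()
        self.image_encoder = nn.Identity()            # stand-in for the ViT image encoder
        self.connector = nn.Linear(vit_dim, llm_dim)  # project vision tokens to LLM width
        self.decoder = nn.TransformerDecoderLayer(llm_dim, nhead=16, batch_first=True)
        self.lm_head = nn.Linear(llm_dim, vocab_size)

    def forward(self, image_patches, text_embeds):
        vision_tokens = self.connector(self.image_encoder(image_patches))
        # Prepend vision tokens to the text embeddings and decode.
        sequence = torch.cat([vision_tokens, text_embeds], dim=1)
        hidden = self.decoder(sequence, sequence)
        return self.lm_head(hidden)

model = VLMSketch()
logits = model(torch.randn(1, 576, 1024), torch.randn(1, 16, 2048))
print(logits.shape)  # torch.Size([1, 592, 32000])
```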
By innovating in both data collection and model architecture, Molmo pushes the boundaries of what open AI models can achieve, offering a robust, high-performing alternative to proprietary systems. For the full details, see the official announcement, where you can try the public demo and find additional material, including the technical report.