Nexa AI's OmniVision: A Breakthrough in Multimodal AI for Edge Devices
Nexa AI's OmniVision is a compact multimodal model that processes both visual and text inputs and is optimized for edge devices, with notable gains in art analysis, scene comprehension, and more.
Nexa AI's OmniVision represents a significant step forward in on-device multimodal AI, offering a sub-billion-parameter model that processes both visual and textual inputs. With its recent upgrade to OmniVision-968M, the model has improved capabilities in art analysis, scene comprehension, style recognition, color perception, and world knowledge. This article delves into the architecture, training methodology, and performance benchmarks of OmniVision, highlighting its potential applications in edge computing environments.
OmniVision is designed to operate efficiently on edge devices, making it suitable for applications where computational resources are limited. In FP16, the model requires only 988 MB of RAM and 948 MB of storage. Its architecture consists of three key components (a minimal sketch of the data flow follows the list):
- Vision Encoder: This component transforms input images into embeddings that can be processed further.
- Projection Layer: It aligns the image embeddings with the token space of the Qwen2.5-0.5B-Instruct model, enabling effective visual-language understanding.
- Language Model: The Qwen2.5-0.5B-Instruct backbone generates responses conditioned on both the projected image tokens and the text prompt, enabling contextual understanding across modalities.
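To make that flow concrete, here is a minimal sketch of how an image travels through the three components. The dimensions are illustrative assumptions (SigLIP-style 1152-dimensional patch features and an 896-dimensional LLM embedding space); only the component roles and the 729-token count discussed later come from the article.

```python
import torch
import torch.nn as nn

def vision_encoder(image: torch.Tensor) -> torch.Tensor:
    # Placeholder for the real encoder: each image becomes a 27x27
    # grid of patch embeddings, i.e. the 729 tokens discussed below.
    return torch.randn(image.shape[0], 729, 1152)

# Projection layer: maps patch features into the LLM's token space.
projection = nn.Linear(1152, 896)

image = torch.randn(1, 3, 384, 384)                      # dummy RGB input
img_tokens = projection(vision_encoder(image))           # [1, 729, 896]
text_tokens = torch.randn(1, 12, 896)                    # embedded prompt (placeholder)
llm_input = torch.cat([img_tokens, text_tokens], dim=1)  # LLM attends to both
print(llm_input.shape)                                   # torch.Size([1, 741, 896])
```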
The development of OmniVision involved a three-stage training pipeline:
- Stage One: Establishing basic visual-linguistic alignment using image-caption pairs while unfreezing only the projection-layer parameters (see the freezing sketch after this list).
- Stage Two: Enhancing contextual understanding through image-based question-answering datasets, allowing the model to generate contextually appropriate responses.
- Stage Three: Implementing Direct Preference Optimization (DPO) by generating responses to images and refining them through a teacher model that produces minimally edited corrections.
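The sketch below illustrates the stage-one freezing strategy on a toy model: everything is frozen, then only the projection layer is made trainable. The module names (`vision_encoder`, `projector`, `language_model`) are hypothetical stand-ins, not Nexa AI's actual code.

```python
import torch.nn as nn

class OmniVisionLike(nn.Module):
    """Toy stand-in for the three components; names are illustrative."""
    def __init__(self):
        super().__init__()
        self.vision_encoder = nn.Linear(1152, 1152)  # placeholder encoder
        self.projector = nn.Linear(1152, 896)        # projection layer
        self.language_model = nn.Linear(896, 896)    # placeholder LLM

model = OmniVisionLike()

# Stage one: freeze everything, then unfreeze only the projection layer.
for p in model.parameters():
    p.requires_grad = False
for p in model.projector.parameters():
    p.requires_grad = True

print([n for n, p in model.named_parameters() if p.requires_grad])
# ['projector.weight', 'projector.bias']
```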
A notable challenge in deploying multimodal models on edge devices is the computational overhead of processing image tokens. Traditional architectures like LLaVA generate hundreds of tokens per image, driving up latency and cost. To address this, OmniVision reshapes the image embeddings during the projection stage, compressing them from `[batch_size, 729, hidden_size]` to `[batch_size, 81, hidden_size*9]`. This 9x reduction in token count significantly improves latency while maintaining output quality.
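A minimal sketch of that reshaping, assuming the 729 tokens form a 27x27 grid and each 3x3 neighborhood is folded into one token; the article only specifies the input and output shapes, so the exact grouping order is an assumption:

```python
import torch

def compress_image_tokens(x: torch.Tensor) -> torch.Tensor:
    """Reshape [B, 729, H] image embeddings into [B, 81, H*9]."""
    b, n, h = x.shape
    g = int(n ** 0.5)                        # 27: side of the square token grid
    x = x.view(b, g // 3, 3, g // 3, 3, h)   # carve the grid into 3x3 blocks
    x = x.permute(0, 1, 3, 2, 4, 5)          # group each block's 9 tokens together
    return x.reshape(b, (g // 3) ** 2, 9 * h)

emb = torch.randn(2, 729, 1152)              # hidden size is illustrative
print(compress_image_tokens(emb).shape)      # torch.Size([2, 81, 10368])
```

Because the nine grouped features land in the channel dimension, the layers downstream see 81 wider tokens instead of 729 narrow ones, which is where the latency savings come from.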
The DPO approach used in OmniVision focuses on generating minimal-edit pairs for training. By ensuring that the teacher model makes small adjustments to the base model's outputs while preserving their structure, this technique enhances output quality without disrupting core capabilities. This method allows for precise improvements in accuracy-critical elements of the model's responses.
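For reference, a standard DPO objective over such minimal-edit pairs might look like the sketch below, treating the teacher's corrected response as "chosen" and the base model's original output as "rejected". The pairing convention and `beta` value are assumptions; the article gives no hyperparameters.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO objective: push the policy to prefer the teacher's
    minimally edited correction over the base model's original response."""
    chosen_margin = policy_chosen_logps - ref_chosen_logps
    rejected_margin = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

# Toy call with per-sequence log-probabilities for a batch of 4 pairs
logps = [torch.randn(4) for _ in range(4)]
print(dpo_loss(*logps))
```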
OmniVision has been evaluated on benchmark datasets including MM-VET, ChartQA, MMMU, ScienceQA, and POPE. The results show it consistently outperforming the comparably compact nanoLLAVA, while the much larger Qwen2-VL-2B remains stronger on some tasks:
| Benchmark | OmniVision | nanoLLAVA | Qwen2-VL-2B |
|---|---|---|---|
| MM-VET | 27.5 | 23.9 | 49.5 |
| ChartQA (Test) | 59.2 | N/A | 73.5 |
| MMMU (Test) | 41.8 | 28.6 | 41.1 |
| ScienceQA (Eval) | 62.2 | 59.0 | N/A |
| POPE | 89.4 | 84.1 | N/A |
Nexa AI is committed to further developing OmniVision into a fully optimized solution for edge AI multimodal applications. While the current version demonstrates impressive capabilities, ongoing improvements aim to address existing limitations and enhance overall performance.
The launch of OmniVision marks a significant advancement in multimodal AI tailored for edge devices. With its efficient architecture and training methodology, OmniVision stands out as a capable tool for processing visual and textual data on constrained hardware. As Nexa AI continues to refine the model, it holds promise for applications across sectors where efficient on-device processing is crucial. Full details are available in Nexa AI's report.