NVIDIA and Mistral AI's Mistral-NeMo-Minitron 8B Model: A Leap Forward in LLM Efficiency
NVIDIA and Mistral AI have introduced the Mistral-NeMo-Minitron 8B model, an advanced large language model (LLM) that delivers exceptional accuracy across nine popular benchmarks. This model is a pruned and distilled version of the Mistral NeMo 12B, maintaining high performance while being significantly more efficient.
Pruning and Distillation: Key Techniques
The Mistral-NeMo-Minitron 8B model is created using NVIDIA’s proven approach of model pruning and knowledge distillation. Pruning reduces the model size by removing less critical parts—specifically, by focusing on width pruning, which targets neurons, attention heads, and embedding channels. This approach, when combined with light retraining through knowledge distillation, yields a smaller, faster, and more resource-efficient model without compromising much on predictive power.
| Model | Training tokens | Winogrande 5-shot | ARC Challenge 25-shot | MMLU 5-shot | HellaSwag 10-shot | GSM8K 5-shot | TruthfulQA 0-shot | XLSum en (20%) 3-shot | MBPP 0-shot | HumanEval 0-shot |
|---|---|---|---|---|---|---|---|---|---|---|
| Llama 3.1 8B | 15T | 77.27 | 57.94 | 65.28 | 81.80 | 48.60 | 45.06 | 30.05 | 42.27 | 24.76 |
| Gemma 7B | 6T | 78 | 61 | 64 | 82 | 50 | 45 | 17 | 39 | 32 |
| Mistral-NeMo-Minitron 8B | 380B | 80.35 | 64.42 | 69.51 | 83.03 | 58.45 | 47.56 | 31.94 | 43.77 | 36.22 |
| Mistral NeMo 12B | N/A | 82.24 | 65.10 | 68.99 | 85.16 | 56.41 | 49.79 | 33.43 | 42.63 | 23.78 |
Model Pruning slims down a model either by dropping entire layers (depth pruning) or by shrinking internal components such as neurons, attention heads, and embedding channels (width pruning). In this case width pruning was employed: the MLP intermediate dimension and the hidden size were reduced, while the number of attention heads and layers was retained. Width pruning was chosen because it consistently outperformed depth pruning in NVIDIA's experiments.
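To make width pruning concrete, below is a minimal sketch of activation-based importance scoring on a toy MLP block. The layer layout, the scoring rule (mean activation magnitude on a calibration batch), and the keep ratio are illustrative assumptions, not NVIDIA's exact recipe.

```python
# Minimal width-pruning sketch for a toy MLP block (illustrative only).
import torch
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, hidden_size, intermediate_size):
        super().__init__()
        self.up = nn.Linear(hidden_size, intermediate_size)
        self.down = nn.Linear(intermediate_size, hidden_size)

    def forward(self, x):
        return self.down(torch.relu(self.up(x)))

def prune_mlp_width(mlp, calib_batch, keep_ratio=0.67):
    """Rank intermediate neurons by mean activation magnitude on a
    calibration batch and keep only the top fraction (assumed scoring rule)."""
    with torch.no_grad():
        acts = torch.relu(mlp.up(calib_batch))          # (tokens, intermediate)
        importance = acts.mean(dim=0)                   # one score per neuron
    keep = int(keep_ratio * importance.numel())
    idx = importance.topk(keep).indices.sort().values   # keep original neuron order

    pruned = MLP(mlp.up.in_features, keep)
    pruned.up.weight.data = mlp.up.weight.data[idx]
    pruned.up.bias.data = mlp.up.bias.data[idx]
    pruned.down.weight.data = mlp.down.weight.data[:, idx]
    pruned.down.bias.data = mlp.down.bias.data.clone()
    return pruned

# Shrink the intermediate dimension while the hidden size stays fixed.
mlp = MLP(hidden_size=512, intermediate_size=2048)
calib = torch.randn(1024, 512)                           # stand-in calibration data
smaller = prune_mlp_width(mlp, calib)
print(smaller.up.out_features)                           # roughly 2/3 of 2048
```

The same scoring idea extends to embedding channels (pruning the hidden size jointly across all layers); only the per-neuron case is shown here to keep the sketch short.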
Knowledge Distillation transfers knowledge from a larger "teacher" model (here, the Mistral NeMo 12B) to the smaller "student" model (Mistral-NeMo-Minitron 8B). The pruned student is retrained on a comparatively small number of tokens, recovering most of the teacher's accuracy while remaining faster and cheaper to run.
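The distillation objective can be sketched as a blend of the standard next-token cross-entropy with a KL-divergence term that pulls the student's output distribution toward the teacher's. The temperature, loss weighting, and tensor shapes below are illustrative assumptions, not the exact Minitron training configuration.

```python
# Sketch of logit-level knowledge distillation (illustrative hyperparameters).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Blend next-token cross-entropy with a KL term toward the teacher."""
    t = temperature
    kl = F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits / t, dim=-1),
        reduction="batchmean",
    ) * (t * t)
    ce = F.cross_entropy(student_logits.view(-1, student_logits.size(-1)),
                         labels.view(-1))
    return alpha * kl + (1.0 - alpha) * ce

# One illustrative step: the teacher is frozen, only the pruned student updates.
vocab, batch, seq = 32000, 2, 16
student_logits = torch.randn(batch, seq, vocab, requires_grad=True)
with torch.no_grad():
    teacher_logits = torch.randn(batch, seq, vocab)      # stand-in teacher outputs
labels = torch.randint(0, vocab, (batch, seq))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```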
Process and Results
- Teacher Fine-Tuning: The unpruned 12B model was fine-tuned with 127 billion tokens to correct distribution shifts, ensuring optimal performance during distillation.
- Width Pruning: Importance scores were computed for the model's internal components, and the MLP intermediate dimension and hidden size were compressed accordingly, while the number of attention heads and layers was kept unchanged.
- Distillation: The pruned model was retrained against the teacher using a carefully controlled training process, keeping the loss of accuracy minimal; a high-level sketch of all three stages follows this list.
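Putting the three stages together, the overall recipe looks roughly like the sketch below. The helper functions are hypothetical placeholders standing in for real training code (they are not NeMo APIs); only the ordering and token budgets mirror the process described above.

```python
# High-level outline of the prune-and-distill recipe (placeholder helpers).
def finetune(model, tokens):
    """Stage 1: lightly fine-tune the unpruned 12B teacher on ~127B tokens."""
    return model  # placeholder: standard next-token training goes here

def width_prune(model, target_size):
    """Stage 2: trim the MLP intermediate dimension and hidden size by
    importance, keeping attention heads and layer count unchanged."""
    return model  # placeholder: importance-based width pruning goes here

def distill(student, teacher, tokens):
    """Stage 3: retrain the pruned student against the teacher's outputs."""
    return student  # placeholder: minimize a distillation loss as sketched above

def build_minitron(teacher_12b, finetune_tokens, distill_tokens):
    teacher = finetune(teacher_12b, finetune_tokens)      # ~127B tokens
    student = width_prune(teacher, target_size="8B")
    return distill(student, teacher, distill_tokens)      # ~380B tokens
```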
The result is a model that not only rivals but often surpasses similar-sized models in accuracy, while also being significantly more resource-efficient.
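For readers who want to try the model, the quickest route is loading the released base checkpoint with the Hugging Face transformers library. The repository name below is assumed to be nvidia/Mistral-NeMo-Minitron-8B-Base, and device_map="auto" additionally requires the accelerate package; adjust both to your setup.

```python
# Quick inference sketch; checkpoint name and dtype handling are assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nvidia/Mistral-NeMo-Minitron-8B-Base"    # assumed Hugging Face repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"  # needs the accelerate package
)

inputs = tokenizer("Knowledge distillation is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```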
Conclusion and Future Directions
The Mistral-NeMo-Minitron 8B demonstrates the effectiveness of structured weight pruning combined with knowledge distillation, offering a scalable approach to building efficient and high-performing LLMs. NVIDIA plans to continue refining these techniques, with future efforts aimed at creating even smaller, more accurate models. These innovations will be gradually integrated into the NVIDIA NeMo framework, further advancing the field of generative AI.
This advancement highlights NVIDIA’s commitment to pushing the boundaries of AI model efficiency, making powerful AI more accessible and cost-effective.
To read the full report on the official NVIDIA Developer blog, click here.