
NVIDIA and Mistral AI's Mistral-NeMo-Minitron 8B Model: A Leap Forward in LLM Efficiency

NVIDIA and Mistral AI have introduced the Mistral-NeMo-Minitron 8B model, an advanced large language model (LLM) that delivers exceptional accuracy across nine popular benchmarks. This model is a pruned and distilled version of the Mistral NeMo 12B, maintaining high performance while being significantly more efficient.

Pruning and Distillation: Key Techniques

The Mistral-NeMo-Minitron 8B model is created using NVIDIA’s proven approach of model pruning and knowledge distillation. Pruning reduces the model size by removing less critical parts—specifically, by focusing on width pruning, which targets neurons, attention heads, and embedding channels. This approach, when combined with light retraining through knowledge distillation, yields a smaller, faster, and more resource-efficient model without compromising much on predictive power.

| Model | Training tokens | WinoGrande 5-shot | ARC Challenge 25-shot | MMLU 5-shot | HellaSwag 10-shot | GSM8K 5-shot | TruthfulQA 0-shot | XLSum en (20%) 3-shot | MBPP 0-shot | HumanEval 0-shot |
|---|---|---|---|---|---|---|---|---|---|---|
| Llama 3.1 8B | 15T | 77.27 | 57.94 | 65.28 | 81.80 | 48.60 | 45.06 | 30.05 | 42.27 | 24.76 |
| Gemma 7B | 6T | 78 | 61 | 64 | 82 | 50 | 45 | 17 | 39 | 32 |
| Mistral-NeMo-Minitron 8B | 380B | **80.35** | **64.42** | **69.51** | **83.03** | **58.45** | **47.56** | **31.94** | **43.77** | **36.22** |
| Mistral NeMo 12B | N/A | 82.24 | 65.10 | 68.99 | 85.16 | 56.41 | 49.79 | 33.43 | 42.63 | 23.78 |

Table 1. Accuracy of the Mistral-NeMo-Minitron 8B base model compared to the teacher Mistral NeMo 12B, Gemma 7B, and Llama 3.1 8B base models. Bold numbers represent the best scores among the 8B model class.

Model Pruning involves slimming down the model either by dropping entire layers (depth pruning) or by reducing the size of internal components such as the MLP intermediate dimension, hidden size, or number of attention heads (width pruning). In this case, width pruning was employed: the MLP intermediate dimension and hidden size were reduced, while the number of attention heads and layers was kept unchanged. Width pruning was preferred because it consistently outperforms depth pruning at this scale.
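To make the mechanics of width pruning concrete, here is a minimal sketch that trims the intermediate dimension of a single feed-forward (MLP) block by keeping only the highest-scoring channels. The module names (up_proj, down_proj), the toy dimensions, and the random scores are illustrative assumptions, not the actual Minitron implementation.

```python
import torch
import torch.nn as nn

def prune_mlp_intermediate(mlp: nn.Module, importance: torch.Tensor, target_dim: int) -> None:
    """Width-prune one MLP block in place, keeping the `target_dim`
    highest-importance intermediate channels.

    Assumes a simple two-layer MLP: mlp.up_proj (hidden -> intermediate)
    followed by mlp.down_proj (intermediate -> hidden). Real architectures
    (e.g. gated MLPs) have more projections, but the slicing idea is the same.
    """
    keep = torch.topk(importance, k=target_dim).indices.sort().values

    # Shrink the output rows of the up-projection ...
    mlp.up_proj.weight = nn.Parameter(mlp.up_proj.weight[keep, :].clone())
    if mlp.up_proj.bias is not None:
        mlp.up_proj.bias = nn.Parameter(mlp.up_proj.bias[keep].clone())
    mlp.up_proj.out_features = target_dim

    # ... and the matching input columns of the down-projection.
    mlp.down_proj.weight = nn.Parameter(mlp.down_proj.weight[:, keep].clone())
    mlp.down_proj.in_features = target_dim


class ToyMLP(nn.Module):
    def __init__(self, hidden=16, intermediate=64):
        super().__init__()
        self.up_proj = nn.Linear(hidden, intermediate)
        self.down_proj = nn.Linear(intermediate, hidden)

    def forward(self, x):
        return self.down_proj(torch.relu(self.up_proj(x)))


mlp = ToyMLP()
scores = torch.rand(64)                  # stand-in for real importance scores
prune_mlp_intermediate(mlp, scores, 32)  # 64 -> 32 intermediate channels
print(mlp(torch.randn(2, 16)).shape)     # torch.Size([2, 16])
```

The key point the sketch illustrates is that width pruning removes matched slices from adjacent weight matrices, so the block's input and output shapes (and therefore the rest of the network) are untouched.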

Knowledge Distillation serves to transfer knowledge from a larger "teacher" model (in this case, the Mistral NeMo 12B) to the smaller "student" model (Mistral-NeMo-Minitron 8B). This step involves retraining the pruned model with a smaller dataset, ensuring it maintains high accuracy while being faster and more efficient.
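A minimal sketch of a distillation objective follows, assuming a standard forward-KL loss between the teacher's and student's softened next-token distributions. The temperature and reduction are illustrative choices, not necessarily those used for Mistral-NeMo-Minitron 8B.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 1.0) -> torch.Tensor:
    """KL divergence between the teacher's and the student's softened
    next-token distributions.

    Both logit tensors have shape (batch, seq_len, vocab_size); the
    teacher's logits are treated as fixed targets (no gradient).
    """
    t_logprobs = F.log_softmax(teacher_logits.detach() / temperature, dim=-1)
    s_logprobs = F.log_softmax(student_logits / temperature, dim=-1)
    # Forward KL: sum_v p_teacher(v) * (log p_teacher(v) - log p_student(v))
    kl = F.kl_div(s_logprobs, t_logprobs, log_target=True, reduction="batchmean")
    return kl * temperature ** 2  # common temperature scaling for KD


# Toy usage: in practice these come from teacher/student forward passes.
teacher_logits = torch.randn(2, 8, 100)
student_logits = torch.randn(2, 8, 100, requires_grad=True)
loss = distillation_loss(student_logits, teacher_logits, temperature=2.0)
loss.backward()
```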

Process and Results

  • Teacher Fine-Tuning: The unpruned 12B model was fine-tuned with 127 billion tokens to correct distribution shifts, ensuring optimal performance during distillation.
  • Width Pruning: Importance scores were calculated for pruning, focusing on compressing the MLP intermediate dimension and the hidden size while maintaining the number of attention heads and layers (see the sketch after this list for one way such scores can be estimated).
  • Distillation: The pruned model was distilled using a carefully controlled training process, ensuring minimal loss of accuracy.
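The exact importance metric is not detailed above; one commonly assumed heuristic is to score channels by their mean absolute activation on a small calibration set. The sketch below illustrates that idea for the intermediate channels of the ToyMLP from the width-pruning example earlier; the calibration data and scoring statistic are assumptions for illustration only.

```python
import torch

@torch.no_grad()
def activation_importance(mlp, calibration_batches):
    """Score each intermediate channel by its mean absolute activation
    over a small calibration set (an assumed, common heuristic).

    Reuses the ToyMLP from the width-pruning sketch above: importance is
    measured at the output of mlp.up_proj, before the down-projection.
    """
    total = torch.zeros(mlp.up_proj.out_features)
    count = 0
    for x in calibration_batches:
        acts = torch.relu(mlp.up_proj(x))   # (batch, intermediate)
        total += acts.abs().mean(dim=0)     # per-channel mean |activation|
        count += 1
    return total / count


# Toy usage: a handful of random "calibration" batches.
mlp = ToyMLP()
batches = [torch.randn(4, 16) for _ in range(8)]
scores = activation_importance(mlp, batches)
prune_mlp_intermediate(mlp, scores, target_dim=32)
```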

The result is a model that not only rivals but often surpasses similar-sized models in accuracy, while also being significantly more resource-efficient.

Conclusion and Future Directions

The Mistral-NeMo-Minitron 8B demonstrates the effectiveness of structured weight pruning combined with knowledge distillation, offering a scalable approach to building efficient and high-performing LLMs. NVIDIA plans to continue refining these techniques, with future efforts aimed at creating even smaller, more accurate models. These innovations will be gradually integrated into the NVIDIA NeMo framework, further advancing the field of generative AI.

This advancement highlights NVIDIA’s commitment to pushing the boundaries of AI model efficiency, making powerful AI more accessible and cost-effective.

To read the full report, see the official NVIDIA Developer blog.