
NVIDIA and Mistral AI's Mistral-NeMo-Minitron 8B Model: A Leap Forward in LLM Efficiency

NVIDIA and Mistral AI have introduced the Mistral-NeMo-Minitron 8B model, an advanced large language model (LLM) that delivers exceptional accuracy across nine popular benchmarks. This model is a pruned and distilled version of the Mistral NeMo 12B, maintaining high performance while being significantly more efficient.

Pruning and Distillation: Key Techniques

The Mistral-NeMo-Minitron 8B model is created using NVIDIA’s proven approach of model pruning and knowledge distillation. Pruning reduces the model size by removing less critical parts—specifically, by focusing on width pruning, which targets neurons, attention heads, and embedding channels. This approach, when combined with light retraining through knowledge distillation, yields a smaller, faster, and more resource-efficient model without compromising much on predictive power.

| Model | Training tokens | WinoGrande 5-shot | ARC Challenge 25-shot | MMLU 5-shot | HellaSwag 10-shot | GSM8K 5-shot | TruthfulQA 0-shot | XLSum en (20%) 3-shot | MBPP 0-shot | HumanEval 0-shot |
|---|---|---|---|---|---|---|---|---|---|---|
| Llama 3.1 8B | 15T | 77.27 | 57.94 | 65.28 | 81.80 | 48.60 | 45.06 | 30.05 | 42.27 | 24.76 |
| Gemma 7B | 6T | 78 | 61 | 64 | 82 | 50 | 45 | 17 | 39 | 32 |
| Mistral-NeMo-Minitron 8B | 380B | **80.35** | **64.42** | **69.51** | **83.03** | **58.45** | **47.56** | **31.94** | **43.77** | **36.22** |
| Mistral NeMo 12B | N/A | 82.24 | 65.10 | 68.99 | 85.16 | 56.41 | 49.79 | 33.43 | 42.63 | 23.78 |

Table 1. Accuracy of the Mistral-NeMo-Minitron 8B base model compared to the teacher Mistral NeMo 12B, Gemma 7B, and Llama 3.1 8B base models. Bold numbers represent the best scores among the 8B model class.

Model Pruning involves slimming down the model either by dropping entire layers (depth pruning) or by reducing the size of internal components such as the MLP intermediate dimension, hidden size, or number of attention heads (width pruning). In this case, width pruning was employed: the MLP intermediate dimension and hidden size were reduced, while the number of attention heads and layers was kept unchanged. Width pruning was preferred because it consistently outperforms depth pruning at this scale.
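To make the mechanics of width pruning concrete, here is a minimal sketch that trims the intermediate dimension of a single feed-forward (MLP) block by keeping only the highest-scoring channels. The module names (up_proj, down_proj), the toy dimensions, and the random scores are illustrative assumptions, not the actual Minitron implementation.

```python
import torch
import torch.nn as nn

def prune_mlp_intermediate(mlp: nn.Module, importance: torch.Tensor, target_dim: int) -> None:
    """Width-prune one MLP block in place, keeping the `target_dim`
    highest-importance intermediate channels.

    Assumes a simple two-layer MLP: mlp.up_proj (hidden -> intermediate)
    followed by mlp.down_proj (intermediate -> hidden). Real architectures
    (e.g. gated MLPs) have more projections, but the slicing idea is the same.
    """
    keep = torch.topk(importance, k=target_dim).indices.sort().values

    # Shrink the output rows of the up-projection ...
    mlp.up_proj.weight = nn.Parameter(mlp.up_proj.weight[keep, :].clone())
    if mlp.up_proj.bias is not None:
        mlp.up_proj.bias = nn.Parameter(mlp.up_proj.bias[keep].clone())
    mlp.up_proj.out_features = target_dim

    # ... and the matching input columns of the down-projection.
    mlp.down_proj.weight = nn.Parameter(mlp.down_proj.weight[:, keep].clone())
    mlp.down_proj.in_features = target_dim


class ToyMLP(nn.Module):
    def __init__(self, hidden=16, intermediate=64):
        super().__init__()
        self.up_proj = nn.Linear(hidden, intermediate)
        self.down_proj = nn.Linear(intermediate, hidden)

    def forward(self, x):
        return self.down_proj(torch.relu(self.up_proj(x)))


mlp = ToyMLP()
scores = torch.rand(64)                  # stand-in for real importance scores
prune_mlp_intermediate(mlp, scores, 32)  # 64 -> 32 intermediate channels
print(mlp(torch.randn(2, 16)).shape)     # torch.Size([2, 16])
```

The key point the sketch illustrates is that width pruning removes matched slices from adjacent weight matrices, so the block's input and output shapes (and therefore the rest of the network) are untouched.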

Knowledge Distillation serves to transfer knowledge from a larger "teacher" model (in this case, the Mistral NeMo 12B) to the smaller "student" model (Mistral-NeMo-Minitron 8B). This step involves retraining the pruned model with a smaller dataset, ensuring it maintains high accuracy while being faster and more efficient.
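A minimal sketch of a distillation objective follows, assuming a standard forward-KL loss between the teacher's and student's softened next-token distributions. The temperature and reduction are illustrative choices, not necessarily those used for Mistral-NeMo-Minitron 8B.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 1.0) -> torch.Tensor:
    """KL divergence between the teacher's and the student's softened
    next-token distributions.

    Both logit tensors have shape (batch, seq_len, vocab_size); the
    teacher's logits are treated as fixed targets (no gradient).
    """
    t_logprobs = F.log_softmax(teacher_logits.detach() / temperature, dim=-1)
    s_logprobs = F.log_softmax(student_logits / temperature, dim=-1)
    # Forward KL: sum_v p_teacher(v) * (log p_teacher(v) - log p_student(v))
    kl = F.kl_div(s_logprobs, t_logprobs, log_target=True, reduction="batchmean")
    return kl * temperature ** 2  # common temperature scaling for KD


# Toy usage: in practice these come from teacher/student forward passes.
teacher_logits = torch.randn(2, 8, 100)
student_logits = torch.randn(2, 8, 100, requires_grad=True)
loss = distillation_loss(student_logits, teacher_logits, temperature=2.0)
loss.backward()
```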

Process and Results

  • Teacher Fine-Tuning: The unpruned 12B model was fine-tuned with 127 billion tokens to correct distribution shifts, ensuring optimal performance during distillation.
  • Width Pruning: Importance scores were calculated for pruning, focusing on compressing the MLP intermediate dimension and the hidden size while maintaining the number of attention heads and layers (see the sketch after this list for one way such scores can be estimated).
  • Distillation: The pruned model was distilled using a carefully controlled training process, ensuring minimal loss of accuracy.
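The exact importance metric is not detailed above; one commonly assumed heuristic is to score channels by their mean absolute activation on a small calibration set. The sketch below illustrates that idea for the intermediate channels of the ToyMLP from the width-pruning example earlier; the calibration data and scoring statistic are assumptions for illustration only.

```python
import torch

@torch.no_grad()
def activation_importance(mlp, calibration_batches):
    """Score each intermediate channel by its mean absolute activation
    over a small calibration set (an assumed, common heuristic).

    Reuses the ToyMLP from the width-pruning sketch above: importance is
    measured at the output of mlp.up_proj, before the down-projection.
    """
    total = torch.zeros(mlp.up_proj.out_features)
    count = 0
    for x in calibration_batches:
        acts = torch.relu(mlp.up_proj(x))   # (batch, intermediate)
        total += acts.abs().mean(dim=0)     # per-channel mean |activation|
        count += 1
    return total / count


# Toy usage: a handful of random "calibration" batches.
mlp = ToyMLP()
batches = [torch.randn(4, 16) for _ in range(8)]
scores = activation_importance(mlp, batches)
prune_mlp_intermediate(mlp, scores, target_dim=32)
```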

The result is a model that not only rivals but often surpasses similar-sized models in accuracy, while also being significantly more resource-efficient.

Conclusion and Future Directions

The Mistral-NeMo-Minitron 8B demonstrates the effectiveness of structured weight pruning combined with knowledge distillation, offering a scalable approach to building efficient and high-performing LLMs. NVIDIA plans to continue refining these techniques, with future efforts aimed at creating even smaller, more accurate models. These innovations will be gradually integrated into the NVIDIA NeMo framework, further advancing the field of generative AI.

This advancement highlights NVIDIA’s commitment to pushing the boundaries of AI model efficiency, making powerful AI more accessible and cost-effective.

To read the full report, see the official NVIDIA Developer blog.