RAG is not enough: Lessons from Beating GPT-3.5 on Specialized Tasks with Mistral 7B

Abstract

We present a straightforward approach to customizing small, open-source models using fine-tuning and RAG that outperforms GPT-3.5 for specialized use cases. With it, we achieved superior Q&A results on technical documentation for a small codebase.

In short, (1) extending a general foundation model like Mistral with strong math and coding capabilities, (2) training it on a high-quality synthetic dataset generated from the intended corpus, and (3) adding RAG capabilities can lead to significant accuracy improvements.

Problems still arise with catastrophic forgetting on general tasks, which is commonly observed during specialized domain fine-tuning. In our case, this is likely exacerbated by our lack of access to Mistral’s original training dataset and by the various compression techniques used in our approach to keep the model small.

Selecting a strong foundation model

Mistral 7B outshines both Meta's Llama-2 7B and Google's Gemma 7B in key benchmarks, making it our choice for a base model. Starting with a strong foundation like Mistral allowed us to achieve greater accuracy in our specialized adaptations.

Figure 1. Mistral 7B excels in benchmarks, ranking among the top foundational models.

Note: We are not sponsored by the Mistral team, though lots of folks like to run Mistral locally using Jan, our desktop client.

Cost-effectively improving the base model

Our technical use case required excellent math capabilities, an area where Mistral can underperform. Thus, we tested a number of Mistral model variants, from foundation models to finetunes to model merges, to find a stronger base model before proceeding to finetuning.

Figure 2: The merged model, Stealth, doubles the mathematical capabilities of its foundational model while retaining the performance in other tasks.

We found merging models is quick and cost-effective, enabling fast adjustments based on the result of each iteration.

We ended up with Stealth 7B v1.1, a SLERP merge of Mistral with the following:

WizardMath for its math capabilities.
WizardCoder for its coding capabilities.
Our own Trinity model for its versatility across general tasks.

This particular combination yielded the best tradeoff across mathematical & technical reasoning while retaining the most pre-merge performance on general tasks.
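To make the merge step concrete, here is a minimal sketch of spherical linear interpolation (SLERP) applied to two flat weight vectors. It is a toy illustration only: the merge tooling, the interpolation factor `t`, and how the three models were combined pairwise are not stated in this post, so the function name and defaults below are assumptions.

```python
import math

def slerp(t, v0, v1, eps=1e-8):
    """Spherically interpolate between two equal-length weight vectors.

    t = 0 returns v0, t = 1 returns v1. Hypothetical helper, not the
    tooling used to build Stealth.
    """
    norm0 = math.sqrt(sum(x * x for x in v0))
    norm1 = math.sqrt(sum(x * x for x in v1))
    # Cosine of the angle between the vectors, clamped to a valid range.
    dot = sum(a * b for a, b in zip(v0, v1)) / (norm0 * norm1 + eps)
    dot = max(-1.0, min(1.0, dot))
    theta = math.acos(dot)
    sin_theta = math.sin(theta)
    if abs(sin_theta) < 1e-6:
        # Nearly colinear vectors: fall back to plain linear interpolation.
        return [(1 - t) * a + t * b for a, b in zip(v0, v1)]
    s0 = math.sin((1 - t) * theta) / sin_theta
    s1 = math.sin(t * theta) / sin_theta
    return [s0 * a + s1 * b for a, b in zip(v0, v1)]
```

In a real merge, a function like this would typically be applied tensor by tensor across two checkpoints with matching architectures. Unlike a plain average, it follows the arc between the two weight directions rather than the straight line through them.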

DPO finetuning

Merging different LLMs can lead to a mixed answering style because each model was originally trained on different types of data.

Thus, we applied Direct Preference Optimization (DPO) using Intel's Orca DPO pairs dataset, chosen for its generally helpful answering style and its concentration of math and coding examples.

This approach produced a final model, Stealth 7B v1.2, aligned to our technical preferences and demonstrating minimal loss.
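For readers unfamiliar with DPO, the sketch below computes the standard DPO loss for a single preference pair from summed log-probabilities. It is not the training code used for Stealth: the function name is hypothetical and `beta=0.1` is an assumed value, since the post does not report the one used.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair.

    Inputs are summed token log-probabilities of each answer under the
    policy being trained and under the frozen reference model.
    beta is an assumed hyperparameter, not a value from the post.
    """
    chosen_reward = policy_chosen_logp - ref_chosen_logp
    rejected_reward = policy_rejected_logp - ref_rejected_logp
    x = beta * (chosen_reward - rejected_reward)
    # -log(sigmoid(x)) written as a numerically stable softplus(-x).
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))
```

The loss falls as the policy assigns relatively more probability to the preferred answer than the reference model does, which is how a consistent answering style can be reinforced without training a separate reward model.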

Using our own technical documentation

With the base model ready, we started on our specific use case.

Jan is an open-source project enjoying strong growth, but at one point we began receiving a new support ticket every minute, which quickly overwhelmed our bootstrapped resources.

So, we directed our efforts toward training a model to answer user questions based on existing technical documentation.

Specifically, we trained it on the Nitro documentation. For context, Nitro is the default inference engine for Jan. It’s an enterprise-ready server implementation of LlamaCPP, written in C++, with multimodal support, queues, and other production-level server capabilities.

It made an interesting corpus because it was rife with post-2023 technical jargon, edge cases, and poor informational layout.

Generating training data

The first step was to transform Nitro’s unstructured format into a synthetic Q&A dataset designed for instruction tuning.

The text was split into 300-token chunks with 30-token overlaps. This helped to avoid the lost-in-the-middle problem, where an LLM can’t use context efficiently to answer given questions.
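The chunking scheme can be sketched as a sliding window over a token list. This is an illustration under assumptions: the post does not say which tokenizer or splitter was used, so the function below is hypothetical and works on any pre-tokenized list.

```python
def chunk_tokens(tokens, chunk_size=300, overlap=30):
    """Split a token list into fixed-size chunks that share `overlap` tokens.

    Hypothetical helper; the tokenizer that produces `tokens` is not
    specified in the post.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks
```

For example, `chunk_tokens(text.split())` would use whitespace-separated words as a rough stand-in for model tokens. The overlap means a sentence that straddles a chunk boundary still appears intact in at least one chunk.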

The chunks were then given to GPT-4 with 8k context length to generate 3800 Q&A pairs. The training dataset is available on Hugging Face.

Training

Training was done with supervised finetuning (SFT) using Hugging Face's alignment-handbook, per Hugging Face's Zephyr Beta guidelines.

We used consumer-grade, dual Nvidia RTX 4090s for the training. The end-to-end training took 18 minutes. We found optimal hyperparameters in LoRA for this specific task to be r = 256 and alpha = 512.
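To show what r = 256 and alpha = 512 mean in practice, the sketch below applies a LoRA update to a single linear layer in plain Python. It is a toy model of the technique rather than the alignment-handbook configuration: the helper names are hypothetical, and which layers were adapted is not stated in the post.

```python
def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]

def lora_scaling(r, alpha):
    """LoRA scales the low-rank update by alpha / r."""
    return alpha / r

def lora_param_count(d_in, d_out, r):
    """Trainable parameters added by one adapter: A is r x d_in, B is d_out x r."""
    return r * (d_in + d_out)

def lora_forward(x, W, A, B, r, alpha):
    """Compute y = W x + (alpha / r) * B (A x) with W kept frozen."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = lora_scaling(r, alpha)
    return [b + scale * u for b, u in zip(base, update)]
```

With the reported values, `lora_scaling(256, 512)` is 2.0, so the learned update is doubled before being added to the frozen weights. Assuming a 4096 x 4096 projection (the hidden size of Mistral 7B), `lora_param_count(4096, 4096, 256)` gives 2,097,152 trainable parameters, compared with 16,777,216 in the full matrix.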

This final model can be found here on Hugging Face.

Figure 3. Using the new finetuned model in Jan

Improving results with RAG

As an additional step, we also added Retrieval Augmented Generation (RAG) as an experiment parameter.

A simple RAG setup was done using LlamaIndex and the bge-en-base-v1.5 embedding model for efficient documentation retrieval and question-answering. You can find the RAG implementation here.
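The retrieval step in a setup like this can be sketched in a few lines. The example below is not the LlamaIndex implementation linked above: it uses a toy bag-of-words vector as a stand-in for the embedding model, and every function name and the prompt wording are hypothetical.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words vector; a stand-in for a real embedding model."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, chunks, k=2):
    """Return the k chunks most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, embed(c)), reverse=True)
    return ranked[:k]

def build_prompt(query, chunks, k=2):
    """Assemble a prompt that places the retrieved chunks ahead of the question."""
    context = "\n\n".join(retrieve(query, chunks, k))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )
```

The string returned by `build_prompt` is what would be sent to the model. A production setup would swap `embed` for a learned embedding model and precompute chunk vectors in an index rather than re-embedding on every query.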

Benchmarking the Results

We curated a new set of 50 multiple-choice questions (MCQ) based on the Nitro docs. The questions varied in difficulty and included trick components that challenged the model's ability to discern misleading information.

Figure 4. Comparison between the finetuned model and OpenAI's GPT

Results

GPT-3.5 with RAG: 56.7%
GPT-4 with RAG: 64.3%
Merged 7B Model (Stealth 7B) with RAG: 47.7%
Finetuned 7B Model (Nitro 7B) with RAG: 57.8%

This indicates that with task-specific training, we can improve an open-source small language model to the level of GPT-3.5 on domain knowledge.

Notably, the finetuned + RAG approach also demonstrated more consistency across benchmarking, as indicated by its lower standard deviation.
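Scores like these, and the consistency comparison, can be computed as sketched below. The post does not state how many benchmark runs were made or which standard deviation was used, so the sample standard deviation and the helper names here are assumptions.

```python
from statistics import mean, stdev

def accuracy(predictions, answer_key):
    """Percentage of multiple-choice answers that match the key."""
    if len(predictions) != len(answer_key):
        raise ValueError("predictions and answer_key must have the same length")
    correct = sum(p == a for p, a in zip(predictions, answer_key))
    return 100.0 * correct / len(answer_key)

def summarize(run_accuracies):
    """Mean and sample standard deviation over repeated benchmark runs."""
    return mean(run_accuracies), stdev(run_accuracies)
```

A lower standard deviation across runs means the model's score depends less on sampling noise, which is the sense of "consistency" used here.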

Conclusion

We conclude that this combination of model merging + finetuning + RAG shows promise. This finding is relevant for teams and individuals who need specialized, technical small language models that run in resource-constrained or highly secured environments, where GPT may not be an option.

Anecdotally, we’ve had some success using this model in practice to onboard new team members to the Nitro codebase.

A full research report with more statistics can be found here.
