With increasing interest among our customers in open-source (OSS) LLMs, primarily due to data protection and privacy concerns, they have become a significant area of exploration for us.

Mixtral 8x7b is a "mixture of experts" (MoE) model released under the very permissive Apache 2.0 open-source license. You are essentially free to use, distribute, and modify it as you wish, provided you give appropriate credit. (More info about Mixtral)

But how good is it? Could it be used as a replacement for the OpenAI models, specifically for a "Retrieval Augmented Generation" (RAG) approach as we do with ZüriCityGPT, the ask-me-anything chatbot about the city of Zurich? Is it on par with GPT-3.5, which we normally use due to cost and speed?

TL;DR: Test and compare Mixtral 8x7b with GPT-3.5 at mixtral.zuericitygpt.ch on ZüriCityGPT's data.

And so, I embarked on this journey.

First, I needed a way to serve chat completion requests for this model. Lacking modern GPUs in my arsenal, I opted for runpod.io. After quite a struggle with the details of setting up an OSS LLM, I finally got it running with very decent performance (using the EXL2 format with the exllama2_HF loader).

I used the 3.5bpw quantized model since it fits perfectly into a 24GB VRAM GPU, keeping the cost of such a setup on the lower side. Quantization is important to fit a model into as little VRAM as possible; the unquantized Mixtral model needs about 80GB of VRAM. A quantization of 3.5 bits per weight is maybe on the low side for good quality, but that was part of the test: to see what's possible with 24GB of VRAM (which a high-end consumer card like the RTX 3090/4090 offers).
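As a back-of-the-envelope check (the parameter count is approximate, and this ignores the KV cache and other overhead), the memory needed for the weights scales linearly with the bits per weight:

```python
MIXTRAL_PARAMS = 46.7e9  # approximate total parameter count of Mixtral 8x7b

def weight_gib(bits_per_weight: float, n_params: float = MIXTRAL_PARAMS) -> float:
    """Rough VRAM needed for the weights alone, in GiB (ignores KV cache etc.)."""
    return n_params * bits_per_weight / 8 / 2**30

fp16_gib = weight_gib(16)   # roughly 87 GiB: far too big for any consumer card
q35_gib = weight_gib(3.5)   # roughly 19 GiB: fits in 24GB VRAM with room for context
```

Which is why 3.5bpw is about the highest quantization level that still leaves headroom for the context on a 24GB card.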

Next, I made the ZüriCityGPT code base capable of handling different models and providers. Fortunately, that wasn't a big task, but it still required some refactoring. It was definitely worth it: we can now experiment with any model or API provider, and adding a new one is easy (since most providers offer OpenAI API compatibility).
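A minimal sketch of what that abstraction boils down to (the registry entries, URLs, and model names here are illustrative, not our actual configuration): since most providers speak the OpenAI chat-completions wire format, switching providers is mostly a matter of a different base URL, API key, and model name.

```python
import json
import urllib.request

# Hypothetical provider registry; URLs and model names are illustrative.
PROVIDERS = {
    "mistral": {"base_url": "https://api.mistral.ai/v1", "model": "mistral-small"},
    "selfhosted": {"base_url": "http://localhost:5000/v1", "model": "mixtral-8x7b"},
}

def chat(provider: str, api_key: str, question: str) -> str:
    """Send one question to a provider via the OpenAI-compatible endpoint."""
    cfg = PROVIDERS[provider]
    req = urllib.request.Request(
        f'{cfg["base_url"]}/chat/completions',
        data=json.dumps({
            "model": cfg["model"],
            "messages": [{"role": "user", "content": question}],
        }).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

In practice you'd use an OpenAI client library instead of raw HTTP, but the point stands: one code path, many providers.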

Lastly, I extended our frontend and added the possibility to query several providers at once for an easy comparison of the speed and quality of the responses.
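The fan-out itself is straightforward; a sketch using a thread pool, where `chat_fn` stands in for whatever per-provider call you have:

```python
from concurrent.futures import ThreadPoolExecutor

def query_all(question, providers, chat_fn):
    """Send the same question to every provider in parallel.

    providers: list of provider names
    chat_fn:   callable (provider_name, question) -> answer string
    Returns a dict mapping provider name to its answer.
    """
    with ThreadPoolExecutor(max_workers=len(providers)) as pool:
        futures = {name: pool.submit(chat_fn, name, question) for name in providers}
        return {name: future.result() for name, future in futures.items()}
```

The real frontend streams the answers side by side, but the parallel fan-out is the essential part of the comparison setup.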

Check it out at mixtral.zuericitygpt.ch

See for yourself. Head over to mixtral.zuericitygpt.ch and see the differences in performance and quality between Azure OpenAI's GPT-3.5 model, our self-hosted Mixtral 8x7b (for as long as we keep paying for it over the next few days), and the Mistral AI API endpoint for the mistral-small model, which is also Mixtral 8x7b. The instructions in the prompt are exactly the same for all three models.

Results and Comparison

The answers from GPT-3.5 come in much faster (around 100 tokens/sec), and it also starts answering significantly sooner.

Our “self-hosted” model on runpod.io with an RTX 3090 GPU has a speed of about 40-45 tokens/sec, which is usually faster than one can read.

The model hosted by mistral.ai is a bit of a mixed bag. When all goes well, it also answers at about 50 tokens/sec. But more often than not, it takes quite some time before it starts answering. Since the Mistral API only recently became available (there's still a waiting list for it), this could just be scaling issues in Mistral's launch phase.

GPT-4 Turbo, for comparison, manages about 10 tokens/sec. And an RTX 4090 is about 10-20% faster than the 3090.

The tokens/sec numbers above include the "time to first token"; they measure the total time from sending the request until receiving the whole answer. In my opinion, that's the more useful number, even though it's not very scientific for a thorough, in-depth comparison.
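A minimal sketch of how such an end-to-end rate can be measured (it assumes a streamed response; any iterable of tokens works), with the clock started before the request so the time to first token is included:

```python
import time

def end_to_end_tokens_per_sec(token_stream) -> float:
    """Tokens/sec measured from request time to last token.

    Start the clock before iterating, so the time to first token
    counts against the rate, matching the numbers quoted above.
    """
    start = time.perf_counter()
    n_tokens = 0
    for _ in token_stream:
        n_tokens += 1
    elapsed = time.perf_counter() - start
    return n_tokens / elapsed
```

A slow-starting provider therefore shows a lower rate here even if its raw generation speed is fine.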

I also tested the mistral-tiny model with mistral.ai (a small 7b model). Unsurprisingly, the answers were not up to par with the other models. They weren't bad, just a bit unpredictable in quality; not something I'd use in production for ZüriCityGPT.

The Mistral models also usually give longer answers than GPT-3.5. With some prompt engineering, I could certainly adjust that one way or another. But in general, the answers are comparable in quality. The German is also fine, which is often a problem with OSS LLMs (many are not fine-tuned on German). I didn't test French, but I very much assume it's on a similar level.

Costs and Scalability

As mentioned, this all runs on runpod.io with an RTX 3090 pod with 24GB of VRAM. That costs about $300-350/month, which is reasonably priced, depending on your needs and budget. It doesn't really scale, of course; it can only reasonably handle one request at a time (and a request takes several seconds). Scaling up automatically is not that easy with this setup. With runpod.io's serverless offering, for example, it could certainly be done somehow; I didn't look closely into it, though.
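For illustration, assuming an on-demand rate of roughly $0.44/hour for an RTX 3090 pod (the exact rate is an assumption and varies), the monthly figure works out as:

```python
hourly_usd = 0.44                    # assumed RTX 3090 on-demand rate; varies
monthly_usd = hourly_usd * 24 * 30   # pod running around the clock, ~$317/month
```

which lands in the $300-350/month range quoted above.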

Embeddings

I also quickly tested generating embeddings for the vector search with an OSS model. That worked fine and performed well, but I only tested it with English sources. The currently available models for this are usually not multilingual, but I'm sure that will change soon, for example with the embedding capabilities of Mistral itself.
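Whichever model produces the embeddings, the vector search itself is just a nearest-neighbour lookup, typically over cosine similarity; a dependency-free sketch (the corpus layout is illustrative):

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, corpus, k=3):
    """corpus: list of (doc_id, embedding) pairs; returns the k best-matching ids."""
    ranked = sorted(corpus, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]
```

In production a vector database does this (with indexing instead of a full scan), but swapping the embedding model only changes the vectors, not the retrieval logic.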

For this experiment, I still used the embeddings from OpenAI. But in theory (and also in practice), we are now able to run a chatbot where no data leaves our own controlled servers.

How to recreate this setup for yourself

If you want to recreate the whole thing for yourself, there’s a short step-by-step guide on my personal blog (also with links to all the components used).

Final thoughts

The Mixtral 8x7b model is very well suited for a RAG chatbot like ZüriCityGPT. The quality of the answers is, in my humble opinion, as good as with GPT-3.5, also in German. And the speed is more than adequate for a chatbot.

And it all fits into a consumer-grade GPU with 24GB of VRAM, which you can rent for about $300/month on runpod.io (or buy the actual card for ~1,500 CHF if you really want to self-host).

On the other hand, you don't get scalability with this setup. If you have peak times with several queries at once, you need to scale up, which incurs more costs. Using a hosted API, be it OpenAI, mistral.ai, or, for example, perplexity.ai, will still be much less hassle maintenance-wise and in almost every case also much cheaper.

If privacy and data protection are essential or even required for you, then this might be the time to look into it and explore the possibilities of self-hosting an OSS LLM model. Or if you just want to have some fun and not be dependent on the whims of Big Corps.

For ZüriCityGPT, we will continue using Azure OpenAI's GPT-3.5. It offers speed and affordability, is quite maintenance-free, and since we are dealing with publicly available data, there are no data protection concerns (which Azure also addresses to a great degree). But we could now run a RAG chatbot with our framework completely self-hosted, where no data leaves our control.

The OSS LLM scene, and currently Mistral in particular, remains worth watching closely and incorporating into one's toolkit. 2024 is going to be interesting.

Photo by Mathieu CHIRICO on Unsplash