At the end of last year ChatGPT was all the rage (OK, maybe it still is). But in the meantime not all of us want to support OpenAI's efforts to train it further with our unpaid time and input via a costly API. The solution: open-source AI models!
Since the somewhat unintended leak of the weights of Meta's open-source LLaMA model, we now have an alternative. In March this year, Georgi Gerganov released llama.cpp, a port of Meta's model that can run on a CPU with fairly normal specs instead of the high-end NVIDIA GPUs that large language models normally require. How so? The model is quantized, reduced from 16 bits to 4 bits per weight by turning the floating-point numbers in the matrices into integers. This tweak still gets sensible answers out of the model and speeds up computation by reducing its size. Apparently the hack was put together in an evening. If you're interested in a high-level overview of quantization, I recommend this general introduction by HuggingFace.
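To make the idea of 4-bit quantization concrete, here is a minimal absmax-style sketch in Python. It is not the actual q4_0 algorithm from llama.cpp, just an illustration of the principle: each block of weights is stored as 4-bit integers plus one floating-point scale, so the values survive with only a small error.

```python
import numpy as np

def quantize_q4(weights, block_size=32):
    # Absmax-style 4-bit quantization: one float scale per block of
    # weights, integer values in -8..7. A sketch of the idea only;
    # the real ggml q4_0 format differs in its details.
    blocks = weights.reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 7.0
    scales = np.where(scales == 0, 1.0, scales)  # avoid division by zero
    q = np.clip(np.round(blocks / scales), -8, 7).astype(np.int8)
    return q, scales

def dequantize_q4(q, scales):
    # Recover approximate float weights from the 4-bit integers
    return (q.astype(np.float32) * scales).reshape(-1)

rng = np.random.default_rng(0)
w = rng.normal(size=64).astype(np.float32)
q, scales = quantize_q4(w)
w_approx = dequantize_q4(q, scales)
max_err = float(np.max(np.abs(w - w_approx)))
```

The per-weight storage drops from 16 (or 32) bits to 4 bits plus a shared scale per block, which is where the memory savings in the next paragraph come from.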
LLaMA model versions currently range from 7B to 65B (billion) parameters. Only the two smallest, 7B and 13B, are able to run on a Raspberry Pi 4: with 4-bit quantization their RAM requirements are 3.9 GB and 7.8 GB respectively. More details here.
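Those RAM figures line up roughly with a back-of-envelope calculation: 4 bits per weight plus some extra room for the per-block scales and runtime buffers. A small sketch (the 10% overhead factor is my own rough assumption, not a documented number):

```python
def model_ram_gb(n_params, bits_per_weight=4, overhead=1.1):
    # Weights only, plus ~10% for quantization scales and runtime
    # buffers; the overhead factor is a guess for illustration.
    return n_params * bits_per_weight / 8 / 1e9 * overhead

for n_billion in (7, 13):
    print(f"{n_billion}B parameters -> ~{model_ram_gb(n_billion * 1e9):.1f} GB")
```

The estimates land in the right ballpark of the published requirements; the exact numbers depend on context size and implementation details.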
It’s all in the instructions
In the past weeks I've tested several versions of llama.cpp. The main release, when combined with a web application and a user interface such as oobabooga/text-generation-webui, runs smoothly. It does have a chat and a Q&A mode, but it is mostly focused on text generation and seems a bit clunky to me. The model hasn't been fine-tuned on instructions: you have to add your questions as prompts to a structured text that the bot continuously extends. The answers often aren't usable, and the model seems to hallucinate a lot.
My main goal was to mimic the ChatGPT experience, i.e. find a model that worked well in 'assistant mode'. Several versions of LLaMA have already been fine-tuned to do this; Alpaca is probably the most popular model right now. It was fine-tuned from the 7B LLaMA model on 52K instruction-following examples. This way the researchers trained the model to follow instructions and behave in a similar way to ChatGPT.
A really cool user interface that is easy to install is Dalai LLaMA. Unfortunately, it currently only works with the original LLaMA model and not with Alpaca, at least for me and many other people on the web.
OK, so that’s a lot of animal names. But which model and user interface are best?
Although they vastly outperformed my GPT2-based chatbot, I wasn't happy with the performance of llama.cpp, Dalai LLaMA, or Alpaca when it came to actual dialogue. A model that better mimicked the chatbot experience we've come to like about ChatGPT was Koala. Koala was also trained by fine-tuning Meta's LLaMA on dialogue data gathered from the web. The difference is that it was evaluated by 100 human users instead of the 5 who assessed Alpaca.
The reason ChatGPT recapitulates realistic dialogue so well is that it was evaluated by an unspecified number of human trainers. OpenAI calls this Reinforcement Learning from Human Feedback (RLHF). The underlying technology is otherwise pretty much the same for all these models: transformers.
A great thing about Koala is that ChaoticByte has already written a neat frontend for a web application: Eucalyptus Chat! And of course you can run it on a Raspberry Pi. The following instructions are for a local setup that will only be accessible within your Raspberry Pi's network. But if you would like to test my server-hosted Koalabot, feel free to do so here. I will include instructions on how to set it up in a later blog post.
Setting up the system and virtual environment
Eucalyptus-Chat is super easy to install, but it's best to do so in a virtual environment to keep it separate from other Python projects on your system.
sudo apt-get update
sudo apt-get install python3-pip python3-dev build-essential libssl-dev libffi-dev python3-setuptools
sudo -H pip3 install --upgrade pip
sudo apt-get install python3-venv
As this model will only run locally, you can install it wherever you like. On your terminal navigate to the directory you’d like to download the files to. This will be your directory for the virtual environment as well.
git clone https://github.com/ChaoticByte/Eucalyptus-Chat.git
cd Eucalyptus-Chat
python3 -m venv koalabotenv
source koalabotenv/bin/activate
From now on, ensure that you are in the virtual environment when you install new modules; it will be marked by the koalabotenv tag in the terminal.
(koalabotenv) pip install -r requirements.txt
Getting the Koala model
Before we start up the API and frontend servers to serve Koala, we have to obtain a Koala model in the ggml format that has already been quantized. You can of course quantize the official Koala model yourself following the instructions on their website, or you can download one of the already-quantized models published on HuggingFace. Here is one that should work well. Please note that it is the 7B-parameter version, which should run without problems on the Raspberry Pi; for Koala, the 7B version requires about 5 GB of RAM. If you'd like to try larger models, you can find them on HuggingFace as well. Be warned: the 13B Koala model requires 16 GB of RAM.
The wget code below should do the trick directly, but I have noticed that it sometimes does not download the full 4.21 GB model but only a few KBs. In that case, download the model from the website and manually copy the koala-7B.ggml.q4_0.bin file directly into the models folder of your Eucalyptus-Chat directory and you're ready to go.
(koalabotenv) cd models
(koalabotenv) wget https://huggingface.co/TheBloke/koala-7B-GGML/resolve/main/koala-7B.ggml.q4_0.bin
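To catch the truncated-download problem mentioned above, you can compare the file's size against the expected ~4.21 GB before starting the servers. A small hypothetical check (the path and expected size are assumptions matching the download above):

```python
import os

# Assumptions based on the wget command above
MODEL_PATH = "models/koala-7B.ggml.q4_0.bin"
EXPECTED_BYTES = int(4.21e9)  # roughly 4.21 GB

def looks_complete(path, expected_bytes, tolerance=0.05):
    # True if the file exists and its size is within 5% of expectation;
    # a few-KB truncated download will return False.
    if not os.path.isfile(path):
        return False
    size = os.path.getsize(path)
    return abs(size - expected_bytes) / expected_bytes <= tolerance
```

Run looks_complete(MODEL_PATH, EXPECTED_BYTES) from your Eucalyptus-Chat directory; if it returns False, delete the file and download it again manually.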
Starting backend and frontend servers
Go back to your Eucalyptus-Chat folder (my code assumes you've downloaded it into your home folder) and start up the API and frontend servers. It might be easiest to start them separately in two terminal windows, for instance by splitting your terminal if it allows that. Ensure you stay in the virtual environment while doing this; if you're in a new terminal, you might need to type source koalabotenv/bin/activate again.
(koalabotenv) cd ~/Eucalyptus-Chat
(koalabotenv) python3 api-server.py -m models/koala-7B.ggml.q4_0.bin
(koalabotenv) python3 frontend-server.py
A few additional details on this step: starting the API server will take a LONG time on the Raspberry Pi, as the model is a huge chunk of data for it to load. The server is ready when you see a note announcing that uvicorn is running on localhost:port. The default host and port for the API server are localhost:7331, and for the frontend server localhost:8080. Additional possible arguments are listed here.
Access the chatbot on http://localhost:8080 after everything has loaded successfully. This is the default URL for the frontend.
Have a chat and see how your Koalabot responds. You can play with the settings as well: a higher max_tokens will give you a longer reply, a higher temperature more randomness (but responses might be incorrect). As you will see, the Raspberry Pi struggles a bit generating the responses, and they will not appear immediately. But it does work, which is quite amazing really!
Error when loading the frontend in a browser: in case the last step does not work, try replacing localhost with 0.0.0.0.
Error loading model: unknown (magic, version) combination: 67676a74, 00000002; is this really a GGML file?
llama.cpp recently released a new quantization method, so some of the new models do not work with older llama.cpp builds and vice versa. Ensure your models are up to date. HuggingFace already provides the newly quantized Koala models at the link I have added here; the old models can be found in the branch previous_llama. The latest Eucalyptus Chat version I have tested worked with the previous model, called koala-7B-4bit-128g.GGML.bin.