The release of the Code Assistant GitHub Copilot to the public in June 2021 marked the beginning of a new kind of helper in the developer's tool belt – alongside established ones such as linters and formatters.
While basic code completion has been on the market for years in varying degrees of sophistication, a tool that understands code and completes it in a meaningful way – beyond simple parameter suggestions – was a novelty.
This blog article shows how to build a state-of-the-art Code Assistant using several open source tools created by Hugging Face 🤗:
- Text Generation Inference, the model inference API
- VSCode extension for TGI, the extension that lets you access the model from Visual Studio Code
- Chat UI, a ChatGPT-like UI for the model
… all via a single docker-compose file 🔥! This file and all the others discussed in this article are available in an accompanying repository.
Wait… Have We Been There Already?
Kite was one of the companies that provided a more advanced variant of code completion and gave up on the task for various reasons. In late 2022 the company gave the following explanation:
First, we failed to deliver our vision of AI-assisted programming because we were 10+ years too early to market, i.e. the tech is not ready yet.
We built the most-advanced AI for helping developers at the time, but it fell short of the 10× improvement required to break through because the state of the art for ML on code is not good enough. You can see this in Github Copilot, which is built by Github in collaboration with Open AI. As of late 2022, Copilot shows a lot of promise but still has a long way to go.
But in "late" 2023 you can run a publicly available model that even beats ChatGPT and older versions of GPT-4 on your personal computer! One year in AI moves blazingly fast and can cover a decade…
Challenge Accepted
Ever since Copilot was released, the open source LLM community has tried its best to replicate its functionality. ChatGPT and GPT-4 raised the bar even higher. The release of StarCoder by the BigCode project was a major milestone for the open LLM community: the first truly powerful large language model for code generation released to the public under a responsible but nonetheless open license. The code wars had begun, and the source was with StarCoder.
While it still performed considerably worse on the HumanEval benchmark than the proprietary, walled-off GPT-4 (67 as of March) and ChatGPT (48.1), scoring 32.9 points, it positioned itself successfully within striking distance.
The releases of Llama 2 and subsequently Code Llama – both by Meta – are also important waypoints. Code Llama achieved an impressive HumanEval pass@1 score of 48.8, beating ChatGPT. A few days later, a new WizardCoder variant built on top of Code Llama achieved 73.2 pass@1, which even surpasses GPT-4's March score!
Why Bother with Self-Hosting?
While Coding Assistant services like GitHub Copilot and tabnine (which allows VPC and air-gapped installs) already exist, there are many reasons to self-host one for your company or even for yourself:
- Full control over all the moving parts, models and software
- The ability to easily fine-tune models on your own data
- No vendor lock-in
- The fact that by now many of the most capable models are public anyway
- Various compliance reasons
On August 22, Hugging Face 🤗 announced an enterprise Code Assistant called SafeCoder, which bundles StarCoder (and other models), an inference endpoint and a VSCode extension into a single managed package. SafeCoder addresses many of the points above, but hides most of its moving parts behind its managed service – by design. Luckily, the main components are open source and readily available. In the following, we will set up everything that is needed to run your very own Coding Assistant, serviced by you.
Prerequisites
The best and most performant way to run LLMs today is by leveraging GPUs or TPUs. This article assumes that you have an NVIDIA GPU with CUDA support and at least 10 gigabytes of VRAM at your disposal. Be sure to install an up-to-date driver and CUDA version. You will also need Docker (or another container engine like Podman) and the NVIDIA Container Toolkit.
First Component: The Inference Engine
The core of the Coding Assistant is the backend that handles the user's completion requests and generates new tokens based on them. For this we will use Hugging Face's Text Generation Inference (TGI), which powers Inference Endpoints and the Inference API – a well-tested and vital part of Hugging Face's infrastructure. Note that the software's license changed recently: from version 1.0 onwards, TGI uses a new license called HFOIL 1.0, which restricts commercial use. Olivier Dehaene, the maintainer of the project, summarises the implications of the license as follows:
building and selling a chat app for example that uses TGI as a backend is ok whatever the version you use
building and selling a Inference Endpoint like experience using TGI 1.0+ requires an agreement with HF
While this summary should give you a basic understanding of what is possible under the license, be sure to consult a lawyer to get a thorough understanding of whether your use case is covered or not.
The Model: WizardCoder
We will use a quantised and optimised version of a SOTA Code Assistant model called WizardCoder. There are several quantisation formats available today: GPTQ, GGML, GGUF… Tom Jobbins aka "TheBloke" gives a good introduction here. Since GGUF is not yet supported by Text Generation Inference, we will stick to GPTQ. For the model to run properly, you will need roughly 10 gigabytes of available VRAM. If you happen to have more than that, feel free to try the 34B model, or the slightly better 34B Phind model, which unfortunately is not yet available in a 13B version. Also, check the "Big Code Models Leaderboard" on Hugging Face regularly to select the best-performing model for your use case.
Setting up Text Generation Inference
Create a docker-compose.yml file with the following contents:
```yaml
version: '3.8'

services:
  text-generation:
    image: ghcr.io/huggingface/text-generation-inference:1.0.3
    environment:
      HUGGING_FACE_HUB_TOKEN: ${HUGGING_FACE_HUB_TOKEN}
    ports:
      - "8080:80"
    volumes:
      - ./data:/data
    command:
      - "--model-id"
      - "${MODEL_ID:-TheBloke/WizardCoder-Python-13B-V1.0-GPTQ}"
      - "--quantize"
      - "${QUANTIZE:-gptq}"
      - "--max-batch-prefill-tokens=${MAX_BATCH_PREFILL_TOKENS:-2048}"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    container_name: text-generation
    restart: always # Ensuring the service always restarts on failure
```
Optionally, create an .env file with:
```bash
# optional, only needed if you want to use a gated model like StarCoder or Code Llama
HUGGING_FACE_HUB_TOKEN=1234
# the model we are going to use
MODEL_ID=TheBloke/WizardCoder-Python-13B-V1.0-GPTQ
# how the model is quantized
QUANTIZE=gptq
MAX_BATCH_PREFILL_TOKENS=2048
```
Finally, use sudo docker compose up -d to run the text generation service. It will now be available at localhost:8080. sudo docker container ls gives you a list of all running container instances. Next, type sudo docker logs text-generation --follow to get live output of the TGI container logs. This is particularly helpful for debugging. As you can see in the logs, TGI will download the model the first time it is run and save it to the data folder that is mounted as a volume inside the container.
To test if everything was set up correctly, try to send the following POST request to your API from a new terminal window/tab:
```bash
curl localhost:8080/generate -X POST -d '{"inputs":"write a python function that gets me all folders in the working directory","parameters":{"max_new_tokens":200}}' -H 'Content-Type: application/json'
```
Now, you should get a response back from the API and also see the request in the container logs! Note that the quality of the response may well be lacking, since we did not configure any generation parameters for our request – this is just to test the basic functionality. You should now have Text Generation Inference up and running on your machine with WizardCoder as the model. Well done!
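If you prefer to script such tests, here is a minimal sketch of the same request in Python, this time with a few explicit generation parameters (the same ones the Chat UI configuration further below uses; the values here are illustrative, not tuned):

```python
# Minimal sketch: query the local TGI /generate endpoint with explicit
# generation parameters. Assumes TGI is listening on localhost:8080 as
# configured in the docker-compose file above.
import requests

payload = {
    "inputs": "write a python function that gets me all folders in the working directory",
    "parameters": {
        "max_new_tokens": 200,
        "temperature": 0.1,
        "top_p": 0.9,
        "repetition_penalty": 1.2,
        "do_sample": True,
    },
}

response = requests.post("http://localhost:8080/generate", json=payload, timeout=120)
response.raise_for_status()

# TGI returns a JSON object containing the generated text
print(response.json()["generated_text"])
```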
Second Component: The VSCode Extension
Next, we will set up a plugin for Visual Studio Code that lets us query TGI conveniently from our IDE! For this we will use Hugging Face's VSCode extension, available from the marketplace. The plugin is actively developed, and a recent update thankfully made it possible to configure the max_new_tokens parameter, which controls how long the model's response can be. A larger value allows longer code to be generated, but also puts more load on the backend.
Setting up the Extension
Once you have installed the plugin, head over to the extension settings. We will need to configure a few parameters:
- First, change the Hugging Face Code: Config Template setting to WizardLM/WizardCoder-Python-34B-V1.0.
- Next, configure the Hugging Face Code: Model ID Or Endpoint setting and change it to http://YOUR-SERVER-ADDRESS-OR-IP:8080/generate (or localhost if TGI runs on the same machine).
To test if everything works as intended, create a new .py file and copy over the following text. Since we are using an instruction model, the model will perform best when prompted properly:
```python
# write a function that lists all text files in a given directory. use type hints and python docstrings
```
Then move your cursor to the end of the comment line and hit enter. You should see a spinning circle at the bottom of the window and shortly after be greeted with some (hopefully functional) code!
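The exact completion differs from run to run, but a successful response should look roughly like this (an illustrative example, not a guaranteed result):

```python
# Illustrative example of the kind of completion to expect – the actual
# output of the model will vary between runs.
from pathlib import Path
from typing import List


def list_text_files(directory: str) -> List[str]:
    """Return the names of all text files in the given directory.

    :param directory: Path of the directory to search.
    :return: A list of file names ending in .txt.
    """
    return [p.name for p in Path(directory).glob("*.txt")]
```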
Third Component: The Chat UI
Would it not be convenient to also be able to access the Code Assistant from your web browser, without needing to open an IDE? Certainly! And this is where another great piece of open source software comes into play: Hugging Face's Chat UI. It is the very same code that drives HuggingChat, a very well put together variant of the familiar ChatGPT UI.
Setting up Chat UI
First, clone the repository and create a file called .env.local in its root directory with the following contents:
```bash
# url to our local mongodb
MONGODB_URL="mongodb://mongo-chatui:27017"

# we don't need authorization for our purposes
REJECT_UNAUTHORIZED=false

# insert your favorite color here
PUBLIC_APP_COLOR=blue

# overwrite the standard model card with the model we serve via tgi
# be sure to edit the 'endpoints' field!
MODELS=`[{"name":"TheBloke/WizardCoder-Python-13B-V1.0-GPTQ", "endpoints":[{"url":"http://text-generation:/generate_stream"}], "description":"Programming Assistant", "userMessageToken":"\n\nHuman: ", "assistantMessageToken":"\n\nAssistant:", "preprompt": "You are a helpful, respectful and honest assistant. Below is an instruction that describes a task. Write a response that appropriately completes the request.", "chatPromptTemplate": "{{preprompt}}\n\n### Instruction:\n{{#each messages}}\n {{#ifUser}}{{@root.userMessageToken}}{{content}}{{@root.userMessageEndToken}}{{/ifUser}}\n {{#ifAssistant}}{{@root.assistantMessageToken}}{{content}}{{@root.assistantMessageEndToken}}{{/ifAssistant}}\n{{/each}}\n{{assistantMessageToken}}\n\n### Response:", "promptExamples":[{"title":"Code a snake game","prompt":"Code a basic snake game in python, give explanations for each step."}], "parameters":{"temperature":0.1,"top_p":0.9,"repetition_penalty":1.2,"top_k":50,"truncate":1000,"max_new_tokens":1024}}]`
```
There is still a lot of room for improvement especially in the chatPromptTemplate section. See here for further information.
Unfortunately, no prebuilt Docker image exists for Chat UI. Thus, we have to build the image ourselves. The .env and .env.local files are needed at build-time, so be sure to have them ready. Run the following command in the root directory of the Chat UI repository:
```bash
sudo docker build . -t chat-ui:latest
```
Next, create a new folder and, inside it, a new docker-compose.yml file with the following contents. It is important that the .env file from Chat UI is not in the same folder hierarchy as this docker-compose.yml (hence the new folder): otherwise, Docker Compose will try to parse and use that .env file, which leads to parsing errors due to the JSON string formatting. We do not need the .env file and its contents at runtime anyway.
```yaml
version: '3.8'

services:
  # The frontend
  chat-ui:
    image: chat-ui
    ports:
      - "3000:3000"
    environment:
      - MONGODB_URL=mongodb://mongo-chatui:27017
    container_name: chatui
    restart: always # Ensuring the service always restarts on failure

  # The database where the history and context are going to be stored
  mongo-chatui:
    image: mongo:latest
    ports:
      - "27017:27017"
    container_name: mongo-chatui
    restart: always # Ensuring the service always restarts on failure
```
Now, we can test-drive Chat UI. To do so, type sudo docker compose up -d in the directory of the docker-compose.yml (as before with TGI) and be sure to also keep an eye on the logs via sudo docker container logs chatui --follow. If all works as expected, you should be able to access the UI on port 3000!
Putting Everything Together
Of course, it is also possible to use a single, combined docker-compose file if you are willing to host the backend, frontend and database on the same machine. Copy the data folder from earlier so the models do not need to be re-downloaded. You might also have to remove the old Chat UI and database containers using sudo docker container remove chatui mongo-chatui.
```yaml
version: '3.8'

services:
  # Text Generation Inference backend
  text-generation:
    image: ghcr.io/huggingface/text-generation-inference:1.0.3
    environment:
      HUGGING_FACE_HUB_TOKEN: ${HUGGING_FACE_HUB_TOKEN}
    ports:
      - "8080:80"
    volumes:
      - ./data:/data
    command:
      - "--model-id"
      - "${MODEL_ID:-TheBloke/WizardCoder-Python-13B-V1.0-GPTQ}"
      - "--quantize"
      - "${QUANTIZE:-gptq}"
      - "--max-batch-prefill-tokens=${MAX_BATCH_PREFILL_TOKENS:-2048}"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    container_name: text-generation
    restart: always # Ensuring the service always restarts on failure

  # The frontend
  chat-ui:
    image: chat-ui
    ports:
      - "3000:3000"
    environment:
      - MONGODB_URL=mongodb://mongo-chatui:27017
    container_name: chatui
    restart: always # Ensuring the service always restarts on failure

  # The database where the history and context are going to be stored
  mongo-chatui:
    image: mongo:latest
    ports:
      - "27017:27017"
    container_name: mongo-chatui
    restart: always # Ensuring the service always restarts on failure
```
Do not forget to change the endpoints parameter in the MODELS variable of Chat UI's .env.local to "endpoints":[{"url":"http://text-generation:/generate_stream"}], since we can now conveniently use the container address of the shared Docker network. Remember, you have to re-build the image after adapting the .env.local file.
Great! Now you can start the backend, the frontend and the database with a single sudo docker compose up -d.
Bonus: Adding HTTPS
Up to this point, the API and UI are served via plain HTTP only. It is therefore advisable to secure our traffic with HTTPS, with the help of a reverse proxy like nginx. Without HTTPS, you will also not be able to access the UI from destinations other than localhost.
Create a new directory called nginx and, inside of it, a new file nginx.conf. The specific settings depend on which registrar you are using for your domain – or on your local setup, in case you only want to make the service available to your local network.
This nginx.conf template can serve as a starting point:
```nginx
events {
    worker_connections 1024;
}

http {
    server_tokens off;
    charset utf-8;

    server {
        listen 80 default_server;
        listen [::]:80 default_server;

        location /nginx_status {
            stub_status on;
        }
    }

    # Frontend
    server {
        listen 443 ssl http2;
        listen [::]:443 ssl http2;

        server_name your.local.address.io;
        client_max_body_size 15G;
        ...

        # reverse proxy
        location / {
            proxy_pass http://chat-ui:3000;
            ...
        }
    }

    # Serving backend
    server {
        listen 443 ssl http2;
        listen [::]:443 ssl http2;

        server_name api.your.local.address.io;
        client_max_body_size 15G;
        ...

        # reverse proxy
        location / {
            proxy_pass http://text-generation:80;
            ...
        }
    }

    # HTTP redirect
    server {
        listen 80;
        listen [::]:80;

        server_name .your.local.address.io;

        return 301 https://your.local.address.io$request_uri;
    }
}
```
You also need to add the nginx service to your existing docker-compose.yml.
```yaml
version: '3.8'

services:
  ...

  # The reverse proxy
  nginx:
    container_name: nginx
    restart: unless-stopped
    image: nginx
    ports:
      - 80:80
      - 443:443
    volumes:
      - ./nginx/nginx.conf:/etc/nginx/nginx.conf
      - ./certificates:/certificates
```
Now you only need to generate the certificates, save them in the certificates folder and restart everything.
This is it!
Good job! You now have all the components needed to self-host your very own Code Assistant. Thanks to the awesome people at Hugging Face, it is easier than ever, and maybe you even learned a thing or two along the way. Before you put it into production, though, you may want to do a final load test, e.g. via Locust. Doing so gives you an understanding of how many users can use the service at the same time. For this you will need to write a small locustfile.py – and for that you could kindly ask WizardCoder to help you out 🧙♀️.
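As a starting point, a minimal locustfile sketch for hammering the TGI backend could look something like this (the host, endpoint and payload are assumptions based on the setup above; adjust them to your deployment):

```python
# locustfile.py – minimal load-test sketch for the TGI backend.
# Host, endpoint and payload are assumptions based on the setup above.
from locust import HttpUser, task, between


class CodeAssistantUser(HttpUser):
    # simulated users wait between 1 and 5 seconds between requests
    wait_time = between(1, 5)

    @task
    def generate(self):
        self.client.post(
            "/generate",
            json={
                "inputs": "write a python function that reverses a string",
                "parameters": {"max_new_tokens": 200},
            },
        )
```

Run it with locust -f locustfile.py --host http://localhost:8080 and use Locust's web UI to ramp up the number of simulated users until response times become unacceptable.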