Private RAG with Lingo, Verba and Weaviate

In this post you will set up a Retrieval-Augmented Generation (RAG) stack on top of Kubernetes. You will deploy the Verba application from Weaviate and wire it up to Lingo, which provides local LLM inference and embedding servers. The result is full control over your data and models.

[Figure: private RAG architecture]

You will use the following components:

  • Verba as the RAG application
  • Weaviate as the Vector DB
  • Lingo as the model proxy and autoscaler
  • Mistral-7B-Instruct-v0.2 as the LLM
  • STAPI with MiniLM-L6-v2 as the embedding model

K8s cluster creation (optional)

You can skip this step if you already have a K8s cluster with GPU nodes.

Create a GKE cluster with a CPU nodepool and a 1 x L4 GPU nodepool:

bash <(curl -s https://raw.githubusercontent.com/substratusai/lingo/main/deploy/create-gke-cluster.sh)

Make sure you review the script before executing!

NOTE: Even though this script is for GCP, the components here will work on any Kubernetes cluster (AWS, Azure, etc). Reach out on discord if you get stuck!

Installation

Now let's use Helm to install the components on your K8s cluster.

Add the required Helm repos:

helm repo add weaviate https://weaviate.github.io/weaviate-helm
helm repo add substratusai https://substratusai.github.io/helm
helm repo update

Deploy Mistral 7B Instruct v0.2

Create a file named mistral-v02-values.yaml with the following content:

model: mistralai/Mistral-7B-Instruct-v0.2
replicaCount: 1
# Needed to fit in 24GB GPU memory
maxModelLen: 15376
servedModelName: mistral-7b-instruct-v0.2
chatTemplate: /chat-templates/mistral.jinja
env:
- name: HF_TOKEN
  value: ${HF_TOKEN}
resources:
  limits:
    nvidia.com/gpu: 1
deploymentAnnotations:
  lingo.substratus.ai/models: mistral-7b-instruct-v0.2
  lingo.substratus.ai/min-replicas: "1" # needs to be string
  lingo.substratus.ai/max-replicas: "3" # needs to be string

Export your HuggingFace token (required to pull Mistral):

export HF_TOKEN=replaceMe!

Install Mistral 7B Instruct v0.2 using the token you exported in the previous step:

envsubst < mistral-v02-values.yaml | helm upgrade --install mistral-7b-instruct-v02 substratusai/vllm -f -

Installing Mistral can take a few minutes. Kubernetes will first try to scale up the GPU nodepool and then the model will be downloaded and loaded into memory.

Use the following commands to view the logs:

kubectl get pods -l app.kubernetes.io/instance=mistral-7b-instruct-v02 -w
kubectl logs -l app.kubernetes.io/instance=mistral-7b-instruct-v02

IMPORTANT: You will be paying for GPU usage while the model is running, because min-replicas is set to 1. Make sure to uninstall the Helm release (mistral-7b-instruct-v02) when you are done using the model!

Deploying Embedding Model Server

We are going to deploy STAPI - Sentence Transformers API, an embedding model server with an OpenAI compatible endpoint.

Create a file called stapi-values.yaml with the following content:

deploymentAnnotations:
  lingo.substratus.ai/models: text-embedding-ada-002
  lingo.substratus.ai/min-replicas: "1" # needs to be string
model: all-MiniLM-L6-v2
replicaCount: 0

Install STAPI using the values file you created:

helm upgrade --install stapi-minilm-l6-v2 substratusai/stapi -f stapi-values.yaml

Deploy Lingo

Lingo provides a unified endpoint for both the LLM and the embedding model. It proxies requests and autoscales the models based on load. Think of it as an OpenAI drop-in replacement for running inference locally.

Install Lingo with Helm:

helm upgrade --install lingo substratusai/lingo

You can reach Lingo from your local machine by starting a port-forward:

kubectl port-forward svc/lingo 8080:80

Test out the embedding server with the following curl command:

curl http://localhost:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "input": "Lingo rocks!",
    "model": "text-embedding-ada-002"
  }'

Mistral can be called via the "completions" API:

curl http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "mistral-7b-instruct-v0.2", "prompt": "<s>[INST]Who was the first president of the United States?[/INST]", "max_tokens": 40}'
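
Since Lingo exposes an OpenAI-compatible API, you can also point an OpenAI client library at it. Below is a minimal sketch using the openai Python package (v1.x, which you would need to install separately); it assumes the port-forward above is still running on localhost:8080, and the API key can be any string because Lingo ignores it.

from openai import OpenAI

# Point the client at Lingo instead of api.openai.com.
# The API key is required by the client library but ignored by Lingo.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="ignored-by-lingo")

# Embeddings are served by STAPI behind the model name text-embedding-ada-002.
emb = client.embeddings.create(model="text-embedding-ada-002", input="Lingo rocks!")
print(len(emb.data[0].embedding))  # dimensionality of the embedding vector

# Completions are served by the vLLM deployment behind mistral-7b-instruct-v0.2.
resp = client.completions.create(
    model="mistral-7b-instruct-v0.2",
    prompt="<s>[INST]Who was the first president of the United States?[/INST]",
    max_tokens=40,
)
print(resp.choices[0].text)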

Deploy Weaviate

Let's deploy Weaviate with the OpenAI text2vec module and a single replica.

Create a file called weaviate-values.yaml with the following content:

modules:
  text2vec-openai:
    enabled: true
    apiKey: 'thiswillbeignoredbylingo'

service:
  type: ClusterIP

Install Weaviate using the values file you created:

helm upgrade --install weaviate weaviate/weaviate -f weaviate-values.yaml

Deploying Verba

Verba is the RAG application that utilizes Lingo and Weaviate.

Create a file called verba-values.yaml with the following content:

env:
- name: OPENAI_MODEL
  value: mistral-7b-instruct-v0.2
- name: OPENAI_API_KEY
  value: ignored-by-lingo
- name: OPENAI_BASE_URL
  value: http://lingo/v1
- name: WEAVIATE_URL_VERBA
  value: http://weaviate:80

We set the OPENAI_BASE_URL to the Lingo endpoint and WEAVIATE_URL_VERBA to the Weaviate endpoint. This will configure Verba to call Lingo instead of OpenAI, keeping all data local to the cluster.

Install Verba from Weaviate using the values file you created:

helm upgrade --install verba substratusai/verba -f verba-values.yaml

Usage

Now that everything is deployed, you can try using Verba.

The easiest way to access Verba is through a port-forward (don't forget to terminate the previous port-forward command first):

kubectl port-forward service/verba 8080:80

Now go to http://localhost:8080 in your browser. Try adding a document and asking some relevant questions.

For example, download a PDF document about Nasoni Smart Faucet here.

Upload the document inside Verba and ask questions like:

  • How did they test the Nasoni Smart Faucet?
  • What's a Nasoni Smart Faucet?

Conclusion

You now have a fully private RAG setup with Weaviate and Lingo. This allows you to keep your data and models private and under your control. No more expensive LLM calls or OpenAI rate limits. πŸš€

Like what you saw? Give Lingo a star on GitHub ⭐

Deploying Mixtral on GKE with 2 x L4 GPUs

A100 and H100 GPUs are hard to get. They are also expensive. What if you could run Mixtral on just 2 x L4 24GB GPUs? The L4 GPUs are more attainable today (Feb 10, 2024) and are also cheaper. Learn how to easily deploy Mixtral on GKE with 2 x L4 GPUs in this blog post.

[Figure: Mixtral on GKE with 2 x L4 GPUs]

How much GPU memory is needed? Will it fit on 2 x L4?
For this post, we're using GPTQ quantization to load the model parameters in 4 bit. The estimated GPU memory when using 4bit GPTQ quantization would be:

M = \dfrac{7 \times 8 \times 10^9 \times 4\ \mathrm{bytes}}{32 / 4} \times 1.2 = 33.6\ \mathrm{GB}

A single L4 GPU has 24GB of GPU memory, so 2 x L4 gives you 48GB, which is more than enough to serve the Mixtral 8 x 7 billion parameter model. You can read the Calculating GPU memory for serving LLMs blog post for more information.
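
To make the arithmetic explicit, here is the same estimate as a small Python calculation (purely a worked example of the formula above):

# GPU memory estimate for Mixtral 8x7B with 4-bit GPTQ quantization
params = 8 * 7e9          # ~56 billion parameters
bytes_per_param = 4       # parameters are 4 bytes (32 bit) before quantization
quant_bits = 4            # GPTQ 4-bit
overhead = 1.2            # ~20% overhead for additional things in GPU memory

mem_gb = params * bytes_per_param / (32 / quant_bits) * overhead / 1e9
print(f"{mem_gb:.1f} GB")  # 33.6 GB, which fits in 2 x L4 (48 GB total)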

The post is structured in the following sections:

  1. Creating a GKE cluster with a spot L4 GPU node pool
  2. Downloading the model into a ReadOnlyMany PVC
  3. Deploying Mixtral GPTQ using Helm and vLLM
  4. Trying out some prompts with Mixtral

Creating the GKE cluster

Create a cluster with a CPU nodepool for system services:

export CLUSTER_LOCATION=us-central1
export PROJECT_ID=$(gcloud config get-value project)
export CLUSTER_NAME=substratus
export NODEPOOL_ZONE=us-central1-a
gcloud container clusters create ${CLUSTER_NAME} --location ${CLUSTER_LOCATION} \
  --machine-type e2-medium --num-nodes 1 --min-nodes 1 --max-nodes 5 \
  --autoscaling-profile optimize-utilization --enable-autoscaling \
  --node-locations ${NODEPOOL_ZONE} --workload-pool ${PROJECT_ID}.svc.id.goog \
  --enable-image-streaming --enable-shielded-nodes --shielded-secure-boot \
  --shielded-integrity-monitoring \
  --addons GcsFuseCsiDriver

Create a GPU nodepool where each VM has 2 x L4 GPUs and uses Spot pricing:

gcloud container node-pools create g2-standard-24 \
  --accelerator type=nvidia-l4,count=2,gpu-driver-version=latest \
  --machine-type g2-standard-24 --ephemeral-storage-local-ssd=count=2 \
  --spot --enable-autoscaling --enable-image-streaming \
  --num-nodes=0 --min-nodes=0 --max-nodes=3 --cluster ${CLUSTER_NAME} \
  --node-locations "${NODEPOOL_ZONE}" --location ${CLUSTER_LOCATION}

Downloading the model into a ReadOnlyMany PVC

Downloading an 8 x 7 billion parameter model every time you launch an inference server takes a long time and is expensive in egress costs. Instead, we'll download the model to a Persistent Volume Claim (PVC). That PVC is then mounted ReadOnlyMany across all the Mixtral serving instances.

Create a file named pvc.yaml with the following content:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mixtral-8x7b-instruct-gptq
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 30Gi

Create the PVC to store the model weights on:

kubectl apply -f pvc.yaml

The following Job downloads the model to the mixtral-8x7b-instruct-gptq PVC using the HuggingFace Hub library. The model is downloaded to the /model directory in the PVC. The revision parameter is set to gptq-4bit-32g-actorder_True to download the 4-bit GPTQ-quantized weights.

Create a file named load-model-job.yaml with the following content:

apiVersion: batch/v1
kind: Job
metadata:
  name: load-model-job-mixtral-8x7b-instruct-gptq
spec:
  template:
    spec:
      volumes:
        - name: model
          persistentVolumeClaim:
            claimName: mixtral-8x7b-instruct-gptq
      containers:
      - name: model-loader
        image: python:3.11
        volumeMounts:
        - mountPath: /model
          name: model
        command:
        - /bin/bash
        - -c
        - |
          pip install huggingface_hub
          python3 - << EOF
          from huggingface_hub import snapshot_download
          model_id="TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ"
          snapshot_download(repo_id=model_id, local_dir="/model", cache_dir="/model",
                            local_dir_use_symlinks=False,
                            revision="gptq-4bit-32g-actorder_True")
          EOF
      restartPolicy: Never

Launch the load model Job:

kubectl apply -f load-model-job.yaml

You can watch the progress of the Job using the following command:

kubectl logs -f job/load-model-job-mixtral-8x7b-instruct-gptq

After a few minutes, the model will be downloaded to the PVC.

Deploying Mixtral using Helm

We maintain a Helm chart for vLLM, available on GitHub at substratusai/helm. We've also published a container image for vLLM that is configured through environment variables; it is available on GitHub at substratusai/vllm-docker.

Install the Helm repo:

helm repo add substratusai https://substratusai.github.io/helm
helm repo update

Create a file named values.yaml with the following content:

model: /model
servedModelName: mixtral-8x7b-instruct-gptq
readManyPVC:
  enabled: true
  sourcePVC: "mixtral-8x7b-instruct-gptq"
  mountPath: /model
  size: 30Gi

quantization: gptq
dtype: half
maxModelLen: 8192
gpuMemoryUtilization: "0.8"

resources:
  limits:
    nvidia.com/gpu: 2

nodeSelector:
  cloud.google.com/gke-accelerator: nvidia-l4

replicaCount: 1

Notice that we're specifying a readManyPVC with sourcePVC set to mixtral-8x7b-instruct-gptq, the PVC we created in the previous step. The mountPath is set to /model, and the model parameter for vLLM points to that local path. The quantization parameter is set to gptq, and dtype is set to half, which is required for GPTQ quantization. The gpuMemoryUtilization is set to 0.8, because otherwise you will get an out-of-GPU-memory error. The replicaCount is set to 1, so the deployment starts with one pod as soon as you install the Helm chart.

Looking for an autoscaling Mixtral deployment that supports scale from 0? Take a look at Lingo: ML Proxy and autoscaler for K8s.

Install the Helm chart:

helm install mixtral-8x7b-instruct-gptq substratusai/vllm -f values.yaml

After a while you can check whether pods are running:

kubectl get pods

Once the pods are running, check logs of the deployment pods:

kubectl logs -f deployment/mixtral-8x7b-instruct-gptq

Send some prompts to Mixtral

The Helm chart created a K8s Service of type ClusterIP named mixtral-8x7b-instruct-gptq-vllm. By default the Service is only accessible from within the cluster, so use kubectl port-forward to forward it to your local machine.

kubectl port-forward service/mixtral-8x7b-instruct-gptq-vllm 8080:80

Send a prompt to the Mixtral model using the following command:

curl http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "mixtral-8x7b-instruct-gptq", "prompt": "<s>[INST]Who was the first president of the United States?[/INST]", "max_tokens": 40}'
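
Because vLLM exposes an OpenAI-compatible API, you can make the same request from Python. A minimal sketch using the openai package, assuming the port-forward above is running:

from openai import OpenAI

# vLLM ignores the API key; the base URL is the port-forwarded Service.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

resp = client.completions.create(
    model="mixtral-8x7b-instruct-gptq",
    prompt="<s>[INST]Who was the first president of the United States?[/INST]",
    max_tokens=40,
)
print(resp.choices[0].text)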

Got more questions? Don't hesitate to join our Discord and ask away.


Calculating GPU memory for serving LLMs

How many GPUs do I need to be able to serve Llama 70B? In order to answer that, you need to know how much GPU memory will be required by the Large Language Model.

The formula is simple:

M = \dfrac{P \times 4\mathrm{B}}{32 / Q} \times 1.2

Symbol  Description
M       GPU memory, expressed in gigabytes (GB)
P       The number of parameters in the model, e.g. a 7B model has 7 billion parameters
4B      4 bytes, the number of bytes used per parameter
32      There are 32 bits in 4 bytes
Q       The number of bits used for loading the model, e.g. 16, 8 or 4 bits
1.2     Represents a 20% overhead of loading additional things in GPU memory

Now let's try out some examples.

GPU memory required for serving Llama 70B

Let's try it out for Llama 70B that we will load in 16 bit. The model has 70 billion parameters.

\dfrac{70 \times 4\ \mathrm{bytes}}{32 / 16} \times 1.2 = 168\ \mathrm{GB}

That's quite a lot of memory. A single A100 80GB wouldn't be enough, although 2x A100 80GB should be enough to serve the Llama 2 70B model in 16 bit mode.

How to further reduce GPU memory required for Llama 2 70B?

Quantization is a method to reduce the memory footprint. Quantization is able to do this by reducing the precision of the model's parameters from floating-point to lower-bit representations, such as 8-bit integers. This process significantly decreases the memory and computational requirements, enabling more efficient deployment of the model, particularly on devices with limited resources. However, it requires careful management to maintain the model's performance, as reducing precision can potentially impact the accuracy of the outputs.

In general, the consensus seems to be that 8 bit quantization achieves similar performance to using 16 bit. However, 4 bit quantization could have a noticeable impact to the model performance.

Let's do another example where we use 4 bit quantization of Llama 2 70B:

\dfrac{70 \times 4\ \mathrm{bytes}}{32 / 4} \times 1.2 = 42\ \mathrm{GB}

This is something you could run on 2 x L4 24GB GPUs.
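
If you want to try other models and precisions, the formula is easy to wrap in a small helper. A minimal sketch in Python (the 1.2 factor is just the 20% rule of thumb from the table above):

def gpu_memory_gb(params_billion: float, quant_bits: int, overhead: float = 1.2) -> float:
    """Estimate the GPU memory (in GB) needed to serve a model."""
    bytes_per_param = 4  # parameters are stored as 32-bit floats before quantization
    return params_billion * bytes_per_param / (32 / quant_bits) * overhead

print(gpu_memory_gb(70, 16))  # Llama 2 70B in 16 bit -> 168.0
print(gpu_memory_gb(70, 4))   # Llama 2 70B in 4 bit  -> 42.0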

Relevant tools and resources

  1. Tool for checking how many GPUs you need for a specific model
  2. Transformer Math 101

Got more questions? Don't hesitate to join our Discord and ask away.


Deploying Mistral 7B Instruct on K8s using TGI

[Figure: Mistral 7B on K8s with Helm]

Learn how to use the text-generation-inference (TGI) Helm Chart to quickly deploy Mistral 7B Instruct on your K8s cluster.

Add the Substratus.ai Helm repo:

helm repo add substratusai https://substratusai.github.io/helm

This command adds a new Helm repository, making the text-generation-inference Helm chart available for installation.

Create a configuration file named values.yaml. This file will contain the necessary settings for your deployment. Here’s an example of what the content should look like:

model: mistralai/Mistral-7B-Instruct-v0.1
# resources: # optional, override if you need more than 1 GPU
#   limits:
#     nvidia.com/gpu: 1
# nodeSelector: # optional, can be used to target specific GPUs
#   cloud.google.com/gke-accelerator: nvidia-l4

In this configuration file, you are specifying the model to be deployed and optionally setting resource limits or targeting specific nodes based on your requirements.

With your configuration file ready, you can now deploy Mistral 7B Instruct using Helm:

helm install mistral-7b-instruct substratusai/text-generation-inference \
    -f values.yaml

This command initiates the deployment, creating a Kubernetes Deployment and Service based on the settings defined in your values.yaml file.

After initiating the deployment, it's important to ensure that everything is running as expected. Run the following command to get detailed information about the newly created pod:

kubectl describe pod -l app.kubernetes.io/instance=mistral-7b-instruct

This will display various details about the pod, helping you to confirm that it has been successfully created and is in the right state. Note that depending on your cluster's setup, you might need to wait for the cluster autoscaler to provision additional resources if necessary.

Once the pod is running, check the logs to ensure that the model is initializing properly:

kubectl logs -f -l app.kubernetes.io/instance=mistral-7b-instruct

The server first downloads the model weights, and after a few minutes you should see a message that looks like this:

Invalid hostname, defaulting to 0.0.0.0

This is expected and means it's now serving on host 0.0.0.0.

By default, the model is only accessible within the Kubernetes cluster. To access it from your local machine, set up a port forward:

kubectl port-forward deployments/mistral-7b-instruct-text-generation-inference 8080:8080

This command maps port 8080 on your local machine to port 8080 on the deployed pod, allowing you to interact with the model directly.

With the service exposed, you can now run inference tasks. To explore the available API endpoints and their usage, visit the TGI API documentation at http://localhost:8080/docs.

Here’s an example of how to use curl to run an inference task:

curl 127.0.0.1:8080/generate -X POST \
    -H 'Content-Type: application/json' \
    --data-binary @- << 'EOF' | jq -r '.generated_text'
{
    "inputs": "<s>[INST] Write a K8s YAML file to create a pod that deploys nginx[/INST]",
    "parameters": {"max_new_tokens": 400}
}
EOF

In this example, we are instructing the model to generate a Kubernetes YAML file for deploying an Nginx pod. The prompt includes specific tokens that the Mistral 7B Instruct model recognizes, ensuring accurate and context-aware responses.

The prompt starts with the <s> token, which indicates the beginning of a sequence. The [INST] token tells Mistral 7B Instruct that what follows is an instruction. The Mistral 7B Instruct model was fine-tuned with this prompt template, so it's important to reuse the same template.

The response is quite impressive: it returned a valid K8s YAML manifest along with instructions on how to apply it.
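
If you prefer Python over curl, the same request against TGI's /generate endpoint looks roughly like this (a sketch using the requests library, assuming the port-forward above is still active):

import requests

payload = {
    "inputs": "<s>[INST] Write a K8s YAML file to create a pod that deploys nginx[/INST]",
    "parameters": {"max_new_tokens": 400},
}
resp = requests.post("http://127.0.0.1:8080/generate", json=payload, timeout=120)
resp.raise_for_status()
print(resp.json()["generated_text"])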

Need help? Want to see other models? other serving frameworks?
Join our Discord and ask me directly:

The K8s YAML dataset


Excited to announce the K8s YAML dataset containing 276,520 valid K8s YAML files.

HuggingFace Dataset: https://huggingface.co/datasets/substratusai/the-stack-yaml-k8s
Source code: https://github.com/substratusai/the-stack-yaml-k8s

Why?

  • This dataset can be used to fine-tune an LLM directly
  • New datasets can be created from this dataset, such as a K8s instruct dataset (coming soon!)
  • What's your use case?

How?

Getting a lot of K8s YAML manifests wasn't easy. My initial approach was to scrape the YAML example files from the Kubernetes website, but that only yielded about 250 examples.

Luckily, I came across the-stack dataset which is a cleaned dataset of code on GitHub. The dataset is nicely structured by language and I noticed that yaml was one of the languages in the dataset.

Install libraries used in this blog post:

pip3 install datasets kubernetes-validate

Let's load the the-stack dataset but only the YAML files (takes about 200GB of disk space):

from datasets import load_dataset
ds = load_dataset("bigcode/the-stack", data_dir="data/yaml", split="train")

Once loaded there are 13,439,939 YAML files in ds.

You can check the content of one of the files:

print(ds[0]["content"])

You probably noticed that this ain't a K8s YAML file, so next we need to filter these 13 million YAML files and keep only the ones that contain valid K8s YAML.

The approach I took was to use the kubernetes-validate OSS library. It turned out that YAML parsing was too slow, so I added a 10x speed improvement by returning early when neither "Kind" nor "kind" appears as a substring in the YAML file.

Here is the validate function that takes the yaml_content as a string and returns if the content was valid K8s YAML or not:

import kubernetes_validate
import yaml

def validate(yaml_content: str):
    try:
        # Speed optimization to return early without having to load YAML
        if "kind" not in yaml_content and "Kind" not in yaml_content:
            return False
        data = yaml.safe_load(yaml_content)
        kubernetes_validate.validate(data, '1.22', strict=True)
        return True
    except Exception as e:
        return False

validate(ds[0]["content"])

Now all that's needed is to filter out all YAML files that aren't valid:

import os
os.cpu_count()
valid_k8s = ds.filter(lambda batch: [validate(x) for x in batch["content"]],
                      num_proc=os.cpu_count(), batched=True)

There were 276,520 YAML files left in valid_k8s. You can print one again to see:

print(valid_k8s[0]["content"])

You can upload the dataset back to HuggingFace by running:

valid_k8s.push_to_hub("substratusai/the-stack-yaml-k8s")

What's next?

Creating a new dataset called K8s Instruct that also provides a prompt for each YAML file.

Tutorial: K8s Kind with GPUs

Don't you just love it when you submit a PR and it turns out that no code is needed? That's exactly what happened when I tried to add GPU support to Kind.

In this blog post you will learn how to configure Kind such that it can use the GPUs on your device. Credit to @klueska for the solution.

Install the NVIDIA container toolkit by following the official install docs.

Configure NVIDIA to be the default runtime for docker:

sudo nvidia-ctk runtime configure --runtime=docker --set-as-default
sudo systemctl restart docker

Set accept-nvidia-visible-devices-as-volume-mounts = true in /etc/nvidia-container-runtime/config.toml:

sudo sed -i '/accept-nvidia-visible-devices-as-volume-mounts/c\accept-nvidia-visible-devices-as-volume-mounts = true' /etc/nvidia-container-runtime/config.toml

Create a Kind Cluster:

kind create cluster --name substratus --config - <<EOF
apiVersion: kind.x-k8s.io/v1alpha4
kind: Cluster
nodes:
- role: control-plane
  image: kindest/node:v1.27.3@sha256:3966ac761ae0136263ffdb6cfd4db23ef8a83cba8a463690e98317add2c9ba72
  # required for GPU workaround
  extraMounts:
    - hostPath: /dev/null
      containerPath: /var/run/nvidia-container-devices/all
EOF

Workaround for issue with missing required file /sbin/ldconfig.real:

# https://github.com/NVIDIA/nvidia-docker/issues/614#issuecomment-423991632
docker exec -ti substratus-control-plane ln -s /sbin/ldconfig /sbin/ldconfig.real

Install the K8s NVIDIA GPU operator so K8s is aware of your NVIDIA device:

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia || true
helm repo update
helm install --wait --generate-name \
     -n gpu-operator --create-namespace \
     nvidia/gpu-operator --set driver.enabled=false

You should now have a working Kind cluster that can access your GPU. Verify it by running a simple pod:

kubectl apply -f - << EOF
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vectoradd
    image: "nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04"
    resources:
      limits:
        nvidia.com/gpu: 1
EOF

Converting HuggingFace Models to GGUF/GGML

Llama.cpp is a great way to run LLMs efficiently on CPUs and GPUs. The downside, however, is that you need to convert models to the format Llama.cpp supports, which is now the GGUF file format. In this blog post you will learn how to convert a HuggingFace model (Vicuna 13B v1.5) to a GGUF model.

At the time of writing, Llama.cpp supports the following models:

  • LLaMA πŸ¦™
  • LLaMA 2 πŸ¦™πŸ¦™
  • Falcon
  • Alpaca
  • GPT4All
  • Chinese LLaMA / Alpaca and Chinese LLaMA-2 / Alpaca-2
  • Vigogne (French)
  • Vicuna
  • Koala
  • OpenBuddy 🐢 (Multilingual)
  • Pygmalion 7B / Metharme 7B
  • WizardLM
  • Baichuan-7B and its derivations (such as baichuan-7b-sft)
  • Aquila-7B / AquilaChat-7B

At a high-level you will be going through the following steps:

  • Downloading a HuggingFace model
  • Running llama.cpp convert.py on the HuggingFace model
  • (Optionally) Uploading the model back to HuggingFace

Downloading a HuggingFace model

There are various ways to download models, but in my experience the huggingface_hub library has been the most reliable. The git clone method occasionally results in OOM errors for large models.

Install the huggingface_hub library:

pip install huggingface_hub

Create a Python script named download.py with the following content:

from huggingface_hub import snapshot_download
model_id="lmsys/vicuna-13b-v1.5"
snapshot_download(repo_id=model_id, local_dir="vicuna-hf",
                  local_dir_use_symlinks=False, revision="main")

Run the Python script:

python download.py

You should now have the model downloaded to a directory called vicuna-hf. Verify by running:

ls -lash vicuna-hf

Converting the model

Now it's time to convert the downloaded HuggingFace model to a GGUF model. Llama.cpp comes with a converter script to do this.

Get the script by cloning the llama.cpp repo:

git clone https://github.com/ggerganov/llama.cpp.git

Install the required python libraries:

pip install -r llama.cpp/requirements.txt

Verify the script is there and understand the various options:

python llama.cpp/convert.py -h

Convert the HF model to GGUF model:

python llama.cpp/convert.py vicuna-hf \
  --outfile vicuna-13b-v1.5.gguf \
  --outtype q8_0

In this case we're also quantizing the model to 8 bit by setting --outtype q8_0. Quantizing helps improve inference speed, but it can negatively impact quality. You can use --outtype f16 (16 bit) or --outtype f32 (32 bit) to preserve original quality.

Verify the GGUF model was created:

ls -lash vicuna-13b-v1.5.gguf
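
As a quick sanity check before uploading, you can load the converted file with the llama-cpp-python bindings and run a short prompt. A minimal sketch, assuming you've installed the bindings with pip install llama-cpp-python:

from llama_cpp import Llama

# Load the freshly converted GGUF model (CPU-only by default).
llm = Llama(model_path="vicuna-13b-v1.5.gguf", n_ctx=2048)

out = llm("Q: Name the planets in the solar system. A:", max_tokens=64, stop=["Q:"])
print(out["choices"][0]["text"])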

Pushing the GGUF model to HuggingFace

You can optionally push back the GGUF model to HuggingFace.

Create a Python script with the filename upload.py that has the following content:

from huggingface_hub import HfApi
api = HfApi()

model_id = "substratusai/vicuna-13b-v1.5-gguf"
api.create_repo(model_id, exist_ok=True, repo_type="model")
api.upload_file(
    path_or_fileobj="vicuna-13b-v1.5.gguf",
    path_in_repo="vicuna-13b-v1.5.gguf",
    repo_id=model_id,
)

Get a HuggingFace Token that has write permission from here: https://huggingface.co/settings/tokens

Set your HuggingFace token:

export HUGGING_FACE_HUB_TOKEN=<paste-your-own-token>

Run the upload.py script:

python upload.py

A Kind Local Llama on K8s


A Llama 13B parameter model running on a laptop with a mere RTX 2060?! Yes, it all ran surprisingly well at around 7 tokens / sec. Follow along and learn how to do this on your environment.

My laptop setup looks like this:

  • Kind for deploying a single node K8s cluster
  • AMD Ryzen 7 (8 threads), 16 GB system memory, RTX 2060 (6GB GPU memory)
  • Llama.cpp/GGML for fast serving and loading larger models on consumer hardware

You might be wondering: how can a model with 13 billion parameters fit into a 6GB GPU? Even in 4-bit mode the weights alone need about 13 billion * 4 bytes / (32 bits / 4 bits) = 6.5GB, which already exceeds the 6GB of GPU memory. But thanks to Llama.cpp, we can load only part of the model into the GPU. Plus, Llama.cpp can run efficiently using just the CPU.
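
As a back-of-the-envelope check on why partial offloading helps, here is the rough math in Python. This is just an illustration; it assumes the weights are spread roughly evenly across the model's 42 layers (the layer count mentioned further down):

# Rough estimate of how much of a 4-bit Llama 2 13B fits on a 6GB GPU
params = 13e9                               # 13 billion parameters
weights_gb = params * 4 / (32 / 4) / 1e9    # ~6.5 GB of weights at 4-bit
layers = 42                                 # transformer layers in the 13B model
gpu_layers = 30                             # layers offloaded to the GPU (n_gpu_layers)

gpu_weights_gb = weights_gb * gpu_layers / layers
print(f"~{gpu_weights_gb:.1f} GB of weights on the GPU")  # ~4.6 GB, leaving headroom under 6 GB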

Want to try this out yourself? Follow along for a fun ride.

Create Kind K8s cluster with GPU support

Install the NVIDIA container toolkit for Docker: Install Guide

Use the convenience script to create a Kind cluster and configure GPU support:

bash <(curl https://raw.githubusercontent.com/substratusai/substratus/main/install/kind/up-gpu.sh)

Or inspect the script and run the steps one by one.

Install Substratus

Install the Substratus K8s operator which will orchestrate model loading and serving:

kubectl apply -f https://raw.githubusercontent.com/substratusai/substratus/main/install/kind/manifests.yaml

Load the Llama 2 13b chat GGUF model

Create a Model resource to load the Llama 2 13b chat GGUF model:

apiVersion: substratus.ai/v1
kind: Model
metadata:
  name: llama2-13b-chat-gguf
spec:
  image: substratusai/model-loader-huggingface
  params:
    name: substratusai/Llama-2-13B-chat-GGUF
    files: "model.bin"

Apply it to your cluster:

kubectl apply -f https://raw.githubusercontent.com/substratusai/substratus/main/examples/llama2-13b-chat-gguf/base-model.yaml

The model is being downloaded from HuggingFace into your Kind cluster.

Serve the model

Create a Server resource to serve the model:

apiVersion: substratus.ai/v1
kind: Server
metadata:
  name: llama2-13b-chat-gguf
spec:
  image: substratusai/model-server-llama-cpp:latest-gpu
  model:
    name: llama2-13b-chat-gguf
  params:
    n_gpu_layers: 30
  resources:
    gpu:
      count: 1

Apply it to your cluster:

kubectl apply -f https://raw.githubusercontent.com/substratusai/substratus/main/examples/llama2-13b-chat-gguf/server-gpu.yaml

Note that in my case 30 out of 42 layers loaded into the GPU was the maximum, but you might be able to load all 42 layers into the GPU if you have more GPU memory.

Once the model is ready it will start serving an OpenAI compatible API endpoint.

Expose the Server to a local port by using port forwarding:

kubectl port-forward service/llama2-13b-chat-gguf-server 8080:8080

Let's throw some prompts at it:

curl http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{ "prompt": "Who was the first president of the United States?", "stop": ["."]}'

Check out the full API docs here: http://localhost:8080/docs
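
The server exposes an OpenAI-style completions endpoint, so you can also call it from Python. A rough sketch using the requests library, assuming the port-forward above is still running:

import requests

resp = requests.post(
    "http://localhost:8080/v1/completions",
    json={"prompt": "Who was the first president of the United States?", "stop": ["."]},
    timeout=120,
)
print(resp.json()["choices"][0]["text"])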

You can play around with other models. For example, if you have a 24 GB GPU card you should be able to run Llama 2 70B in 4 bit mode by using llama.cpp.

Introducing: kubectl notebook

[Figure: kubectl notebook]

Substratus has added the kubectl notebook command!

"Wouldn't it be nice to have a single command that containerized your local directory and served it as a Jupyter Notebook running on a machine with a bunch of GPUs attached?"

The conversation went something like that while we daydreamed about our preferred workflow. At that point in time we were hopping back-n-forth between Google Colab and our containers while developing a LLM training job.

"Annnddd it should automatically sync file-changes back to your local directory so that you can commit your changes to git and kick off a long-running ML training job - containerized with the exact same python version and packages!"

So we built it!

kubectl notebook -d .

And now it has become an integral part of our workflow as we build out the Substratus ML platform.

Check out the 50 second screenshare:

Design Goals

  1. One command should build, launch, and sync the Notebook.
  2. Users should only need a Kubeconfig - no other credentials.
  3. Admins should not need to setup networking, TLS, etc.

Implementation

We tackled our design goals using the following techniques:

  1. Implemented as a single Go binary, executed as a kubectl plugin.
  2. Signed URLs allow for users to upload their local directory to a bucket without requiring cloud credentials (Similar to how popular consumer clouds function).
  3. Kubernetes port-forwarding allows for serving remote notebooks without requiring admins to deal with networking / TLS concerns. It also leans on existing Kubernetes RBAC for access control.

Some interesting details:

  • Builds are executed remotely for two reasons:
    • Users don't need to install docker.
    • It avoids pushing massive container images from one's local machine (pip installs often inflate the final docker image to be much larger than the build context itself).
  • The client requests an upload URL by specifying the MD5 hash it wishes to upload - allowing for server-side signature verification (see the sketch after this list).
  • Builds are skipped entirely if the MD5 hash of the build context already exists in the bucket.
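
To make the signed-URL flow concrete, here is a rough client-side sketch. It is illustrative only: the endpoint path and response fields are hypothetical stand-ins, not the actual Substratus API.

import hashlib
import io
import tarfile

import requests

# Tar the local build context in memory and hash it (illustration only).
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:gz") as tar:
    tar.add(".", arcname=".")
data = buf.getvalue()
md5 = hashlib.md5(data).hexdigest()

# Hypothetical endpoint and fields: ask the cluster-side service for a signed
# upload URL for this exact MD5 so it can verify the upload server-side.
resp = requests.post("http://substratus.example/upload-url", json={"md5": md5}).json()
if not resp.get("exists", False):
    # Upload only if the bucket doesn't already contain this build context.
    requests.put(resp["signed_url"], data=data)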

The system underneath the notebook command:

[Figure: system diagram]

More to come!

Lazy-loading large models from disk... Incremental dataset loading... Stay tuned to learn more about how Notebooks on Substratus can speed up your ML workflows.

Don't forget to star and follow the repo!

https://github.com/substratusai/substratus

Tutorial: Llama2 70b serving on GKE

Llama 2 70b is the newest iteration of the Llama model published by Meta, sporting 70 billion parameters. Follow along in this tutorial to get Llama 2 70b deployed on GKE:

  1. Create a GKE cluster with Substratus installed.
  2. Load the Llama 2 70b model from HuggingFace.
  3. Serve the model via an interactive inference server.

Install Substratus on GCP

Use the Installation Guide for GCP to install Substratus.

Load the Model into Substratus

You will need to agree to HuggingFace's terms before you can use the Llama 2 model. This means you will need to pass your HuggingFace token to Substratus.

Let's tell Substratus how to import Llama 2 by defining a Model resource. Create a file named base-model.yaml with the following content:

apiVersion: substratus.ai/v1
kind: Model
metadata:
  name: llama-2-70b
spec:
  image: substratusai/model-loader-huggingface
  env:
    # You would first have to create a secret named `ai` that
    # has the key `HUGGING_FACE_HUB_TOKEN` set to your token.
    # E.g. create the secret by running:
    # kubectl create secret generic ai --from-literal="HUGGING_FACE_HUB_TOKEN=<my-token>"
    HUGGING_FACE_HUB_TOKEN: ${{ secrets.ai.HUGGING_FACE_HUB_TOKEN }}
  params:
    name: meta-llama/Llama-2-70b-hf

Get your HuggingFace token by going to HuggingFace Settings > Access Tokens.

Create a secret with your HuggingFace token:

kubectl create secret generic ai --from-literal="HUGGING_FACE_HUB_TOKEN=<my-token>"

Make sure to replace <my-token> with your actual token.

Run the following command to load the base model:

kubectl apply -f base-model.yaml

Watch Substratus kick off the model import Job:

kubectl get jobs -w

You can view the Job logs by running:

kubectl logs -f jobs/llama-2-70b-modeller

Serve the Loaded Model

While the Model is loading, we can define our inference server. Create a file named server.yaml with the following content:

apiVersion: substratus.ai/v1
kind: Server
metadata:
  name: llama-2-70b
spec:
  image: substratusai/model-server-basaran
  model:
    name: llama-2-70b
  env:
    MODEL_LOAD_IN_4BIT: "true"
  resources:
    gpu:
      type: nvidia-a100
      count: 1

Create the Server by running:

kubectl apply -f server.yaml

Once the Model is loaded (marked as ready), Substratus will automatically launch the server. View the state of both resources using kubectl:

kubectl get models,servers

To view more information about either the Model or Server, you can use kubectl describe:

kubectl describe -f base-model.yaml
# OR
kubectl describe -f server.yaml

Once the model is loaded, the initial server startup time is about 20 minutes. This is because the model is 100GB+ in size and takes a while to load into GPU memory.

Look for a log message that the container is serving at port 8080. You can check the logs by running:

kubectl logs deployment/llama-2-70b-server

For demo purposes, you can use port forwarding once the Server is ready on port 8080. Run the following command to forward the container port 8080 to your localhost port 8080:

kubectl port-forward service/llama-2-70b-server 8080:8080

Interact with Llama 2 in your browser: http://localhost:8080

You have now deployed Llama 2 70b!

You can repeat these steps for other models. For example, you could instead deploy the "Instruct" variation of Llama.

Stay tuned for another blog post on how to fine-tune Llama 2 70b on your own data.