Deploying Mixtral on GKE with 2 x L4 GPUs

by Sam Stoelinga

A100 and H100 GPUs are hard to get. They are also expensive. What if you could run Mixtral on just 2 x L4 24GB GPUs? The L4 GPUs are more attainable today (Feb 10, 2024) and are also cheaper. Learn how to easily deploy Mixtral on GKE with 2 x L4 GPUs in this blog post.


How much GPU memory is needed? Will it fit on 2 x L4?
For this post, we're using GPTQ quantization to load the model parameters in 4-bit precision. The estimated GPU memory needed with 4-bit GPTQ quantization is:

$$M = \dfrac{(8 \times 7 \times 10^9 \times 4\ \mathrm{bytes})}{(32 / 4)} \times 1.2 = 33.6\ \mathrm{GB}$$

A single L4 GPU has 24GB of GPU memory, so 2 x L4 gives 48GB, which is more than enough to serve the Mixtral 8 x 7 billion parameter model. You can read the Calculating GPU memory for serving LLMs blog post for more information.
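The estimate above can be reproduced with a few lines of Python. This is a minimal sketch of the same formula; the function name and its parameters are illustrative and do not come from any library.

```python
def estimate_serving_memory_gb(params: float, quant_bits: int, overhead: float = 1.2) -> float:
    """Estimate the GPU memory (in GB) needed to serve a model.

    params:     number of model parameters
    quant_bits: bits per parameter after quantization (4 for 4-bit GPTQ)
    overhead:   multiplier for everything besides the weights (1.2 in the formula)
    """
    bytes_at_32_bit = params * 4       # 4 bytes per parameter at 32-bit precision
    compression = 32 / quant_bits      # how much smaller the quantized weights are
    return bytes_at_32_bit / compression * overhead / 1e9


if __name__ == "__main__":
    needed = estimate_serving_memory_gb(params=8 * 7e9, quant_bits=4)
    available = 2 * 24  # two L4 GPUs with 24 GB each
    print(f"needed ~{needed:.1f} GB, available {available} GB, fits: {needed < available}")
```

Changing quant_bits to 16 shows why the unquantized model would not fit on the same hardware.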

The post is structured in the following sections:

  1. Creating a GKE cluster with a spot L4 GPU node pool
  2. Downloading the model into a ReadOnlyMany PVC
  3. Deploying Mixtral GPTQ using Helm and vLLM
  4. Trying out some prompts with Mixtral

Creating the GKE cluster

Create a cluster with a CPU nodepool for system services:

export CLUSTER_LOCATION=us-central1
export PROJECT_ID=$(gcloud config get-value project)
export CLUSTER_NAME=substratus
export NODEPOOL_ZONE=us-central1-a
gcloud container clusters create ${CLUSTER_NAME} --location ${CLUSTER_LOCATION} \
  --machine-type e2-medium --num-nodes 1 --min-nodes 1 --max-nodes 5 \
  --autoscaling-profile optimize-utilization --enable-autoscaling \
  --node-locations ${NODEPOOL_ZONE} --workload-pool ${PROJECT_ID}.svc.id.goog \
  --enable-image-streaming --enable-shielded-nodes --shielded-secure-boot \
  --shielded-integrity-monitoring \
  --addons GcsFuseCsiDriver

Create a GPU nodepool where each VM has 2 x L4 GPUs and uses Spot pricing:

gcloud container node-pools create g2-standard-24 \
  --accelerator type=nvidia-l4,count=2,gpu-driver-version=latest \
  --machine-type g2-standard-24 --ephemeral-storage-local-ssd=count=2 \
  --spot --enable-autoscaling --enable-image-streaming \
  --num-nodes=0 --min-nodes=0 --max-nodes=3 --cluster ${CLUSTER_NAME} \
  --node-locations "${NODEPOOL_ZONE}" --location ${CLUSTER_LOCATION}

Downloading the model into a ReadOnlyMany PVC

Downloading an 8 x 7 billion parameter model every time you launch an inference server is slow and expensive in egress costs. Instead, we'll download the model once to a Persistent Volume Claim (PVC). That PVC will then be used in ReadOnlyMany mode across all the Mixtral serving instances.

Create a file named pvc.yaml with the following content:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mixtral-8x7b-instruct-gptq
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 30Gi

Create the PVC to store the model weights on:

kubectl apply -f pvc.yaml

The following Job downloads the model to the PVC mixtral-8x7b-instruct-gptq using the Hugging Face Hub API. The model is downloaded to the /model directory in the PVC. The revision parameter is set to gptq-4bit-32g-actorder_True to download the 4-bit GPTQ quantized version of the model.

Create a file named load-model-job.yaml with the following content:

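apiVersion: batch/v1
kind: Job
metadata:
  name: load-model-job-mixtral-8x7b-instruct-gptq
spec:
  template:
    spec:
      volumes:
        - name: model
          persistentVolumeClaim:
            claimName: mixtral-8x7b-instruct-gptq
      containers:
      - name: model-loader
        image: python:3.11
        volumeMounts:
        - mountPath: /model
          name: model
        command:
        - /bin/bash
        - -c
        - |
          pip install huggingface_hub
          python3 - << EOF
          from huggingface_hub import snapshot_download
          # Assumption: the repo id is not stated in the text; use the GPTQ repo you want to serve.
          model_id = "TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ"
          snapshot_download(repo_id=model_id, local_dir="/model", cache_dir="/model",
                            revision="gptq-4bit-32g-actorder_True")
          EOF
      restartPolicy: Never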

Launch the load model Job:

kubectl apply -f load-model-job.yaml

You can watch the progress of the Job using the following command:

kubectl logs -f job/load-model-job-mixtral-8x7b-instruct-gptq

After a few minutes, the model will be downloaded to the PVC.

Deploying Mixtral using Helm

We maintain a Helm chart for vLLM, available on GitHub at substratusai/helm. We've also published a vLLM container image that is configured through environment variables, available on GitHub at substratusai/vllm-docker.

Install the Helm repo:

helm repo add substratusai
helm repo update

Create a file named values.yaml with the following content:

model: /model
servedModelName: mixtral-8x7b-instruct-gptq
readManyPVC:
  enabled: true
  sourcePVC: "mixtral-8x7b-instruct-gptq"
  mountPath: /model
  size: 30Gi

quantization: gptq
dtype: half
maxModelLen: 8192
gpuMemoryUtilization: "0.8"

resources:
  limits:
    nvidia.com/gpu: 2

nodeSelector:
  cloud.google.com/gke-accelerator: nvidia-l4

replicaCount: 1

Notice that we're specifying a readManyPVC with the sourcePVC set to mixtral-8x7b-instruct-gptq, which we created in the previous step. The mountPath is set to /model, and the model parameter for vLLM points to that local path. The quantization parameter is set to gptq and the dtype parameter is set to half, which is required for GPTQ quantization. The gpuMemoryUtilization is set to 0.8 because otherwise you will get an out of GPU memory error. The replicaCount is set to 1, so the deployment starts with 1 pod as soon as you install the Helm chart.
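To get a feel for what a gpuMemoryUtilization of 0.8 means on this hardware, here is a rough back-of-the-envelope sketch. It assumes that the quantized weights are split evenly across the two GPUs and uses the 8 x 7 billion parameter simplification from earlier in this post; neither assumption is confirmed by the chart, and the helper names are made up for illustration.

```python
def per_gpu_budget_gb(gpu_memory_gb: float, utilization: float) -> float:
    """Memory on each GPU that the server is allowed to use."""
    return gpu_memory_gb * utilization


def per_gpu_weights_gb(params: float, quant_bits: int, num_gpus: int) -> float:
    """Size of the quantized weights per GPU, assuming an even split (an assumption)."""
    total_bytes = params * quant_bits / 8
    return total_bytes / num_gpus / 1e9


if __name__ == "__main__":
    budget = per_gpu_budget_gb(gpu_memory_gb=24, utilization=0.8)
    weights = per_gpu_weights_gb(params=8 * 7e9, quant_bits=4, num_gpus=2)
    print(f"budget per GPU:  {budget:.1f} GB")
    print(f"weights per GPU: {weights:.1f} GB")
    print(f"left for everything else: {budget - weights:.1f} GB")
```

Under these assumptions the weights fit inside the per-GPU budget, with the remainder left for the KV cache and activations.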

Looking for an autoscaling Mixtral deployment that supports scale from 0? Take a look at Lingo: ML Proxy and autoscaler for K8s.

Install the Helm chart:

helm install mixtral-8x7b-instruct-gptq substratusai/vllm -f values.yaml

After a while you can check whether pods are running:

kubectl get pods

Once the pods are running, check logs of the deployment pods:

kubectl logs -f deployment/mixtral-8x7b-instruct-gptq

Send some prompts to Mixtral

The Helm chart created a K8s Service of type ClusterIP named mixtral-8x7b-instruct-gptq-vllm. By default, the Service is only accessible from within the cluster. You can use kubectl port-forward to forward the Service to your local machine:

kubectl port-forward service/mixtral-8x7b-instruct-gptq-vllm 8080:80

Send a prompt to the Mixtral model using the following command:

curl http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "mixtral-8x7b-instruct-gptq", "prompt": "<s>[INST]Who was the first president of the United States?[/INST]", "max_tokens": 40}'
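The same request can be sent from Python using only the standard library. This is a minimal sketch that mirrors the curl command above; the helper names build_payload and complete are made up for illustration, and the response parsing assumes an OpenAI-style completions response with a choices list.

```python
import json
import urllib.request


def build_payload(question: str, model: str = "mixtral-8x7b-instruct-gptq",
                  max_tokens: int = 40) -> dict:
    """Wrap a question in the [INST] ... [/INST] format used in the curl example."""
    return {
        "model": model,
        "prompt": f"<s>[INST]{question}[/INST]",
        "max_tokens": max_tokens,
    }


def complete(question: str, base_url: str = "http://localhost:8080") -> str:
    """POST the payload to /v1/completions and return the generated text."""
    request = urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(build_payload(question)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    # Assumption: OpenAI-style response shape, {"choices": [{"text": "..."}]}
    return body["choices"][0]["text"]
```

With the port-forward from the previous step still running, calling complete("Who was the first president of the United States?") should return the model's answer as a string.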

Got more questions? Don't hesitate to join our Discord and ask away.