Run LLMs from Hugging Face on GCP with Cloud Run and Cloud Storage

GCP LLMs architecture

This article demonstrates how to run a custom Large Language Model on GCP with Cloud Run, using Cloud Storage as a network file system to host the weights downloaded from a model hub. By leveraging Cloud Storage, the weights are downloaded only once, so we can scale the number of Cloud Run containers up or down faster, since new instances do not have to spend extra time downloading the model weights again and again.

We will see how to mount a Cloud Storage bucket onto our Cloud Run container using the open source FUSE adapter to share data between multiple containers and services.

Note that in this article we will use the FUSE adapter to provide a persistent filesystem to our container, but alternatively we could also use Filestore, Google's managed NFS service.

Design Overview

The diagram above illustrates the overall architecture of the solution, which we explain in the following sections.

Note that for best performance and to avoid networking costs, it is best to have the Cloud Run service and the Cloud Storage bucket located in the same region.

Cloud Storage

We need to set up a Cloud Storage bucket. Let's first define some environment variables:

export PROJECT_ID=
export REGION=
export BUCKET_NAME=

Create a Cloud Storage bucket or reuse an existing bucket:

gsutil mb -l $REGION gs://$BUCKET_NAME

Create a service account to serve as the service identity:

gcloud iam service-accounts create fs-identity

Grant the service account access to the Cloud Storage bucket:

gcloud projects add-iam-policy-binding $PROJECT_ID \
     --member "serviceAccount:fs-identity@$PROJECT_ID.iam.gserviceaccount.com" \
     --role "roles/storage.objectAdmin"

Dockerfile

The following Dockerfile defines the environment configuration for our service. First, the RUN instruction installs tini as the init process and gcsfuse, the FUSE adapter. Then it creates a working directory, copies the source code, and installs the Python dependencies listed in requirements.txt.

The ENTRYPOINT launches the tini init-process binary, which proxies all received signals to its child processes. The CMD instruction executes the startup script that actually launches the Python application; since the exec form of CMD does not expand environment variables, we reference the script by its absolute path.

# Use the official lightweight Python image.
# https://hub.docker.com/_/python
FROM python:3.11-buster

# Install system dependencies
RUN set -e; \
    apt-get update -y && apt-get install -y \
    tini \
    lsb-release; \
    gcsFuseRepo=gcsfuse-`lsb_release -c -s`; \
    echo "deb http://packages.cloud.google.com/apt $gcsFuseRepo main" | \
    tee /etc/apt/sources.list.d/gcsfuse.list; \
    curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | \
    apt-key add -; \
    apt-get update; \
    apt-get install -y gcsfuse \
    && apt-get clean

# Set fallback mount directory
ENV MNT_DIR /mnt/gcs

# Copy local code to the container image.
ENV APP_HOME /app
WORKDIR $APP_HOME
COPY . ./

# Install production dependencies.
RUN pip install -r requirements.txt

# Ensure the script is executable
RUN chmod +x /app/entrypoint.sh

# Use tini to manage zombie processes and signal forwarding
ENTRYPOINT ["/usr/bin/tini", "--"] 

# Pass the startup script as arguments to Tini
CMD ["$APP_HOME/entrypoint.sh"]

Startup script

In the startup script, we first create the mount point directory where the Cloud Storage bucket will be made accessible. Then, using the gcsfuse command, we attach the Cloud Storage bucket to the mount point we just created. Once the bucket is attached, we run the Python script that downloads the LLM weights from the model hub; the download is skipped if the weights are already present on the bucket. Lastly, we start the application server that will receive and handle the actual HTTP traffic.

#!/usr/bin/env bash
set -eo pipefail

# Create mount directory for service
mkdir -p $MNT_DIR

echo "Mounting GCS Fuse."
gcsfuse --debug_gcs --debug_fuse $BUCKET $MNT_DIR 
echo "Mounting completed."

# Create directory for Hugging Face 
mkdir -p $MNT_DIR/hf

# Export needed environment variables
export HF_HOME=$MNT_DIR/hf
export HF_HUB_ENABLE_HF_TRANSFER=1
export SAFETENSORS_FAST_GPU=1
export BITSANDBYTES_NOWELCOME=1
export PIP_DISABLE_PIP_VERSION_CHECK=1
export PIP_NO_CACHE_DIR=1

# Download model weights
echo "Downloading from Hugging Face Hub."
python3 "$APP_HOME/download.py"
echo "Downloading completed."

# Run the web service
python3 $APP_HOME/main.py

Note: The gcsfuse command has built-in retry functionality, so no special retry handling is required.

LLM application

This section details the different files needed by our LLM application.

Declaring dependencies

First, declare the dependencies needed to install PyTorch and the other libraries required to download the LLM weights and run inference. Let's create a requirements.txt file in the same directory as the previous shell script entrypoint.sh, with the following content:

torch~=2.0.0
flask
google-cloud-storage
huggingface_hub
transformers~=4.28.1
safetensors~=0.3.0
accelerate~=0.18.0
bitsandbytes~=0.38.1
sentencepiece~=0.1.98
hf-transfer~=0.1.3
msgspec~=0.14.2

Downloading weights

Then, we create a download.py Python script that downloads the model weights from the Hugging Face Hub using the snapshot_download function. Because HF_HOME was exported in the startup script, the weights are cached under the mount point MNT_DIR.

The model we are downloading is google/flan-t5-base, a sequence-to-sequence (text-to-text) language model that we will use in our application to generate text, but any other model can be used as well.

#!/usr/bin/env python3
from huggingface_hub import snapshot_download

# HF_HOME is set in entrypoint.sh, so the cache lives on the mounted bucket;
# files already present in the cache are not downloaded again.
model_path = snapshot_download(
    "google/flan-t5-base",
    ignore_patterns=["*.md"],
)
print(f"Weights available at {model_path}")
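
Because the cache lives on the mounted bucket, subsequent container instances will find the weights already in place and snapshot_download will return immediately. To inspect what is currently cached on the bucket, a small sketch using the cache scanner from huggingface_hub (assuming HF_HOME is exported as in entrypoint.sh) could look like this:

#!/usr/bin/env python3
from huggingface_hub import scan_cache_dir

# Scans the Hugging Face cache under $HF_HOME, i.e. on the mounted bucket.
cache_info = scan_cache_dir()
for repo in cache_info.repos:
    print(repo.repo_id, f"{repo.size_on_disk / 1e6:.1f} MB")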

Loading LLM

Next, in a model.py Python file, we define a helper class that loads the model from the local filesystem (by passing the local_files_only flag) and exposes a function to run inference, in our case text generation. Since flan-t5 is a sequence-to-sequence model, we use the text2text-generation pipeline.

import os

import torch
from transformers import AutoTokenizer, pipeline

class TextGenerationLLM:
    def __init__(self, model_id: str = "google/flan-t5-base"):
        # make sure we don't connect to the HF Hub
        os.environ["HF_HUB_OFFLINE"] = "1"
        os.environ["TRANSFORMERS_OFFLINE"] = "1"
        # set up the text generation pipeline (flan-t5 is a seq2seq model)
        tokenizer = AutoTokenizer.from_pretrained(model_id, local_files_only=True)
        self.generator = pipeline(
            "text2text-generation",
            model=model_id,
            tokenizer=tokenizer,
            torch_dtype=torch.float16,
            device_map="auto",
            model_kwargs={"local_files_only": True},
        )
        self.generator.model = torch.compile(self.generator.model)

    def generate(self, prompt: str) -> str:
        results = self.generator(prompt, max_length=1000, do_sample=True)
        return results[0]["generated_text"]
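
For a quick local sanity check of this class (assuming the weights are already in the local Hugging Face cache), it can be used directly from a Python shell:

from model import TextGenerationLLM

# Loads google/flan-t5-base from the local cache and compiles it.
llm = TextGenerationLLM()
print(llm.generate("Translate English to German: Good morning"))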

Starting app

The last file we need to define is main.py, which creates a Flask application and uses TextGenerationLLM to load the model previously cached under the mount point MNT_DIR. The application receives HTTP requests at the /predict endpoint and passes the request body to the LLM for text generation.

import os

from flask import Flask, request

from model import TextGenerationLLM

app = Flask(__name__)

# Weights are resolved offline from the cache stored on the mounted bucket.
llm = TextGenerationLLM()

@app.route('/predict', methods=['POST'])
def predict():
    prompt = request.get_data(as_text=True)
    return llm.generate(prompt)

if __name__ == "__main__":
    # Cloud Run tells the container which port to listen on via $PORT.
    app.run(host='0.0.0.0', port=int(os.environ.get("PORT", 8080)))

Note: in a production setting we would need to do some checks and validations on the user input before passing it to our LLM.
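
As an example, a minimal sketch of such validation, replacing the predict handler above (the size limit is an arbitrary placeholder), could reject empty or oversized prompts before they reach the model:

from flask import abort

MAX_PROMPT_CHARS = 2000  # arbitrary limit, tune for your use case

@app.route('/predict', methods=['POST'])
def predict():
    prompt = request.get_data(as_text=True).strip()
    if not prompt:
        abort(400, description="Empty prompt")
    if len(prompt) > MAX_PROMPT_CHARS:
        abort(413, description="Prompt too long")
    return llm.generate(prompt)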

Cloud Run

Finally, we can deploy the container image to Cloud Run using gcloud:

gcloud run deploy llm-run --source . \
    --region $REGION \
    --execution-environment gen2 \
    --allow-unauthenticated \
    --service-account fs-identity \
    --update-env-vars BUCKET=$BUCKET_NAME

Note: To mount a file system, we need to use the Cloud Run 2nd generation execution environment.

After the deployment finishes and the service becomes available, we can test it with the following curl command (note the /predict path):

curl -X POST -H "Content-Type: text/plain" -d "Tell me a joke" https://my-service-abcdef-uc.a.run.app/predict
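
The same request can also be sent from Python with the requests library (not part of requirements.txt; the service URL is the placeholder used above):

import requests

# Placeholder URL; use the one printed by `gcloud run deploy`.
url = "https://my-service-abcdef-uc.a.run.app/predict"
resp = requests.post(url, data="Tell me a joke",
                     headers={"Content-Type": "text/plain"})
print(resp.text)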

That’s all folks

In this article, we saw how easy it is to use Google Cloud Run to package LLM applications and leverage Cloud Storage to store the model weights once and for all.

I hope you enjoyed this article. Feel free to leave a comment or reach out on Twitter @bachiirc.