Scale LLM-based applications to millions with LangChain and GPTCache

06 Jul 2023 by dzlab

Overview

In software engineering, whenever producing the result of a query is expensive, a cache is used to avoid spending resources on calculating the same result again and again. A cache is usually a key-value data structure that is used as follows:

For a query seen for the first time, the result is stored temporarily in a high-speed storage layer (e.g. RAM or SSD).
When a new query arrives, we first check whether a result is available in the cache before triggering a new calculation.
The result is sent back to the client and also stored in the cache for later retrieval.
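
The steps above can be sketched with a plain dictionary. This is only a minimal illustration of exact-match caching, not how GPTCache works internally; expensive_compute is a hypothetical stand-in for any costly call, such as a request to a model endpoint.

```python
from typing import Callable, Dict


def make_cached(compute: Callable[[str], str]) -> Callable[[str], str]:
    """Wrap `compute` with an in-memory, exact-match cache."""
    store: Dict[str, str] = {}

    def cached(query: str) -> str:
        if query in store:        # cache hit: skip the computation
            return store[query]
        result = compute(query)   # cache miss: compute once
        store[query] = result     # keep the result for next time
        return result

    return cached


calls = []


def expensive_compute(query: str) -> str:
    # Hypothetical stand-in for a slow or paid call
    calls.append(query)
    return query.upper()


cached_compute = make_cached(expensive_compute)
first = cached_compute("hello")    # computed
second = cached_compute("hello")   # served from the cache
```

After the two calls, expensive_compute has run only once. An exact-match cache like this one misses as soon as the wording of the query changes, which is the limitation that embedding-based caches address.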

In most cases, using a cache boosts application performance, improves scalability, and reduces operational and financial costs (see OpenAI API pricing).

In the case of LLM applications, caching usually relies on embedding algorithms to convert queries into embeddings, and on a vector store to run a similarity search over these embeddings. This allows similar prompts or queries to be identified and retrieved from the cache, so that answers are returned immediately without calling the model endpoint.
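
To make the idea concrete, here is a small sketch of a similarity-based cache. Everything in it is an assumption made for illustration: the embed function is a toy word-count embedding rather than a real embedding model, the linear scan replaces a vector store, and the 0.5 threshold is arbitrary.

```python
import math
from collections import Counter
from typing import List, Optional, Tuple


def embed(text: str) -> Counter:
    # Toy embedding (assumption): lowercase word counts, not a real model
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    def __init__(self, threshold: float = 0.5):
        self.threshold = threshold                      # minimum similarity for a hit
        self.entries: List[Tuple[Counter, str]] = []    # (embedding, answer) pairs

    def put(self, query: str, answer: str) -> None:
        self.entries.append((embed(query), answer))

    def get(self, query: str) -> Optional[str]:
        q = embed(query)
        best_score, best_answer = 0.0, None
        for vec, answer in self.entries:                # brute-force nearest neighbour
            score = cosine(q, vec)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None


cache_demo = SemanticCache(threshold=0.5)
cache_demo.put("what is a vector database", "A store for embeddings.")
hit = cache_demo.get("what is a vector database exactly")   # similar wording
miss = cache_demo.get("how tall is mount everest")          # unrelated query
```

The first lookup is close enough to the stored query to return the cached answer, while the second falls below the threshold and returns None, which is when a real application would call the model.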

Enter GPTCache

The LangChain library has become the backbone of many LLM-based applications. It simplifies development considerably and allows the chaining (hence the name) of different components: prompt optimization, model API calls, etc. It also provides several ways to cache prompt-completion pairs via third-party integrations. GPTCache is one of the well-supported LLM cache systems.

Figure: infrastructure related to GPTCache

As depicted in the above diagram, GPTCache has several modules:

LLM Adapter allows smooth integration with LLMs
Multimodal Adapter allows integration with multimodal models
Embedding Generator allows the use of several embedding algorithms, such as OpenAI embeddings
Cache Storage saves LLM responses
Vector Store supports vector databases such as Milvus, FAISS and Chroma, among others
Cache Manager implements different eviction strategies so that the cache stays clean and does not fill up
Similarity Evaluator collects data from the Cache Storage and the Vector Store and evaluates the similarity between the input request and the stored embeddings
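
As an illustration of what an eviction strategy does, here is a sketch of least-recently-used (LRU) eviction, one common policy. It is a generic example built on the standard library, not the code of GPTCache's Cache Manager.

```python
from collections import OrderedDict
from typing import Optional


class LRUCache:
    """Keep at most `max_size` entries, evicting the least recently used one."""

    def __init__(self, max_size: int):
        self.max_size = max_size
        self._data: "OrderedDict[str, str]" = OrderedDict()

    def get(self, key: str) -> Optional[str]:
        if key not in self._data:
            return None
        self._data.move_to_end(key)          # mark as most recently used
        return self._data[key]

    def put(self, key: str, value: str) -> None:
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.max_size:
            self._data.popitem(last=False)   # evict the least recently used entry


lru = LRUCache(max_size=2)
lru.put("q1", "a1")
lru.put("q2", "a2")
lru.get("q1")         # q1 becomes the most recently used entry
lru.put("q3", "a3")   # the cache is full, so q2 is evicted
```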

LLM-based application

In the remainder of this article, we will see how caching the responses generated by language models improves the efficiency and speed of LLM-based applications, and in particular how we can limit cost by reducing network traffic to the OpenAI API. We will:

Build a knowledge base of Arxiv papers for testing
Create embeddings for documents and store them in a vector database
Set up LangChain to query data from the vector database
Use GPTCache to reduce network requests

Setup

Let’s start by setting up everything.

LLM

We can use any LLM for this experiment, but for simplicity we will go with OpenAI. Sign up to the service and generate an API key. Then create a .env file to store the key as follows:

OPENAI_API_KEY=<your_key_here>

Installation

First, let’s install all necessary libraries

pip install langchain gptcache openai tiktoken python-dotenv arxiv pypdf

Then, import general purpose libraries

from urllib.error import HTTPError
from dotenv import load_dotenv
from tqdm import tqdm
import os

import logging
import arxiv
import time

Import the LangChain helpers and classes, for instance RecursiveCharacterTextSplitter, which recursively tries a list of separator characters to find a good place to split the text, and PyPDFDirectoryLoader, which loads PDFs from a given directory.

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import PyPDFDirectoryLoader

from langchain import OpenAI
from langchain.chains.question_answering import load_qa_chain
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Milvus

Import GPTCache related helpers and classes, e.g. similarity evaluation function.

from gptcache.adapter.langchain_models import LangChainLLMs
from gptcache import cache
from gptcache.embedding import Onnx
from gptcache.manager import CacheBase, VectorBase, get_data_manager
from gptcache.similarity_evaluation.distance import SearchDistanceEvaluation

Then, load environment variables such as the OpenAI API key.

load_dotenv()
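
Since load_dotenv() does not raise when the .env file is missing, it can help to fail early if the key was not loaded. The helper below is a small sketch; require_env is a hypothetical name, not part of python-dotenv.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise a clear error."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; add it to .env or export it")
    return value


# Usage, once load_dotenv() has run:
# api_key = require_env("OPENAI_API_KEY")
```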

Vector Database

Next, we need to set up a vector database to store the embeddings. We will use Milvus, which can also serve as the vector store for the cache. Milvus is an open source database that can be self-hosted or used as a managed instance at https://cloud.zilliz.com/. In our case, we will use Docker Compose to run it locally.

First, download Milvus’s Docker Compose YAML file to run it

curl -L https://github.com/milvus-io/milvus/releases/download/v2.2.10/milvus-standalone-docker-compose.yml -o docker-compose.yml

Then start the Milvus database with:

docker-compose up -d

Wait a few seconds and the containers should be up and running. We can also check the status of the containers by running docker ps in the terminal.
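
Instead of waiting by hand, we could poll the Milvus port from Python. This is a sketch that assumes Milvus listens on localhost:19530, the address used later in this article; port_is_open and wait_for_port are hypothetical helper names. Note that an open port only shows that something is listening, not that Milvus has finished starting.

```python
import socket
import time


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_for_port(host: str, port: int, attempts: int = 30, delay: float = 1.0) -> bool:
    """Poll host:port until it accepts connections or the attempts run out."""
    for _ in range(attempts):
        if port_is_open(host, port):
            return True
        time.sleep(delay)
    return False


# Usage:
# if not wait_for_port("localhost", 19530):
#     raise RuntimeError("Milvus is not reachable on localhost:19530")
```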

GPTCache

As explained earlier, GPTCache is composed of multiple components, and each one can be configured separately. To work with GPTCache, you have to initialize it first.

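First, we define a function that takes a dictionary as input and returns the last part of the value stored under its prompt key, i.e. everything after the last occurrence of the “Question” string. For example, if the prompt is “Question: What is the meaning of life?”, the function returns “: What is the meaning of life?”. This way only the question part of the prompt, and not the context that precedes it, is used to look up the cache.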

def get_content_func(data, **_):
    return data.get("prompt").split("Question")[-1]
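
To see what this function keeps, we can try it on two prompts that share a question but carry different context. The prompt text below is hypothetical, only shaped like a typical question-answering template; the real template used by the chain may differ.

```python
def get_content_func(data, **_):
    # Keep everything after the last occurrence of "Question"
    return data.get("prompt").split("Question")[-1]


def build_prompt(context: str, question: str) -> str:
    # Hypothetical template: context first, question last
    return (
        "Use the context below to answer.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt_a = build_prompt("First retrieved chunk.", "What is the meaning of life?")
prompt_b = build_prompt("Another retrieved chunk.", "What is the meaning of life?")

key_a = get_content_func({"prompt": prompt_a})
key_b = get_content_func({"prompt": prompt_b})
```

Both keys are the same string, starting with the colon that follows “Question”, even though the two prompts differ in their context. Note that a context that itself contains the word “Question” would not change the result, since only the text after the last occurrence is kept.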

The next few lines of code create objects needed by the cache:

Onnx: to convert text into embeddings.
CacheBase: to store the cached data (questions and answers) in a database, SQLite here.
VectorBase: to store and search the embeddings in the Milvus vector database.
data_manager: to wrap the CacheBase and VectorBase objects behind a single interface.

onnx = Onnx()
cache_base = CacheBase("sqlite")
vector_base = VectorBase(
    "milvus",
    host="localhost", port="19530",
    dimension=onnx.dimension,
    collection_name="arxiv"
    )
data_manager = get_data_manager(cache_base, vector_base)

Then, we call the init() method on the cache object with the previously created objects to initialize GPTCache. The initialization takes several arguments, including:

pre_embedding_func: a function to extract the content from the input data.
embedding_func: a function to convert the content into embeddings.
data_manager: an object to manage the cache storage and the vector store.
similarity_evaluation: an object to evaluate the similarity between embeddings.

cache.init(
    pre_embedding_func=get_content_func,
    embedding_func=onnx.to_embeddings,
    data_manager=data_manager,
    similarity_evaluation=SearchDistanceEvaluation(),
)

Finally, we call the set_openai_key() method on the cache to set the OpenAI API key, which is read from the OPENAI_API_KEY environment variable.

cache.set_openai_key()

Knowledge base

Next, we need to build a knowledge base that we will query with the OpenAI API. We will use a collection of Arxiv papers that we will download in PDF format.

First, pick a query for selecting papers from Arxiv

search = arxiv.Search(
    query="A survey of Large Language Models",
    max_results=10,  # assumption: cap the number of papers, adjust as needed
)

Let’s have a look at the metadata of the papers

for result in search.results():
    print(f"    Link: {result.pdf_url}")
    print(f"      ID: {result.get_short_id()}")
    print(f"   Title: {result.title}")
    print(f"Category: {result.categories}")
    print(f" Summary: {result.summary[:200]}")

Create a directory to host the arxiv papers

ARXIV_DIR = "arxiv"
os.makedirs(ARXIV_DIR, exist_ok=True)  # do not fail if the directory already exists

Then, download the papers into that directory

for paper in tqdm(search.results()):
    paper.download_pdf(dirpath=ARXIV_DIR)
    print(f"Paper ID {paper.get_short_id()} with title '{paper.title}' is downloaded.")
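
Downloads can fail intermittently, so we may want to retry them. The helper below is a sketch that assumes failures surface as urllib's HTTPError (which we imported earlier); with_retries is a hypothetical name, and the number of attempts and the delay are arbitrary.

```python
import time
from urllib.error import HTTPError


def with_retries(action, attempts: int = 3, delay: float = 2.0):
    """Call `action()` and retry on HTTPError, sleeping `delay` seconds between tries."""
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except HTTPError:
            if attempt == attempts:
                raise              # give up after the last attempt
            time.sleep(delay)


# Usage inside the download loop:
# with_retries(lambda: paper.download_pdf(dirpath=ARXIV_DIR))
```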

Then, we load the pages from all the papers that we downloaded

# Load every page of every PDF found in the directory
loader = PyPDFDirectoryLoader(ARXIV_DIR)
pages = loader.load()

print(f"Total number of pages: {len(pages)}")

Next, we need to merge all pages into a single text block so we can split it using RecursiveCharacterTextSplitter.

# Concatenate the text of all pages
full_text = ''.join([page.page_content for page in pages])

# Replace line breaks with spaces and drop empty lines
full_text = " ".join(line for line in full_text.splitlines() if line)

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.create_documents([full_text])
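
To build an intuition for chunk_size and chunk_overlap, here is a naive fixed-size splitter. It is a simplification for illustration only: RecursiveCharacterTextSplitter additionally tries to break on separators, so its chunks will not match these exactly. chunk_text is a hypothetical helper.

```python
from typing import List


def chunk_text(text: str, chunk_size: int, chunk_overlap: int = 0) -> List[str]:
    """Slice `text` into pieces of at most `chunk_size` characters.

    Consecutive pieces share `chunk_overlap` characters.
    """
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap   # how far the window moves each time
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


sample = "abcdefghij"
no_overlap = chunk_text(sample, chunk_size=4)
with_overlap = chunk_text(sample, chunk_size=4, chunk_overlap=2)
```

With no overlap the sample is cut into "abcd", "efgh" and "ij". With an overlap of 2, every chunk repeats the last two characters of the previous one, which helps keep sentences that straddle a boundary searchable at the cost of more chunks to embed.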

Then, we calculate the embeddings for every chunk and store everything in our vector database

embeddings = OpenAIEmbeddings()

vector_db = Milvus.from_documents(
    docs,
    embeddings,
    connection_args={"host": "localhost", "port": "19530"}
    )

Querying the Knowledge base

Before proceeding further, we need to check that everything in the vector database is properly configured. For this, let’s run a simple sanity check query.

docs = vector_db.similarity_search("What are the latest achievements?")

Note: we could enable logging to see DEBUG messages about how requests are routed to OpenAI API or served from the cache.
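
One way to turn on such messages is through the standard logging module. This is a minimal sketch: which loggers actually emit DEBUG messages depends on the library versions installed, and "urllib3" below is only an example of a logger we might want to quiet.

```python
import logging

logging.basicConfig(
    level=logging.DEBUG,
    format="%(asctime)s %(name)s %(levelname)s %(message)s",
    force=True,  # replace handlers configured earlier (Python 3.8+)
)

# Optionally quiet a chatty logger; the name is just an example
logging.getLogger("urllib3").setLevel(logging.WARNING)
```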

We can ask the same question with returned documents as context to generate a response:

llm = LangChainLLMs(llm=OpenAI(temperature=0))
chain = load_qa_chain(llm, chain_type="stuff")
res = chain.run(input_documents=docs, question="What are the latest achievements?")
print(res)

At this point, the question and response pair is cached, and any new query that is considered similar will receive the same answer directly from the cache. Let’s confirm:

res = chain.run(input_documents=docs, question="Tell us about any recent advancements?")
print(res)

Now ask a different question that should not have a cached answer, so that a request is sent to the OpenAI API.

res = chain.run(input_documents=docs, question="Are we able to solve legal tasks?")
print(res)

And then a similar question

res = chain.run(input_documents=docs, question="Do we have the ability of legal interpretation and reasoning?")
print(res)
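
A simple way to tell cache hits from real API calls is to time each request. The sketch below uses a stand-in function instead of the real chain so that it runs on its own; fake_chain and its 0.2 second delay are assumptions made for the demonstration.

```python
import time


def timed(fn, *args, **kwargs):
    """Run `fn` and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# Stand-in for a chain call (assumption): slow on a miss, instant on a hit
_answers = {}


def fake_chain(question: str) -> str:
    if question not in _answers:
        time.sleep(0.2)   # simulate a network round trip
        _answers[question] = f"answer to {question}"
    return _answers[question]


_, cold = timed(fake_chain, "What are the latest achievements?")
_, warm = timed(fake_chain, "What are the latest achievements?")
print(f"cold: {cold:.3f}s, warm: {warm:.3f}s")
```

With the real chain, the same helper could be used as res, elapsed = timed(chain.run, input_documents=docs, question="Are we able to solve legal tasks?"). A request served from the cache should come back noticeably faster than one sent to the OpenAI API.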

Try this on a different set of papers, or even on your own knowledge base.

That’s all folks

I hope you enjoyed this article. Feel free to leave a comment or reach out on Twitter @bachiirc.
