Scale LLM-based applications to millions with LangChain and GPTCache
06 Jul 2023 by dzlab
Overview
In Software Engineering, whenever producing a result for a given query is costly, a cache is used to avoid wasting resources on computing the same result over and over. A cache is usually a key-value data structure used as follows (see the minimal sketch after this list):
- For a first-time query, the result is stored temporarily in a high-speed storage layer (e.g. RAM or SSDs),
- When a new query arrives, we first check whether a result is available in the cache before triggering a new calculation,
- Results are sent back to the client and also stored in the cache for future retrieval.
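Here is a minimal sketch of that pattern (a toy example; expensive_call stands in for whatever costly computation or API call you want to avoid repeating):
# Minimal exact-match cache: compute once, reuse on repeated queries.
cache = {}

def answer(query):
    if query in cache:              # cache hit: no expensive work
        return cache[query]
    result = expensive_call(query)  # cache miss: expensive_call is a placeholder for the costly computation
    cache[query] = result           # store the result for next time
    return result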
In most cases, using a cache boosts application performance, improves scalability, and reduces operational and financial costs (see OpenAI API pricing).
In the case of LLM applications, caching usually relies on embedding algorithms to convert queries into embeddings and on a vector store for similarity search over those embeddings. This allows similar prompts/queries to be identified and retrieved from the cache, so that answers are returned immediately without calling the model endpoints.
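Conceptually, a semantic cache works along these lines (a deliberately simplified sketch, not GPTCache's actual implementation; embed() is a placeholder for a real embedding model and the 0.9 threshold is arbitrary):
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms

semantic_cache = []  # list of (embedding, answer) pairs

def lookup(query, threshold=0.9):
    query_emb = embed(query)  # embed() is a placeholder for a real embedding model
    for emb, answer in semantic_cache:
        if cosine_similarity(emb, query_emb) >= threshold:
            return answer     # a similar question was already answered: cache hit
    return None               # cache miss: call the LLM, then append (query_emb, answer)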
Enter GPTCache
The LangChain library has become the backbone of LLM-based applications: it greatly simplifies development and allows the chaining (hence the name) of different components to streamline prompt optimization, invoke model APIs, etc. It provides several ways to cache prompt-completion pairs via third-party integrations, and GPTCache is one of the well-supported LLM cache systems.
GPTCache is organized into several modules:
- LLM Adapter allows a smooth integration with LLMs
- Multimodal Adapter allows integration with multimodal models
- Embedding Generator allows the use of several embedding algorithms, such as OpenAI embeddings
- Cache Storage saves LLM responses
- Vector Store supports vector databases such as Milvus, FAISS, and Chroma, among others
- Cache Manager implements different eviction strategies to keep the cache from growing unbounded
- Similarity Evaluator collects data from the Cache Storage and Vector Store and evaluates the similarity between the input request and the stored embeddings
LLM-based application
In the remainder of this article, we will see how caching the responses generated by language models improves the efficiency and speed of LLM-based applications, and in particular how we can limit cost by reducing network traffic to the OpenAI API. We will:
- Build a knowledge base of Arxiv papers for testing
- Create embeddings for documents and store them in a vector database
- Set up LangChain to query data from the vector database
- Use GPTCache to reduce network requests
Setup
Let’s start by setting up everything.
LLM
We can use any LLM for this experiment, but for simplicity we will go with OpenAI. Sign up for the service and generate an API key, then create a .env file to store the key as follows:
OPENAI_API_KEY=<your_key_here>
Installation
First, let’s install all necessary libraries
pip install langchain gptcache openai tiktoken python-dotenv arxiv pypdf pymilvus
Then, import general purpose libraries
from urllib.error import HTTPError
from dotenv import load_dotenv
from tqdm import tqdm
import os
import logging
import arxiv
import time
Import langchain related helpers and classes, for instance RecursiveCharacterTextSplitter, which recursively tries to find the best way (i.e. the best split character) to split the text, and PyPDFDirectoryLoader, which loads PDFs from a given directory. A toy illustration of the splitter follows the imports.
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import PyPDFDirectoryLoader
from langchain import OpenAI
from langchain.chains.question_answering import load_qa_chain
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Milvus
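Here is that toy illustration (arbitrarily small chunk sizes, separate from the actual pipeline we build below):
# Split a short sentence into small, overlapping chunks
splitter = RecursiveCharacterTextSplitter(chunk_size=20, chunk_overlap=5)
# prints a list of short, overlapping chunks split at whitespace where possible
print(splitter.split_text("LangChain splits long documents into overlapping chunks for retrieval."))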
Import GPTCache related helpers and classes, e.g. the similarity evaluation function.
from gptcache.adapter.langchain_models import LangChainLLMs
from gptcache import cache
from gptcache.embedding import Onnx
from gptcache.manager import CacheBase, VectorBase, get_data_manager
from gptcache.similarity_evaluation.distance import SearchDistanceEvaluation
Then, load the environment variables, such as the OpenAI API key.
load_dotenv()
Vector Database
Next, we need to set up a vector database to store the embeddings. We will use Milvus, which GPTCache also supports as a cache backend. Milvus is an open-source vector database that can be self-hosted, or you can use the managed Milvus instance at https://cloud.zilliz.com/. In our case, we will use Docker Compose to run it locally.
First, download Milvus’s Docker Compose YAML file to run it
curl -L https://github.com/milvus-io/milvus/releases/download/v2.2.10/milvus-standalone-docker-compose.yml -o docker-compose.yml
Then start the Milvus database with:
docker-compose up -d
Wait a few seconds and the containers should be up and running. We can also check the containers' status by running docker ps in the terminal.
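Optionally, we can confirm that Milvus is reachable from Python before going further (a small sanity check using the pymilvus client, which the LangChain and GPTCache Milvus integrations rely on):
from pymilvus import connections, utility

# Connect to the local Milvus instance and print its version
connections.connect(alias="default", host="localhost", port="19530")
print(utility.get_server_version())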
GPTCache
As explained earlier, GPTCache is composed of multiple components, each of which can be configured separately. In order to work with GPTCache, you have to initialize it first.
First, we define a function that takes a dictionary as input and returns the part of the prompt that follows the last occurrence of the string “Question” (see the example after the function definition). For instance, if the prompt is “Question: What is the meaning of life?”, the function returns “: What is the meaning of life?”.
def get_content_func(data, **_):
return data.get("prompt").split("Question")[-1]
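For instance, with a prompt shaped like the ones the question-answering chain builds later (a made-up string, for illustration only), the function keeps only the question part:
# Hypothetical prompt: long context followed by the question
sample = {"prompt": "Use the context to answer.\n\nQuestion: What is GPTCache?\nHelpful Answer:"}
# Everything after the last "Question" becomes the cache key content
print(get_content_func(sample))  # prints ": What is GPTCache?" then "Helpful Answer:" on a new line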
The next few lines of code create the objects needed by the cache:
- Onnx: converts text into embeddings.
- CacheBase: stores the cached LLM responses in a database (SQLite here).
- VectorBase: stores and searches the embeddings in the Milvus vector database.
- data_manager: wraps the CacheBase and VectorBase objects.
onnx = Onnx()
cache_base = CacheBase("sqlite")
vector_base = VectorBase(
"milvus",
host="localhost", port="19530",
dimension=onnx.dimension,
collection_name="arxiv"
)
data_manager = get_data_manager(cache_base, vector_base)
Then, we call the init() method on the cache object with the previously created objects to initialize GPTCache. The initialization takes several arguments, including:
- pre_embedding_func: a function to extract the content from the input data.
- embedding_func: a function to convert the content into embeddings.
- data_manager: an object to store the embeddings.
- similarity_evaluation: an object to evaluate the similarity between embeddings.
cache.init(
pre_embedding_func=get_content_func,
embedding_func=onnx.to_embeddings,
data_manager=data_manager,
similarity_evaluation=SearchDistanceEvaluation(),
)
Finally, we call the set_openai_key() method on the cache to set the OpenAI API key.
cache.set_openai_key()
Knowledge base
Next, we need to build a knowledge base that we will query through the OpenAI API. We will use a collection of Arxiv papers that we will download in PDF format.
First, pick a query for selecting papers from Arxiv
search = arxiv.Search(
    query = "A survey of Large Language Models",
    max_results = 10  # cap the number of papers to download
)
Let’s have a look at the metadata of the returned papers
for result in search.results():
print(f" Link: {result.pdf_url}")
print(f" ID: {result.get_short_id()}")
print(f" Title: {result.title}")
print(f"Category: {result.categories}")
print(f" Summary: {result.summary[:200]}")
Create a directory to host the arxiv papers
ARXIV_DIR = "arxiv"
os.makedirs(ARXIV_DIR, exist_ok=True)  # create the directory if it does not already exist
Then, download the papers into that directory
for paper in tqdm(search.results()):
paper.download_pdf(dirpath=ARXIV_DIR)
print(f"Paper ID {paper.get_short_id()} with title '{paper.title}' is downloaded.")
Then, we load the pages from all the papers that we downloaded
loader = PyPDFDirectoryLoader(ARXIV_DIR)
pages = loader.load()
print(f"Total number of pages: {len(pages)}")
Next, we need to merge all pages into a single text block so we can split it using RecursiveCharacterTextSplitter.
full_text = ''.join([page.page_content for page in pages])
full_text = " ".join(line for line in full_text.splitlines() if line)
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.create_documents([full_text])
Then, we calculate the embeddings for every chunk and store everything in our vector database
embeddings = OpenAIEmbeddings()
vector_db = Milvus.from_documents(
docs,
embeddings,
connection_args={"host": "localhost", "port": "19530"}
)
Querying the Knowledge base
Before proceeding further, we need to check that everything in the vector database is properly configured. For this, let’s run a simple sanity check query.
docs = vector_db.similarity_search("What are the latest achievements?")
Note: we could enable logging to see DEBUG messages about how requests are routed to the OpenAI API or served from the cache.
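For example, with the standard logging module imported earlier (how much detail gets logged at DEBUG level may vary across library versions):
# Enable verbose logging to observe cache hits vs. calls to the OpenAI API
logging.basicConfig(level=logging.DEBUG)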
We can now ask the same question, with the returned documents as context, to generate a response:
llm = LangChainLLMs(llm=OpenAI(temperature=0))
chain = load_qa_chain(llm, chain_type="stuff")
res = chain.run(input_documents=docs, question="What are the latest achievements?")
print(res)
At this point, the question and response pair is cached, and any new query that is considered similar enough will receive the same answer directly from the cache. Let’s confirm:
res = chain.run(input_documents=docs, question="Tell us about any recent advancements?")
print(res)
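One way to make the cache hit visible is to time a repeated call using the time module imported earlier (a quick sketch; exact timings will vary): a query served from the cache should come back noticeably faster than one that goes out to the OpenAI API.
start = time.time()
# Re-ask a question similar to one already cached; no OpenAI round trip should be needed
res = chain.run(input_documents=docs, question="Tell us about any recent advancements?")
print(f"Answered in {time.time() - start:.2f} seconds")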
Now ask a different question that should not have a cached answer, causing a request to be sent to the OpenAI API.
res = chain.run(input_documents=docs, question="Are we able to solve legal tasks?")
print(res)
And then a similar question, which should again be served from the cache
res = chain.run(input_documents=docs, question="Do we have the ability of legal interpretation and reasoning?")
print(res)
Try this on a different set of papers, or even on your own knowledge base.
That’s all folks
I hope you enjoyed this article. Feel free to leave a comment or reach out on Twitter @bachiirc.