Gemma by Google is a family of lightweight, state-of-the-art open models built from the same research and technology used to create the Gemini models.

The following is an example of how to build a Retrieval-Augmented Generation (RAG) pipeline using Google Gemma, Hugging Face's transformers library, and a vector database like Chroma.

Setup

First, install the needed libraries:

pip install -q -U "transformers==4.38.0"
pip install langchain langchain-community chromadb pypdf beautifulsoup4
pip install sentence-transformers

Then import all the required modules:

import pandas as pd
from bs4 import BeautifulSoup as bs4

import torch
from transformers import AutoTokenizer, pipeline

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.docstore.document import Document
from langchain_community.embeddings.sentence_transformer import (
    SentenceTransformerEmbeddings,
)
from langchain_community.vectorstores import Chroma
from langchain_community.document_loaders import WebBaseLoader, PyPDFDirectoryLoader
from langchain_community.llms import HuggingFacePipeline
from langchain.chains import RetrievalQA

Model Preparation

Then we need to download Gemma 2B, but before that we have to accept the license and get access to the model. To do so, head to https://huggingface.co/google/gemma-2b-it and accept the terms.

Also, get a Hugging Face access token so you can download the model:

HF_TOKEN = ''
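
If you prefer not to pass the token explicitly to every call, you can also register it once per session via huggingface_hub (which is installed as a dependency of transformers):

from huggingface_hub import login

# registers the token so subsequent downloads from the Hub are authenticated
login(token=HF_TOKEN)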

Now we can create a text generation pipeline as follows:

model = "google/gemma-2b-it"

tokenizer = AutoTokenizer.from_pretrained(model, token=HF_TOKEN)
pipeline = pipeline(
    "text-generation",
    model=model,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device="cuda",
    max_new_tokens=512
)

We can test that everything is working by querying Gemma as follows:

messages = [
    {"role": "user", "content": "Provide a recipe of a popular meal in Algeria"},
]
prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = pipe(
    prompt,
    max_new_tokens=256,
    add_special_tokens=True,
    do_sample=True,
    temperature=0.7,
    top_k=50,
    top_p=0.95
)
print(outputs[0]["generated_text"][len(prompt):])

Knowledge Preparation

Document collection

We need to build a document collection for the RAG pipeline to query. This collection can contain any type of document, such as articles, reports, or code. LangChain has many built-in document loaders that make it easy to build such a collection. For instance, we can create a collection from web articles using WebBaseLoader as follows:

urls = [
  'https://www.kaggle.com/docs/competitions',
  'https://www.kaggle.com/docs/datasets',
  'https://www.kaggle.com/docs/notebooks',
  'https://www.kaggle.com/docs/api',
  'https://www.kaggle.com/docs/efficient-gpu-usage',
  'https://www.kaggle.com/docs/tpu',
  'https://www.kaggle.com/docs/models',
  'https://www.kaggle.com/docs/competitions-setup',
  'https://www.kaggle.com/docs/organizations',
]

loader = WebBaseLoader(urls)
docs = loader.load()

We can also build a collection from PDF files stored locally using PyPDFDirectoryLoader as follows:

DIR = './data'
loader = PyPDFDirectoryLoader(DIR)
docs = loader.load()

Vectorization

Next, we need to convert every document in our collection into a numerical representation called a “vector” using an embedding model. To make those vectors searchable, we store them in a vector database like Chroma as follows:

embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

db = Chroma.from_documents(docs, embedding_function, persist_directory="./chroma_db")
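
Since we pass persist_directory, the index is written to disk, so in a later session it can be reloaded without re-embedding the documents, for example:

# reload the persisted index instead of rebuilding it
db = Chroma(persist_directory="./chroma_db", embedding_function=embedding_function)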

If the original documents are large, we may want to first split them into smaller chunks so that later it is easier to retrieve only the relevant pieces. For this we can use RecursiveCharacterTextSplitter (or one of LangChain's other splitters) as follows:

character_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", ". ", " ", ""],
    chunk_size=1000,
    chunk_overlap=0
)
character_split_docs = character_splitter.split_documents(docs)

# embed the chunks
db = Chroma.from_documents(character_split_docs, embedding_function, persist_directory="./chroma_db")
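
As a quick sanity check, we can compare the number of chunks to the number of original documents:

# the splitter typically produces several chunks per document
print(f'Split {len(docs)} documents into {len(character_split_docs)} chunks')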

Information Retrieval

Once the documents and their embeddings are stored in a specialized database like Chroma, we can use similarity search: the query vector is compared against all document vectors in the database, and we retrieve the documents with the closest vector representations, since they should contain the information most relevant to the query.

We can perform this easily with db.similarity_search:

# query it
query = "How linear regression was used to win a Kaggle competition?"
match_docs = db.similarity_search(query)

# print results
print(f'Number of returned articles: {len(match_docs)}')

And examine one of the returned articles:

one_doc = match_docs[0]
print(f'{one_doc.metadata["title"]} / {one_doc.metadata["source"]}')
print(one_doc.page_content[:500])
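
If you also want to see how close each match is, Chroma exposes similarity_search_with_score, which returns (document, score) pairs; with Chroma's default settings the score is a distance, so lower values indicate closer matches:

# retrieve the top matches together with their distance scores
scored_docs = db.similarity_search_with_score(query, k=3)
for doc, score in scored_docs:
    print(f'{score:.4f}  {doc.metadata["source"]}')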

Knowledge Integration

The retrieved documents are fed to Gemma as additional information and constitute the context for the original query. Gemma then generates a response that answers the question based on the information provided in that context.
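
To make this concrete, here is an illustrative sketch of that idea using the documents retrieved above and the pipeline created earlier (it is not the exact prompt that RetrievalQA builds internally):

# "stuff" strategy, sketched by hand: concatenate the retrieved chunks
# into the prompt and let Gemma answer from that context
context = "\n\n".join(doc.page_content for doc in match_docs)
messages = [
    {"role": "user",
     "content": f"Answer the question using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"},
]
prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = pipe(prompt, max_new_tokens=256)
print(outputs[0]["generated_text"][len(prompt):])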

At this stage, we have everything we need to build our RAG pipeline by creating a RetrievalQA chain based on the Gemma LLM and the Chroma vector store:

retriever = db.as_retriever()
gemma_llm = HuggingFacePipeline(
    pipeline=pipe,
    model_kwargs={"temperature": 0.7},
)

qa = RetrievalQA.from_chain_type(
    llm=gemma_llm,
    chain_type="stuff",
    retriever=retriever
)

Then, we can query it as follows:

query = "How linear regression was used to win a Kaggle competition"
qa.invoke(query)
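
The call returns a dictionary containing the query and the generated answer under the "result" key. If you also want to see which chunks were passed to Gemma, you can rebuild the chain with return_source_documents=True, for example:

# same chain, but also exposing the retrieved chunks alongside the answer
qa_with_sources = RetrievalQA.from_chain_type(
    llm=gemma_llm,
    chain_type="stuff",
    retriever=retriever,
    return_source_documents=True,
)

result = qa_with_sources.invoke(query)
print(result["result"])
# list the source pages that grounded the answer
print([doc.metadata["source"] for doc in result["source_documents"]])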