
Introduction to RAG models

First, in case you don't know what RAG is, here is an unofficial explanation. Imagine you’re on a treasure hunt, but instead of a dusty old map, you’ve got a genius guide who knows every hidden corner. That’s RAG, short for Retrieval-Augmented Generation. It’s like having a super-smart friend who fetches the most relevant bits of knowledge from a massive library (the retrieval part) and then crafts a perfectly tailored response just for you (the generation part). So, if your brain is a bit like a rusty old filing cabinet, think of RAG as your personal, turbo-charged librarian who’s always got the answer before you can say "Google it!"

In this series of posts, we are going to explore LangChain tools and create RAG models for several applications. We are going to go through the little details that may or may not work in our cases and how to fix those teeny tiny configurations that may be necessary for our purposes. Without further ado...

Dotenv File

Before moving forward to the modules and the script, there is a high need for a .env file. Initially, you need to create a .env file in the folder where you are going to build this project, so that you can save your passwords there. This is not necessary from the functionality point of view, but it is from the security point of view. Think of a .env file as your project’s secret diary, where it whispers all its deepest, darkest secrets like passwords, API keys, and configuration settings. You need it because you don’t want these secrets plastered all over the code like graffiti. By keeping them in a .env file, you ensure they stay hidden and safe, only revealing themselves to those in the know—your code. So, a .env file is like having a secret stash of information that keeps your project running smoothly without spilling the beans to the world. No need for a fancy file name or anything, just .env.
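
To make this a bit more concrete, here is a rough sketch of what the .env file for this project could look like. The variable names match the ones we read later in this post with os.getenv(); the values are obviously placeholders, and depending on your Azure OpenAI setup you may need extra (or differently named) entries such as the API key or API version, so treat this purely as an illustration:

# .env -- placeholder values only, keep this file out of version control
AZURE_OPENAI_ENDPOINT=https://your-resource-name.openai.azure.com/
AZURE_OPENAI_API_KEY=your-secret-api-key
OPENAI_API_TYPE=azure
OPENAI_API_VERSION=2023-05-15
OPENAI_DEPLOYMENT_NAME_EMB=your-embeddings-deployment-name
OPENAI_MODEL_NAME_EMB=text-embedding-ada-002
LLM_35_Deployment=your-chat-deployment-name
LLM_35_Model=gpt-35-turbo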

Modules Required

The Python modules we are going to use for this introductory project are the following:

import os
from dotenv import load_dotenv
from langchain_community.document_loaders import PyPDFLoader, DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain import hub
from langchain_community.vectorstores import Chroma
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import AzureOpenAIEmbeddings
from langchain.chat_models import AzureChatOpenAI
  • os: Used for accessing environment variables with os.getenv().
  • dotenv.load_dotenv: Loads environment variables from the .env file into the process environment (we call it right after this list, as shown below).
  • langchain_community.document_loaders.PyPDFLoader: Used to load PDF documents.
  • langchain_community.document_loaders.DirectoryLoader: Used to load documents from a directory.
  • langchain.text_splitter.RecursiveCharacterTextSplitter: Used to split documents into chunks.
  • langchain_community.vectorstores.Chroma: Used to create a vectorstore from document embeddings.
  • langchain.chat_models.AzureChatOpenAI: Used to initialize the Azure OpenAI model for the QA chain.
  • langchain.hub: Used to pull a prompt from the hub.
  • langchain_core.output_parsers.StrOutputParser: Used to parse the output of the QA chain.
  • langchain_core.runnables.RunnablePassthrough: Used in the QA chain to pass through the question.
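
One small but important detail: importing load_dotenv is not enough on its own. You also have to call it once, early in the script, so that the variables from the .env file actually land in the process environment where os.getenv() can find them. A minimal way to do that (the assert is just an optional sanity check) is:

# Read the .env file and export its variables into the environment
load_dotenv()

# Optional sanity check: fail early if a required variable is missing
assert os.getenv('AZURE_OPENAI_ENDPOINT'), "AZURE_OPENAI_ENDPOINT is not set"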

In the folder where we will run this Python script, we need to create a subfolder (here, data/) where we are going to save our data, our files. Alternatively, we can load them from a different path. For the purposes of this series, I decided to use the first option. We use PyPDFLoader to load them, as in the following script:

loader = DirectoryLoader('data/', glob='*.pdf', loader_cls=PyPDFLoader)
documents = loader.load()

Here we see that the first argument, 'data/', is the relative path where the documents are placed. With the second argument, '*.pdf', we declare that we want to take into consideration all the files that end with the character sequence '.pdf' (the '*' stands for "anything"). Lastly, we define the loader class to use for loading the files, PyPDFLoader. Mind that there are other classes that can be used for different types of documents, like UnstructuredFileLoader, TextLoader, BSHTMLLoader, and CSVLoader; you can find out more about them in the LangChain document loaders documentation. Let your imagination go wild! There are also other types of loaders (which we are going to explore later on) that let you load information from webpages, YouTube and many many more. After that is completed, we use this class to load() everything that exists in the file path and save it to the object documents.
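
Just to illustrate how interchangeable these loader classes are, here is a small, purely hypothetical variation that loads plain .txt files with TextLoader instead of PDFs; the 'data/' folder and the file pattern are only assumptions for this sketch:

from langchain_community.document_loaders import TextLoader

# Same idea as above, but for plain-text files instead of PDFs
txt_loader = DirectoryLoader('data/', glob='*.txt', loader_cls=TextLoader)
txt_documents = txt_loader.load()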

Chunks and Overlap

Next in line of the things we need to accomplish is split the documents into chunks. Chunk?!?!? Imagine you’re trying to eat a massive pizza all by yourself. Chunk_size is like deciding how many slices you cut it into so you can manage each piece without choking. Chunk overlap, on the other hand, is making sure each slice has a bit of the previous one’s crust, so you don’t miss any of the delicious toppings in between. To optimize them, you balance the slice size (chunk_size) to be just right for easy munching, and the overlap so you get all the flavors without making it too repetitive. Get it right, and you’ll devour that pizza with maximum efficiency and satisfaction!

Here are some tips to help you determine the optimal chunk size if common chunking methods, such as fixed chunking, are not suitable for your use case:

  • Data Preprocessing: Before deciding on the best chunk size, you need to preprocess your data to ensure its quality. For instance, if your data is sourced from the web, you might need to remove HTML tags or other noise elements.
  • Chunk Sizes: After preprocessing, choose a range of potential chunk sizes to test. The selection should consider the nature of the content (e.g., short messages vs. lengthy documents), the embedding model you’ll use, and its token limits. Aim to find a balance between preserving context and maintaining accuracy. Start by exploring various chunk sizes, such as smaller chunks (e.g., 128 or 256 tokens) for capturing granular semantic information and larger chunks (e.g., 512 or 1024 tokens) for retaining more context.
  • Evaluating Chunk Sizes by Performance: To test different chunk sizes, use either multiple indices or a single index with multiple namespaces. Create embeddings for the selected chunk sizes using a representative dataset and save them in your index or indices. Then, run a series of queries to evaluate the quality and compare the performance of the various chunk sizes. This process is likely iterative, requiring you to test different chunk sizes against different queries until you identify the best-performing chunk size for your content and expected queries (a minimal sketch of such a loop follows this list).
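
We are not going to run this experiment here, but to give you an idea of what such an iteration could look like, here is a minimal sketch; the candidate sizes and the quick preview are purely illustrative, and a real evaluation would also embed each variant and measure retrieval quality against sample queries:

# Purely illustrative: build one splitter per candidate chunk size and
# compare how the same documents get divided.
candidate_sizes = [128, 256, 512, 1024]

for size in candidate_sizes:
    splitter = RecursiveCharacterTextSplitter(chunk_size=size, chunk_overlap=size // 16)
    chunks = splitter.split_documents(documents)
    print(f"chunk_size={size}: {len(chunks)} chunks, "
          f"first chunk starts with: {chunks[0].page_content[:80]!r}")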

For more information regarding chunk size and chunk overlap, you may refer to Guide to Chunk Size and Overlap by Kaggle, Chunk Sizes by LlamaIndex, or A Guide to Chunking Strategies for Retrieval Augmented Generation (RAG) by Zilliz. We are not going to go through the process of deciding the optimal chunk size, as this is out of this project's scope.

text_splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=32)
text_chunks = text_splitter.split_documents(documents)
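
If you want to quickly sanity-check the result, you can look at how many chunks were produced and what a single chunk carries; each chunk is a Document with page_content plus metadata such as the source file and page number (the exact numbers will of course depend on your own PDFs):

# Quick sanity check on the split
print(f"{len(documents)} pages were split into {len(text_chunks)} chunks")
print(text_chunks[0].page_content[:200])   # first 200 characters of the first chunk
print(text_chunks[0].metadata)             # e.g. {'source': 'data/my_file.pdf', 'page': 0}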

Now that we have collected our data and divided it into some well-structured, easily digestible chunks, we need to create an embeddings model. This model will assist us in creating embeddings out of the collected and stored documents that we are interested in querying for further information. As I have already mentioned above, I have decided to use Azure OpenAI. It is up to you which model you are going to use; there are many alternatives, both free and paid, to consider for each part of our project. Do not hesitate to ask me if you have any questions on how to change the code so that you can use a different kind of model!

embeddings = AzureOpenAIEmbeddings(
    deployment=os.getenv('OPENAI_DEPLOYMENT_NAME_EMB'),
    model=os.getenv('OPENAI_MODEL_NAME_EMB'),
    azure_endpoint=os.getenv('AZURE_OPENAI_ENDPOINT'),
    openai_api_type=os.getenv('OPENAI_API_TYPE'),
)
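
As an entirely optional sanity check, you can embed a short piece of text and look at the size of the resulting vector; the exact dimensionality depends on the embedding model behind your deployment (for example, 1536 for text-embedding-ada-002):

# Embed a single query string and inspect the resulting vector
sample_vector = embeddings.embed_query("What is Retrieval-Augmented Generation?")
print(len(sample_vector))   # vector dimensionality, e.g. 1536 for text-embedding-ada-002
print(sample_vector[:5])    # first few values of the embedding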

Since we have our embeddings, we are now ready to create our vectorstore, where we are going to save them. Imagine you’ve got a magical, super-organized pantry where every ingredient knows exactly where it belongs and can jump right into your hand when you need it. That’s a vectorstore! It’s a special kind of database where information is stored as vectors, or points in a high-dimensional space, making it super easy to find and retrieve. So, a vectorstore is like having a pantry where every spice, snack, and secret ingredient is neatly indexed and ready to leap out at your command, making your cooking—or in this case, data retrieval—fast and efficient! We have decided to use Chroma as our vectorstore service, but you are free to use any one you like. Some alternatives that you could consider are Pinecone, FAISS, LanceDB, and SKLearnVectorStore; you can find further information about them in the LangChain vectorstores documentation.

vectorstore = Chroma.from_documents(
    documents=text_chunks,
    embedding=embeddings,
    persist_directory="data/vectorstore"
)
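
Before wiring the vectorstore into a chain, you can query it directly with similarity_search to confirm that sensible chunks come back; the question below is just a placeholder for whatever your own documents are actually about:

# Directly query the vectorstore (the question is only a placeholder)
hits = vectorstore.similarity_search("What is this document about?", k=3)
for hit in hits:
    print(hit.metadata.get('source'), '->', hit.page_content[:100])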

Retriever

Now that we have our data vectorstore set up, we are ready to initialize our retriever. Picture the retriever in a RAG process as your ultra-savvy shopping buddy who knows exactly where everything is in the store. When you need something specific, the retriever zips around the aisles, grabbing the most relevant items off the shelves and bringing them back to you in record time. In the RAG (Retrieval-Augmented Generation) process, the retriever’s job is to fetch the most pertinent pieces of information from a vast database, so the generator can then whip up a perfectly informed response. It’s like having a shopping wizard who makes sure you always have the right ingredients for the perfect recipe!

retriever = vectorstore.as_retriever(search_kwargs={'k': 5})
prompt = hub.pull('rlm/rag-prompt')

The argument 'k': 5 makes sure that our retriever will bring back to the generator the 5 most similar items, not 6, not 4. The line hub.pull('rlm/rag-prompt') is used to pull a specific prompt template named 'rlm/rag-prompt' from a hub. You could define and use your own prompt for this part (which we shall experiment with in a later post). To find out more about predefined QA prompts, go to <a href='https://docs.smith.langchain.com/old/category/prompt-hub'>LangChain Hub</a>.
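
If you are curious about what exactly the retriever hands over to the generator, you can also call it on its own; depending on your LangChain version, either invoke() or get_relevant_documents() will do the trick, and once again the question is just a placeholder:

# Inspect the 5 chunks the retriever would pass to the LLM
retrieved_docs = retriever.invoke("What is this document about?")
for doc in retrieved_docs:
    print(doc.metadata.get('page'), doc.page_content[:80])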

Chain Creation

The next step in our RAG project is to define the LLM (Large Language Model) that will provide the answers for us. Imagine an LLM as a super-intelligent, chatty robot that’s read every book, article, and meme on the internet and somehow remembers them all. It stands for Large Language Model, and it’s like having a best friend who’s always ready to chat, offer advice, or spin a tale, because it’s been trained on a vast mountain of text data. This robot buddy can understand your questions and whip up responses that sound like they came straight from a well-read, eloquent author. So, if you ever need a conversation partner who’s a walking encyclopedia with a knack for witty comebacks, the LLM’s got your back!

llm = AzureChatOpenAI(
    deployment_name=os.getenv('LLM_35_Deployment'),
    model_name=os.getenv('LLM_35_Model'),
    azure_endpoint=os.getenv('AZURE_OPENAI_ENDPOINT'),
    temperature=0,
)

Moving forward, we need to define the RAG chain. Think of a RAG chain as a magical relay race where information is passed along to make the ultimate answer. Imagine a team of information specialists: the first runner grabs the relevant facts (that’s the retriever’s job), the second runner crafts those facts into a coherent, brilliant response (thanks to the generator), and the baton gets passed seamlessly from one to the other. This chain of handoffs ensures you get a well-rounded, perfectly polished answer every time. So, a RAG chain is like a finely-tuned relay team making sure no detail gets left behind and every answer is a winner! We need the RAG chain to clearly define the steps, from prompt to final response, that we want included. Later on, we are going to take a closer look at LangGraph, a newer LangChain tool that helps us define this process explicitly.

rag_chain = (
    {"context": lambda x: retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

You can clearly see the steps of our first, really simple RAG chain. Initially, the question is passed both to the retriever, which fetches the most relevant chunks as the context, and straight through unchanged as the question via RunnablePassthrough(). In the next step, this information (context and question) is inserted into the prompt so that the instructions are provided to our selected LLM. The model analyzes it and constructs an answer to our question. At the end, StrOutputParser() ensures that the answer is presented as a plain, human-readable string.
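
A common variation of this chain, which you may come across in other tutorials, first joins the retrieved chunks into a single clean block of text before they reach the prompt. Here is a hedged sketch of that alternative; note that the format_docs helper is something we define ourselves, not part of LangChain:

# Optional variant: merge the retrieved chunks into one context string
def format_docs(docs):
    # 'docs' is the list of Documents returned by the retriever
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain_formatted = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)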

Now you are ready to ask your own questions to the model you just created to find out more about your documents.

question = "My custom question"
rag_chain.invoke(question)
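
Since StrOutputParser() already turns the model's reply into a plain string, invoke() simply returns the answer, which you can print; if you prefer to watch the answer appear piece by piece, the same chain also supports stream():

# Print the full answer at once ...
answer = rag_chain.invoke(question)
print(answer)

# ... or stream it piece by piece
for token in rag_chain.stream(question):
    print(token, end="", flush=True)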

Well, I hope you enjoyed this as much as I did and learned something from it! Hope to see you again in the next post of this series, where we are going to talk about query transformation and how we can use LLMs to rephrase our question in a different, "better" way.

Be safe, code safer!

This whole series is inspired by <a href='https://www.sakunaharinda.xyz/ragatouille-book/intro.html#'>RAGatouille</a>. Throughout this series, as this is the main goal of this blog, I aim to provide as simple as possible explanations of the modules and functions used, with the addition of my personal touch wherever I find it necessary.