Improving Retrieval Augmented Generation: A Step-by-Step Evaluation of RAG Pipelines

TLDR: If you are just looking for the code example without any explanations, see the "TLDR: Code to validate RAG pipelines using ragas" section at the end of this blog post.

In the fast-moving worlds of AI and LLMs, adding context - and therefore knowledge - to AI models has become of utmost importance. Among the many innovations driving progress, Retrieval Augmented Generation (RAG) stands out as a transformative approach. RAG pipelines enhance the generation capabilities of traditional language models by integrating external knowledge sources, allowing for the production of more accurate, informative, and contextually rich responses. However, as with any cutting-edge technology, optimizing RAG pipelines presents a unique set of challenges. These range from ensuring the relevance of retrieved information to balancing computational efficiency with output quality.

Enter ragas, a specialized tool designed to evaluate and subsequently improve the efficacy of RAG pipelines. Ragas offers a structured framework for assessing RAG pipelines across multiple dimensions, including accuracy, relevance, and efficiency. This tool is not just a testament to the complexity of RAG systems but also an acknowledgment of the critical need for standardized evaluation metrics in AI research. By providing detailed insights into model performance, ragas empowers researchers and developers to refine their RAG implementations, pushing the boundaries of what AI can achieve.

The importance of RAG in today's AI landscape cannot be overstated. As we demand more sophisticated interactions with AI systems, the underlying models must not only understand and generate natural language but also access and incorporate a vast expanse of external knowledge. This is where RAG shines, blending the generative prowess of language models with the depth and diversity of external databases. However, the true potential of RAG can only be unlocked through meticulous evaluation and continuous refinement - a process that ragas is designed to support. For a more detailed introduction to RAG and why it's such an important area, see the comprehensive overview in our previous blog post.

In this blog, we delve deep into the mechanics of RAG strategies, the pivotal role of ragas in their evaluation, and how leveraging this tool can lead to significant improvements in RAG systems. We'll help you set up ragas for the first time, guide you through its evaluation process, and lay out the best practices for interpreting and acting on its results.

High-Level Overview of ragas

Ragas provides several modules which come in handy for evaluating RAG. The two main ones - and the ones we are using in this guide - are:

TestsetGenerator: This module is responsible for generating test sets for evaluating RAG pipelines. It provides a variety of test generation strategies, including simple, reasoning, and multi-context strategies.
evaluate: This module is responsible for evaluating RAG pipelines using the generated test sets. It provides a variety of evaluation metrics, including answer relevancy, faithfulness, context recall, and context precision.

The TestsetGenerator loads a bunch of documents or text chunks and then uses an LLM to generate potential questions based on these documents as well as answers for these questions - based on the provided documents. The generated answers are then used as "ground truth" to evaluate the RAG pipeline in a subsequent step.

The evaluate module then uses the questions of the generated test set to evaluate the RAG pipeline. It again uses an LLM to validate the answers given by your RAG pipeline based on the questions provided in the test set. The LLM is also asked to validate how well the provided contexts fit the questions. And finally, the LLM answer is compared to the ground truth answer (which is part of the test set) to evaluate the LLM itself.

Step by Step Guide for setting up and using ragas to evaluate RAG Pipelines

Without further ado, let's run through the steps necessary for using ragas to evaluate a RAG pipeline.

Install the required dependencies

pip install openai langchain ragas==0.1.0rc1

Note: It's important to use ragas version 0.1 or greater, as that release brought a major refactoring.

Import the necessary modules

import os
from langchain_community.document_loaders import DirectoryLoader
from langchain_community.chat_models.openai import ChatOpenAI

from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context
from ragas.metrics import (
    answer_relevancy,
    faithfulness,
    context_recall,
    context_precision,
)
import pandas as pd
from ragas import evaluate
from datasets import Dataset

Set up the LLM (add your OpenAI key and define which LLM API to use). We strongly recommend using GPT-4, as its more consistent output helps in creating better, more comparable results.

os.environ["OPENAI_API_KEY"] = "your-openai-api-key"
gpt4 = ChatOpenAI(model_name="gpt-4-turbo-preview")

Generating Test Sets

Generating the test set for RAG evaluation is quite an annoying task, and it is the task which prevents most people from even starting with the evaluation process. Ragas provides a handy test set generator which reduces the friction there.

The test set generator basically does the following:

Loads a set of documents and parses them into plain text
Uses an LLM to create questions and answers based on these documents
Creates a result set with columns question, answer, and context (with context being the text chunk which was used for generating the questions and answers)

We will use the question and answer columns as validation questions and answers for our later evaluation step - with answer being the ground truth.

Note: Instead of using the ragas test set generation, you could also manually create a test set. Just make sure to have a dataframe with the columns question and answer.

Load the documents. Ragas needs a set of either LangChain or LlamaIndex documents, so you can use any of the LangChain or LlamaIndex document loaders. Make sure that you load the same documents (or a subset of them) that are available in the RAG pipeline you want to validate!

loader = DirectoryLoader("public/blog/validate-rag/demo-docs")
documents = loader.load()

Note: The DirectoryLoader is just one of the many document loaders.

Generate the test set using the TestsetGenerator module. Using the distributions parameter, you can define how many questions of which type you want to generate.

simple: Simple questions which can be answered directly from the documents.
reasoning: Questions which require reasoning based on the documents and are more complex to derive from the provided contexts.
multi_context: Questions which are generated from multiple contexts from the documents.

generator = TestsetGenerator.with_openai()
testset = generator.generate_with_langchain_docs(
    documents,
    test_size=100,
    distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25},
)
test_df = testset.to_pandas()
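To get a feeling for what the distributions parameter means for the size of each question group, the following minimal sketch computes the approximate number of questions per type. It uses plain strings as stand-ins for the ragas evolution objects and assumes the shares are meant to sum to 1. How ragas itself validates or rounds these values is not shown here and may differ.

```python
# Stand-in keys: in the guide these are the ragas evolution objects
# (simple, reasoning, multi_context); plain strings keep the sketch runnable.
distributions = {"simple": 0.5, "reasoning": 0.25, "multi_context": 0.25}
test_size = 100


def expected_counts(distributions, test_size):
    """Return the approximate number of questions per question type."""
    total = sum(distributions.values())
    # Assumption: the shares are supposed to cover the whole test set.
    if abs(total - 1.0) > 1e-9:
        raise ValueError(f"distribution shares sum to {total}, expected 1.0")
    return {name: round(share * test_size) for name, share in distributions.items()}


counts = expected_counts(distributions, test_size)
print(counts)  # {'simple': 50, 'reasoning': 25, 'multi_context': 25}
```

With a test_size of 100, half of the questions are simple ones, and a quarter each are reasoning and multi-context questions.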

Note: It's best to save the test set, so you can re-use it for later evaluations. Do not use different test questions for different RAG evaluation runs - always use the same ones.
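One possible way to persist and reload a test set is sketched below, using only the Python standard library. The records and the file name are hypothetical. In practice, the records could come from the generated dataframe (for example via test_df.to_dict("records")), assuming all columns are JSON-serializable.

```python
import json
import os
import tempfile

# Hypothetical records for illustration; replace them with your real test set.
testset_records = [
    {"question": "What is RAG?", "answer": "Retrieval Augmented Generation."},
    {"question": "What does ragas do?", "answer": "It evaluates RAG pipelines."},
]


def save_testset(records, path):
    """Write the test set records to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)


def load_testset(path):
    """Read the test set records back from a JSON file."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Hypothetical location; pick a path that is versioned with your project.
path = os.path.join(tempfile.gettempdir(), "ragas_testset.json")
save_testset(testset_records, path)
reloaded = load_testset(path)
```

Loading the same file for every evaluation run keeps the questions and ground truths identical, which is what makes the results comparable.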

Now it's time to feed the generated questions to the RAG pipeline you want to test. Simply take the question column of the test_df we just created and use your RAG pipeline to find answers for these questions. For each question, store the answer, the contexts, and the question itself, ideally in a dataframe. By the end, you should have a dataframe with the following columns:

question: string; The question which was asked to your RAG pipeline
answer: string; The answer which your RAG pipeline provided
contexts: list; A list of contexts which were used to find the answer
ground_truths: list; A list of ground truths for the question. Ground truths are the known correct answers for the question. You can use the answer which was generated by the TestsetGenerator as ground truth. (Because the TestsetGenerator knew what the answer was, as it created the questions, based on provided contexts - which naturally contained the answer). Note that this needs to be a list, as the evaluate module expects a list of ground truths for each question.
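The columns described above could be assembled as in the following minimal sketch. The function my_rag_pipeline is a hypothetical placeholder for your own pipeline, and it is an assumption that your pipeline can return both the answer and the retrieved contexts for a question.

```python
def my_rag_pipeline(question):
    """Hypothetical stand-in for a real RAG pipeline.

    Returns the generated answer and the list of retrieved context chunks.
    """
    contexts = [f"Placeholder context retrieved for: {question}"]
    answer = f"Placeholder answer to: {question}"
    return answer, contexts


def build_evaluation_rows(test_rows, pipeline):
    """Run every test question through the pipeline and collect the results."""
    rows = []
    for row in test_rows:
        answer, contexts = pipeline(row["question"])
        rows.append(
            {
                "question": row["question"],
                "answer": answer,
                "contexts": list(contexts),
                # The generated test set answer serves as ground truth,
                # wrapped in a list as described above.
                "ground_truths": [row["answer"]],
            }
        )
    return rows


# Hypothetical test set row for illustration.
test_rows = [{"question": "What is RAG?", "answer": "Retrieval Augmented Generation."}]
rows = build_evaluation_rows(test_rows, my_rag_pipeline)
```

The resulting list of dictionaries can be turned into a dataframe with pd.DataFrame(rows).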

For this guide, we are going to cheat a little and re-use our test set as our RAG output (because setting up a RAG pipeline is not part of this guide). Note that in a real evaluation, instead of running the code below, you would invoke your RAG pipeline and construct a dataframe as discussed above.

test_questions = test_df["question"].values.tolist()
test_answers = test_df["answer"].values.tolist()

df = pd.DataFrame(
    {
        "question": test_questions,
        "answer": test_answers,
        "contexts": test_df["context"].values.tolist(),
        "ground_truths": test_df["answer"].values.tolist(),
    }
)

df["contexts"] = df["contexts"].apply(lambda x: [x] if isinstance(x, str) else [])
df["ground_truths"] = df["ground_truths"].apply(
    lambda x: [x] if isinstance(x, str) else []
)

Evaluating the RAG Pipeline

Now that we have our test set, as well as our evaluation set, we can use the evaluate module to evaluate our RAG pipeline. The evaluate module provides a variety of evaluation metrics, including answer relevancy, faithfulness, context recall, and context precision.

answer_relevancy: Measures the relevancy of the answer to the question.
faithfulness: Measures the faithfulness of the answer to the provided context.
context_recall: Measures the ability of the retriever to retrieve all the necessary information needed to answer the question.
context_precision: Measures how relevant the retrieved contexts are, meaning how much of what the retriever returned is actually needed to answer the question.
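To build an intuition for the difference between the two retriever metrics, here is a deliberately simplified, set-based sketch. It is not how ragas computes its scores: ragas relies on LLM judgments, and its exact formulas differ. The chunk names and the relevance labels are made up.

```python
def toy_context_precision(retrieved, relevant):
    """Share of the retrieved chunks that are actually relevant."""
    if not retrieved:
        return 0.0
    hits = sum(1 for chunk in retrieved if chunk in relevant)
    return hits / len(retrieved)


def toy_context_recall(retrieved, relevant):
    """Share of the relevant chunks that were actually retrieved."""
    if not relevant:
        return 0.0
    hits = sum(1 for chunk in relevant if chunk in retrieved)
    return hits / len(relevant)


# Made-up example: the retriever returned four chunks,
# while three chunks would have been needed to answer the question.
retrieved = ["chunk_a", "chunk_b", "chunk_c", "chunk_d"]
relevant = {"chunk_a", "chunk_c", "chunk_e"}

precision = toy_context_precision(retrieved, relevant)  # 2 of 4 retrieved are relevant
recall = toy_context_recall(retrieved, relevant)  # 2 of 3 relevant were retrieved
```

A retriever that returns many chunks tends to score well on recall but poorly on precision, which is why it makes sense to look at both metrics together.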

Running the validation is - now that we have prepared everything - quite simple. Call the evaluate method and provide the output of your RAG pipeline, as well as the ground truths, as parameters.

result = evaluate(
    Dataset.from_pandas(df),
    metrics=[
        context_precision,
        faithfulness,
        answer_relevancy,
        context_recall,
    ],
    llm=gpt4,
)

The result will be a dictionary with the above-mentioned evaluation metrics and their values. You can then use these values to compare different RAG pipelines and see how they perform.
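Comparing two pipelines then boils down to comparing two such dictionaries metric by metric. The sketch below shows one way to do that. The scores are made up for illustration and are not real evaluation results, and the helper compare_results is a hypothetical name, not part of ragas.

```python
def compare_results(baseline, candidate):
    """Return per-metric deltas (candidate minus baseline) for shared metrics."""
    return {
        metric: round(candidate[metric] - baseline[metric], 4)
        for metric in baseline
        if metric in candidate
    }


# Made-up scores for illustration only.
baseline = {
    "context_precision": 0.80,
    "faithfulness": 0.90,
    "answer_relevancy": 0.85,
    "context_recall": 0.70,
}
candidate = {
    "context_precision": 0.85,
    "faithfulness": 0.88,
    "answer_relevancy": 0.85,
    "context_recall": 0.78,
}

deltas = compare_results(baseline, candidate)
# Metrics where the candidate pipeline scored lower than the baseline.
regressions = [metric for metric, delta in deltas.items() if delta < 0]

for metric, delta in deltas.items():
    print(f"{metric}: {delta:+.2f}")
```

A positive delta means the candidate pipeline scored higher on that metric. In this made-up example, only faithfulness regressed.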

Bringing your own prompts

Ragas comes with a nice default set of LLM prompts used for both test set generation and evaluation. However, as you get more experienced with your RAG pipeline, and therefore also with how to evaluate it, you might want to bring your own prompts. Ragas provides an interface for that. As this interface seems to be changing right as I'm writing this, I'm linking to the appropriate place in ragas' documentation: Bringing your own prompt

Best practices

Use a Good Language Model (LLM)

The quality of the LLM used for test set generation as well as evaluation is directly related to the quality of the evaluation. Make sure to use a good LLM, like GPT-4. While this incurs additional costs, it's worth it even in the short run.

Standardized Question Set

Use a consistent set of questions and ground truths for your evaluation. This standardization helps in accurately measuring the improvement or regression in model performance over time.

Relevance and Accuracy

Ensure that the questions and ground truths are carefully curated to represent the kind of queries your RAG model will encounter in deployment. This enhances the validity of the evaluation.

Iterative Validation and Averaging

Validation should not be a one-off process, as LLMs are by nature probabilistic. Each run will therefore yield at least somewhat different results, even if you keep the RAG pipeline itself constant. Run the validation multiple times to account for variability in performance. Specifically, take the average of the worst three results from your validation runs. This conservative approach ensures that your model's performance is robust and reliable under less-than-ideal conditions.

Continuous Integration and Regular Validation

Iterative Improvement: After each iteration on your RAG pipeline, re-run the validation to assess the impact of changes. This continuous feedback loop is essential for incremental improvement. Automate the validation process by integrating ragas into your Continuous Integration (CI) pipeline. This ensures that any modifications to the RAG pipeline undergo validation automatically, facilitating a more efficient development workflow.

Development of Custom Validation Prompts

Over time, develop and refine your own set of validation prompts to use with ragas. These custom prompts should be reflective of the unique challenges and nuances of your application domain. Regularly update and expand your validation prompts to cover new topics, formats, and query types. This ensures that the RAG model remains effective as the nature of the data or the application evolves.
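
The "average of the worst three results" rule from the Iterative Validation and Averaging practice, combined with an automated check for a CI pipeline, could be sketched as follows. The scores and the threshold are made up, and the function names are hypothetical helpers, not part of ragas.

```python
def worst_three_average(scores):
    """Average of the three lowest scores (or of all scores, if fewer than three)."""
    if not scores:
        raise ValueError("need at least one score")
    worst = sorted(scores)[:3]
    return sum(worst) / len(worst)


def passes_gate(scores, threshold):
    """Return True if the conservative score reaches the required threshold."""
    return worst_three_average(scores) >= threshold


# Made-up faithfulness scores from five hypothetical validation runs.
faithfulness_runs = [0.91, 0.87, 0.93, 0.89, 0.90]

# Uses the three lowest runs: 0.87, 0.89 and 0.90.
conservative_score = worst_three_average(faithfulness_runs)
print(f"conservative faithfulness: {conservative_score:.3f}")
print("gate passed:", passes_gate(faithfulness_runs, threshold=0.85))
```

In a CI job, a failed gate would typically be turned into a non-zero exit code, so that a change which degrades the pipeline does not get merged unnoticed.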

Following these best practices can significantly enhance the validation process for RAG pipelines, leading to more accurate, reliable, and effective natural language generation and retrieval systems. By focusing on the quality of the evaluation LLM, maintaining consistency in evaluation, iterating based on validation results, and customizing validation prompts, developers can ensure their RAG systems are well-equipped to meet the demands of diverse and complex real-world applications.

Conclusion

By offering a structured approach to assessing accuracy, relevance, and efficiency, ragas plays a crucial role in the development of RAG technologies, as it helps developers build high-quality AI systems and keep them that way.

The insights gained from utilizing ragas can drive significant advancements in the development of such systems, ensuring they remain at the cutting edge of what's possible - and, just as importantly, helping to prevent unnecessary mistakes and performance degradations. As we continue to explore the capabilities of RAG and its impact on AI research and development, the importance of tools like ragas cannot be overstated. They not only facilitate a deeper understanding of the cogs and screws involved in RAG systems but also highlight the necessity for ongoing innovation and standardization within the field.

Looking ahead, the future of AI and LLMs is set for further transformation as we refine our approaches to integrating external knowledge sources. The journey of discovery and enhancement in RAG technology is far from over, with each iteration bringing us closer to creating AI systems that truly understand and interact with the world in a meaningful way. Through the diligent application of tools like ragas and a commitment to excellence in research and development, the possibilities for what AI can achieve are boundless. As we continue to push the boundaries of AI capabilities, the role of RAG and the continuous improvement it necessitates will undoubtedly remain at the forefront of this exciting and ever-evolving field.

TLDR: Code to validate RAG pipelines using ragas

# ! pip install openai langchain ragas==0.1.0rc1
import os
from langchain_community.document_loaders import DirectoryLoader
from langchain_community.chat_models.openai import ChatOpenAI

from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context
from ragas.metrics import (
    answer_relevancy,
    faithfulness,
    context_recall,
    context_precision,
)
import pandas as pd
from ragas import evaluate
from datasets import Dataset

os.environ["OPENAI_API_KEY"] = "sk-xxxxxxxxxxxxxxxxxxxxx"

gpt4 = ChatOpenAI(model_name="gpt-4-turbo-preview")

loader = DirectoryLoader("public/blog/validate-rag/demo-docs")
documents = loader.load()

for document in documents:
    document.metadata["file_name"] = document.metadata["source"]

generator = TestsetGenerator.with_openai()
testset = generator.generate_with_langchain_docs(
    documents,
    test_size=10,
    distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25},
)

test_df = testset.to_pandas()

# Add your actual RAG pipeline retrieval here.
# We use the created test set as our 'RAG pipeline' - just for demo purposes.
# In the dataframe below, change answer and contexts column to the output of your
# RAG pipeline.
test_questions = test_df["question"].values.tolist()
test_answers = test_df["answer"].values.tolist()

df = pd.DataFrame(
    {
        "question": test_questions,
        "answer": test_answers,
        "contexts": test_df["context"].values.tolist(),
        "ground_truths": test_df["answer"].values.tolist(),
    }
)

df["contexts"] = df["contexts"].apply(lambda x: [x] if isinstance(x, str) else [])
df["ground_truths"] = df["ground_truths"].apply(
    lambda x: [x] if isinstance(x, str) else []
)

result = evaluate(
    Dataset.from_pandas(df),
    metrics=[
        context_precision,
        faithfulness,
        answer_relevancy,
        context_recall,
    ],
    llm=gpt4,
)
result
