Evaluate a simple RAG system
In this tutorial, we will build a simple evaluation pipeline for a RAG (Retrieval-Augmented Generation) system. By the end, you will know how to evaluate and iterate on a RAG system using evaluation-driven development.
flowchart LR
A["Query<br/>'What is Ragas 0.3?'"] --> B[Retrieval System]
C[Document Corpus<br/> Ragas 0.3 Docs📄] --> B
B --> D[LLM + Prompt]
A --> D
D --> E[Final Answer]
We will start by writing a simple RAG system that retrieves relevant documents from a corpus and generates an answer using an LLM.
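Below is a minimal sketch of what such a system could look like. Everything here is illustrative rather than the exact code from the Ragas examples: the SimpleRAGClient class, the in-memory DOCS corpus, the keyword-overlap retriever, and the gpt-4o-mini model name are all assumptions. The only contract it respects is the one the experiment loop below relies on: rag_client.query() returns a dict with an "answer" and a "logs" entry.

from openai import OpenAI  # assumes the openai package is installed and OPENAI_API_KEY is set

client = OpenAI()

# A toy in-memory corpus standing in for the Ragas 0.3 docs.
DOCS = [
    "Ragas 0.3 is a library for evaluating LLM applications.",
    "Ragas can be installed from PyPI with pip or from source.",
    "Ragas is organised around experiments, datasets and metrics.",
]

class SimpleRAGClient:
    """Illustrative RAG client: keyword-overlap retrieval plus one LLM call."""

    def retrieve(self, query: str, k: int = 2) -> list[str]:
        # Rank documents by how many query words they share (a stand-in for a real retriever).
        words = set(query.lower().split())
        return sorted(DOCS, key=lambda d: -len(words & set(d.lower().split())))[:k]

    def query(self, query: str) -> dict:
        context = "\n".join(self.retrieve(query))
        completion = client.chat.completions.create(
            model="gpt-4o-mini",  # any chat model works for this sketch
            messages=[
                {"role": "system", "content": f"Answer using only this context:\n{context}"},
                {"role": "user", "content": query},
            ],
        )
        # Return the shape the experiment loop expects: an answer plus optional logs.
        return {"answer": completion.choices[0].message.content, "logs": context}

rag_client = SimpleRAGClient()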
Next, we will write down a few sample queries along with grading notes that describe the expected outputs, and save them to a CSV file.
import os
import pandas as pd

# A handful of test queries with grading notes describing what a good answer must cover.
samples = [
    {"query": "What is Ragas 0.3?", "grading_notes": "- Ragas 0.3 is a library for evaluating LLM applications."},
    {"query": "How to install Ragas?", "grading_notes": "- install from source - install from pip using ragas[examples]"},
    {"query": "What are the main features of Ragas?", "grading_notes": "organised around - experiments - datasets - metrics."},
]

os.makedirs("datasets", exist_ok=True)  # make sure the target folder exists
pd.DataFrame(samples).to_csv("datasets/test_dataset.csv", index=False)
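Each row of this CSV will later be handed to the experiment loop as a dict with "query" and "grading_notes" keys. As a quick illustration of that shape (the full tutorial may load the dataset through Ragas' own dataset utilities; reading it back with pandas here is just an assumption for clarity):

import pandas as pd

# Read the test set back as a list of row dicts, the shape run_experiment() receives.
rows = pd.read_csv("datasets/test_dataset.csv").to_dict(orient="records")
print(rows[0]["query"], "->", rows[0]["grading_notes"])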
To evaluate the performance of our RAG system, we will define an LLM-based metric that compares the output of our RAG system against the grading notes and returns 'pass' or 'fail'.
from ragas.metrics import DiscreteMetric

# An LLM-judged metric: the model reads the response and the grading notes and returns pass/fail.
my_metric = DiscreteMetric(
    name="correctness",
    prompt="Check if the response contains points mentioned from the grading notes and return 'pass' or 'fail'.\nResponse: {response} Grading Notes: {grading_notes}",
    allowed_values=["pass", "fail"],
)
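Before wiring the metric into a full experiment, it can be useful to score a single hand-written response. This is only a sanity-check sketch: it assumes llm is the evaluator LLM you have already configured for Ragas (its construction is not shown in this section), and it mirrors the my_metric.score(...) call used in the experiment loop below.

# Sanity check, assuming `llm` is your configured evaluator LLM (not shown here).
score = my_metric.score(
    llm=llm,
    response="Ragas 0.3 is a library for evaluating LLM applications.",
    grading_notes="- Ragas 0.3 is a library for evaluating LLM applications.",
)
print(score.value)  # expected: "pass"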
Next, we will write the experiment loop that runs our RAG system on the test dataset, evaluates each response with the metric, and stores the results in a CSV file.
from ragas import experiment  # decorator that turns this function into a runnable experiment

@experiment()
async def run_experiment(row):
    # Run the RAG pipeline on one test query.
    response = rag_client.query(row["query"])

    # Judge the answer against the grading notes with the LLM-based metric.
    score = my_metric.score(
        llm=llm,
        response=response.get("answer", " "),
        grading_notes=row["grading_notes"],
    )

    # One row of the experiment results CSV written under experiments/.
    experiment_view = {
        **row,
        "response": response.get("answer", ""),
        "score": score.value,
        "log_file": response.get("logs", " "),
    }
    return experiment_view
Now, whenever you make a change to your RAG pipeline, you can rerun the experiment and see how it affects your RAG system's performance.
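For example, you could compare the pass rates of two experiment runs with pandas. This is an illustrative sketch: the file names below are hypothetical placeholders for whichever result CSVs your own runs wrote under experiments/.

import pandas as pd

def pass_rate(df: pd.DataFrame) -> float:
    # Fraction of rows the LLM judge marked as "pass".
    return (df["score"] == "pass").mean()

# Hypothetical file names; substitute the CSVs your runs actually produced.
before = pd.read_csv("experiments/baseline.csv")
after = pd.read_csv("experiments/improved_prompt.csv")

print(f"baseline pass rate: {pass_rate(before):.0%}")
print(f"new run pass rate:  {pass_rate(after):.0%}")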
Running the example end to end
- Set up your OpenAI API key (a quick check is sketched after this list)
- Run the evaluation
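Before running, you can verify that the key is visible to your Python process. This is just an illustrative pre-flight check, not part of the Ragas API:

import os

# Pre-flight check: both the RAG client and the evaluator LLM need OPENAI_API_KEY.
if not os.environ.get("OPENAI_API_KEY"):
    raise SystemExit("OPENAI_API_KEY is not set; export it before running the evaluation.")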
Voila! You have successfully run your first evaluation using Ragas. You can now inspect the results by opening the experiments/experiment_name.csv file.
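If you prefer to inspect the results in code rather than in a spreadsheet, a short pandas summary works too (experiment_name.csv is the same placeholder as above; use the file name your run produced):

import pandas as pd

# Placeholder file name from above; replace it with the CSV your run wrote.
results = pd.read_csv("experiments/experiment_name.csv")
print(results[["query", "score", "response"]])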