Aligning LLM Evaluators with Human Judgment
This tutorial is part of a three-part series on using Vertex AI models with Ragas. We recommend working through Getting Started: Ragas with Vertex AI first, but you can follow this tutorial on its own. You can navigate to the Model Comparison tutorial using the link.
Overview
In this tutorial, you will learn how to train and align your own custom LLM-based metric using Ragas. While LLM-based evaluators offer a powerful means of scoring AI applications, they can sometimes produce judgments that diverge from human expectations due to differences in style, context, or subtle nuances. By following this guide, you will refine your metric so that it more accurately mirrors human judgment.
In this tutorial, you will:
- Define a model-based metric using Ragas.
- Construct an EvaluationDataset from the "helpful" subset of the HHH dataset.
- Run an initial evaluation to benchmark the metric's performance.
- Review and annotate 15–20 evaluation examples.
- Train the metric using your annotated data.
- Reevaluate the metric to observe improvements in alignment with human judgments.
Getting Started
Install Dependencies
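The exact package list depends on your environment; the line below is a minimal sketch, with package names inferred from the imports used later in this tutorial.
%pip install --upgrade --quiet ragas langchain-google-vertexai datasets scikit-learn google-cloud-aiplatform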
Restart runtime
To use the newly installed packages in this Jupyter runtime, you must restart the runtime. You can do this by running the cell below, which restarts the current kernel.
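A common way to do this in a notebook is to shut down the current kernel programmatically; the cell below is one such sketch.
import IPython

app = IPython.Application.instance()
app.kernel.do_shutdown(True)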
The restart might take a minute or longer. After it's restarted, continue to the next step.
Authenticate your notebook environment (Colab only)
If you're running this notebook on Google Colab, run the cell below to authenticate your environment.
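The cell below is a typical pattern: it authenticates only when the notebook is running inside Colab.
import sys

if "google.colab" in sys.modules:
    from google.colab import auth

    auth.authenticate_user()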
Set Google Cloud project information and initialize Vertex AI SDK
PROJECT_ID = "[your-project-id]" # @param {type:"string"}
LOCATION = "us-central1" # @param {type:"string"}
if not PROJECT_ID or PROJECT_ID == "[your-project-id]":
    raise ValueError("Please set your PROJECT_ID")
import vertexai
vertexai.init(project=PROJECT_ID, location=LOCATION)
Set up eval metrics
LLM-based metrics have tremendous potential but can sometimes misjudge responses compared to human evaluators. To bridge this gap, we align our model-based metric with human judgment using a feedback loop.
Define evaluator_llm
Import the required wrappers and define your evaluator LLM and embedder.
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_google_vertexai import VertexAI, VertexAIEmbeddings
evaluator_llm = LangchainLLMWrapper(VertexAI(model_name="gemini-2.0-flash-001"))
evaluator_embeddings = LangchainEmbeddingsWrapper(VertexAIEmbeddings(model_name="text-embedding-004"))
Ragas metrics
Ragas offers various model-based metrics that can be fine-tuned to align with human evaluators. For demonstration, we will use the Aspect Critic metric, a user-defined binary metric. For further details, please refer to the Aspect Critic documentation.
from ragas.metrics import AspectCritic
helpfulness_critic = AspectCritic(
name="helpfulness",
definition="Evaluate how helpful the assistant's response is to the user's query.",
llm=evaluator_llm
)
You can preview the prompt that will be passed to the LLM (before alignment) by running:
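One way to inspect it is shown below; the prompt key name is taken from the training logs later in this tutorial and may differ across Ragas versions.
# Print the current (pre-alignment) instruction used by the metric
print(helpfulness_critic.get_prompts()["single_turn_aspect_critic_prompt"].instruction)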
Output
Evaluate the Input based on the criterial defined. Use only 'Yes' (1) and 'No' (0) as verdict.
Criteria Definition: Evaluate how helpful the assistant's response is to the user's query.
Defining Alignment Score
Since we are using a binary metric, we will measure alignment using the F1-score. Depending on the metric you are aligning, you can modify this function to use a different measure of alignment.
from typing import List
from sklearn.metrics import f1_score
def alignment_score(human_score: List[int], llm_score: List[int]) -> float:
    """
    Computes the alignment between human-annotated binary scores and LLM-generated
    binary scores using the F1-score metric.

    Args:
        human_score (List[int]): Binary labels from human evaluation (0 or 1).
        llm_score (List[int]): Binary labels from LLM predictions (0 or 1).

    Returns:
        float: The F1-score measuring alignment.
    """
    return f1_score(human_score, llm_score)
Prepare your dataset
The process_hhh_dataset function prepares data from the HHH dataset for training and aligning the LLM evaluator. For even-numbered examples it selects the preferred response and records an expert score of 1 (helpful); for odd-numbered examples it selects the rejected response and records a 0 (not helpful). The result is an EvaluationDataset containing a mix of helpful and unhelpful responses, paired with the corresponding human labels.
from datasets import load_dataset
from ragas import EvaluationDataset

def process_hhh_dataset(split: str = "helpful", total_count: int = 50):
    # `split` selects the HHH subset (e.g. "helpful"); the `split` kwarg slices the test set.
    dataset = load_dataset("HuggingFaceH4/hhh_alignment", split, split=f"test[:{total_count}]")
    data = []
    expert_scores = []
    for idx, entry in enumerate(dataset):
        # Extract the user input and the candidate responses with their preference labels
        user_input = entry['input']
        choices = entry['targets']['choices']
        labels = entry['targets']['labels']
        # Alternate between the preferred (label 1) and rejected (label 0) response
        if idx % 2 == 0:
            target_label = 1
            score = 1
        else:
            target_label = 0
            score = 0
        label_index = labels.index(target_label)
        response = choices[label_index]
        data.append({
            'user_input': user_input,
            'response': response,
        })
        expert_scores.append(score)
    return EvaluationDataset.from_list(data), expert_scores
eval_dataset, expert_scores = process_hhh_dataset()
Run evaluation
With the evaluation dataset and the helpfulness metric defined, you can now run the evaluation:
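The call below is a minimal sketch: it assumes the eval_dataset and helpfulness_critic defined above and uses Ragas' top-level evaluate function.
from ragas import evaluate

# Score every example with the (not yet aligned) helpfulness metric
results = evaluate(dataset=eval_dataset, metrics=[helpfulness_critic])
results.to_pandas()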
This initial run highlights the level of misalignment present in LLM-based evaluators, which the subsequent training will address.
Next, benchmark the metric's performance against the expert scores:
human_score = expert_scores
llm_score = results.to_pandas()["helpfulness"].values
initial_score = alignment_score(human_score, llm_score)
initial_score
Review and Annotate
Now that you have obtained the evaluation results, it's time to review and annotate them. As discussed in the blog post Aligning LLM as judge with human evaluators, collecting detailed feedback is essential for bridging the gap between LLM-based and human evaluations. Annotate at least 15–20 examples that capture diverse scenarios where the metric might be misaligned.
Here is a sample annotation file for the examples above, which you can download and use.
Training and Alignment
The next step is to train your metric using the annotated examples. This training process leverages a gradient-free prompt optimization approach that adjusts both instructions and few-shot demonstrations based on the annotated feedback.
from ragas.config import InstructionConfig, DemonstrationConfig
demo_config = DemonstrationConfig(embedding=evaluator_embeddings)
inst_config = InstructionConfig(llm=evaluator_llm)
helpfulness_critic.train(
path="annotated_data.json",
instruction_config=inst_config,
demonstration_config=demo_config,
)
Overall Progress: 100%|██████████| 170/170 [00:00<?, ?it/s]
Few-shot examples [single_turn_aspect_critic_prompt]: 100%|██████████| 16/16 [00:00<?, ?it/s]
After training, review the updated instructions that have been optimized for the metric:
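As before, the snippet below assumes the single_turn_aspect_critic_prompt key; adjust it if your Ragas version names the prompt differently.
# Print the optimized instruction after training
print(helpfulness_critic.get_prompts()["single_turn_aspect_critic_prompt"].instruction)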
Output
You are provided with a user input and an assistant/model response. Your task is to evaluate the quality of the response based on how well it addresses the user input, considering all requests and constraints. Assign a score/verdict of 1 if the response is helpful, appropriate, and effective, and 0 if it is not. A good response should be accurate, complete, relevant, and provide a tangible improvement or solution, without omitting key information. Provide a brief explanation for your score/verdict.
Re-evaluate
Now that your metric has been aligned with human feedback, re-run the evaluation on your dataset. This step allows you to benchmark the improvements and quantify how well the alignment process has enhanced the metric's reliability.
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_google_vertexai import VertexAI, VertexAIEmbeddings
evaluator_llm = LangchainLLMWrapper(VertexAI(model_name="gemini-2.0-flash-001"))
evaluator_embeddings = LangchainEmbeddingsWrapper(VertexAIEmbeddings(model_name="text-embedding-004"))
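As with the initial run, the cell below is a sketch: it re-scores the same eval_dataset with the now-trained helpfulness_critic and stores the output in results2 for benchmarking.
from ragas import evaluate

# Re-run the evaluation with the aligned metric
results2 = evaluate(dataset=eval_dataset, metrics=[helpfulness_critic])
results2.to_pandas()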
Benchmark the updated results against the expert scores:
human_score = expert_scores
llm_score = results2.to_pandas()["helpfulness"].values
new_score = alignment_score(human_score, llm_score)
new_score
Check out the other tutorials in this series:
- Ragas with Vertex AI: Learn how to use Vertex AI models with Ragas to evaluate your LLM workflows.
- Model Comparison: Compare models provided by Vertex AI on a RAG-based Q&A task using Ragas metrics.