Metrics
MetricType
Bases: Enum
Enumeration of metric types in Ragas.
Attributes:
| Name | Type | Description |
|---|---|---|
| SINGLE_TURN | str | Represents a single-turn metric type. |
| MULTI_TURN | str | Represents a multi-turn metric type. |
Metric
dataclass
Metric(_required_columns: Dict[MetricType, Set[str]] = dict(), name: str = '')
Bases: ABC
Abstract base class for metrics in Ragas.
Attributes:
| Name | Type | Description |
|---|---|---|
| name | str | The name of the metric. |
| required_columns | Dict[str, Set[str]] | A dictionary mapping metric type names to sets of required column names. This is a property and raises `ValueError` if the columns are not valid. |
init
abstractmethod
init(run_config: RunConfig) -> None
Initialize the metric with the given run configuration.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| run_config | RunConfig | Configuration for the metric run including timeouts and other settings. | required |
Source code in src/ragas/metrics/base.py
score
Calculates the score for a single row of data.
Note
This method is deprecated and will be removed in 0.3. Please use single_turn_ascore or multi_turn_ascore instead.
Source code in src/ragas/metrics/base.py
ascore
async
Asynchronously calculates the score for a single row of data.
Note
This method is deprecated and will be removed in 0.3. Please use single_turn_ascore instead.
Source code in src/ragas/metrics/base.py
MetricWithLLM
dataclass
MetricWithLLM(_required_columns: Dict[MetricType, Set[str]] = dict(), name: str = '', llm: Optional[BaseRagasLLM] = None, output_type: Optional[MetricOutputType] = None)
Bases: Metric, PromptMixin
A metric class that uses a language model for evaluation.
Attributes:
| Name | Type | Description |
|---|---|---|
| llm | Optional[BaseRagasLLM] | The language model used for the metric. Both BaseRagasLLM and InstructorBaseRagasLLM are accepted at runtime via duck typing (both have compatible methods). |
init
init(run_config: RunConfig) -> None
Initialize the metric with run configuration and validate LLM is present.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| run_config | RunConfig | Configuration for the metric run. | required |
Raises:
| Type | Description |
|---|---|
| ValueError | If no LLM is provided to the metric. |
Source code in src/ragas/metrics/base.py
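As a hedged illustration of the attribute and method above, the sketch below wires a wrapped LLM into an LLM-based metric and applies a run configuration; the model name is only an example.

```python
# Minimal sketch, assuming an OpenAI key is configured and langchain-openai is installed.
from langchain_openai import ChatOpenAI

from ragas.llms import LangchainLLMWrapper
from ragas.metrics import Faithfulness
from ragas.run_config import RunConfig

llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))  # example model
metric = Faithfulness(llm=llm)

# init() checks that an LLM is present and applies timeouts/retry settings;
# evaluate() normally calls this for you, so doing it manually is optional.
metric.init(RunConfig(timeout=60))
```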
train
train(path: str, demonstration_config: Optional[DemonstrationConfig] = None, instruction_config: Optional[InstructionConfig] = None, callbacks: Optional[Callbacks] = None, run_config: Optional[RunConfig] = None, batch_size: Optional[int] = None, with_debugging_logs=False, raise_exceptions: bool = True) -> None
Train the metric using local JSON data
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | str | Path to local JSON training data file | required |
| demonstration_config | DemonstrationConfig | Configuration for demonstration optimization | None |
| instruction_config | InstructionConfig | Configuration for instruction optimization | None |
| callbacks | Callbacks | List of callback functions | None |
| run_config | RunConfig | Run configuration | None |
| batch_size | int | Batch size for training | None |
| with_debugging_logs | bool | Enable debugging logs | False |
| raise_exceptions | bool | Whether to raise exceptions during training | True |
Raises:
| Type | Description |
|---|---|
| ValueError | If path is not provided or not a JSON file |
Source code in src/ragas/metrics/base.py
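A hedged sketch of the training call; the file name is a placeholder for locally exported annotation data, and the optional configs are left at their defaults.

```python
# Sketch only: "annotated_data.json" is a placeholder for locally exported
# annotations; train() raises ValueError if the path is not a JSON file.
from langchain_openai import ChatOpenAI

from ragas.llms import LangchainLLMWrapper
from ragas.metrics import Faithfulness

metric = Faithfulness(llm=LangchainLLMWrapper(ChatOpenAI()))
metric.train(path="annotated_data.json")
```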
SingleTurnMetric
dataclass
SingleTurnMetric(_required_columns: Dict[MetricType, Set[str]] = dict(), name: str = '')
Bases: Metric
A metric class for evaluating single-turn interactions.
This class provides methods to score single-turn samples, both synchronously and asynchronously.
single_turn_score
single_turn_score(sample: SingleTurnSample, callbacks: Callbacks = None) -> float
Synchronously score a single-turn sample.
May raise ImportError if nest_asyncio is not installed in a Jupyter-like environment.
Source code in src/ragas/metrics/base.py
single_turn_ascore
async
single_turn_ascore(sample: SingleTurnSample, callbacks: Callbacks = None, timeout: Optional[float] = None) -> float
Asynchronously score a single-turn sample with an optional timeout.
May raise asyncio.TimeoutError if the scoring process exceeds the specified timeout.
Source code in src/ragas/metrics/base.py
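The sketch below scores one SingleTurnSample both synchronously and asynchronously; Faithfulness is used only as a concrete SingleTurnMetric and the sample contents are illustrative.

```python
import asyncio

from langchain_openai import ChatOpenAI

from ragas import SingleTurnSample
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import Faithfulness

metric = Faithfulness(llm=LangchainLLMWrapper(ChatOpenAI()))

sample = SingleTurnSample(
    user_input="Where is the Eiffel Tower?",
    response="The Eiffel Tower is in Paris.",
    retrieved_contexts=["The Eiffel Tower is located in Paris, France."],
)

# Synchronous wrapper (may need nest_asyncio inside notebooks):
score = metric.single_turn_score(sample)

# Async variant with an explicit timeout in seconds:
score = asyncio.run(metric.single_turn_ascore(sample, timeout=60))
print(score)
```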
MultiTurnMetric
dataclass
MultiTurnMetric(_required_columns: Dict[MetricType, Set[str]] = dict(), name: str = '')
Bases: Metric
A metric class for evaluating multi-turn conversations.
This class extends the base Metric class to provide functionality for scoring multi-turn conversation samples.
multi_turn_score
multi_turn_score(sample: MultiTurnSample, callbacks: Callbacks = None) -> float
Score a multi-turn conversation sample synchronously.
May raise ImportError if nest_asyncio is not installed in Jupyter-like environments.
Source code in src/ragas/metrics/base.py
multi_turn_ascore
async
multi_turn_ascore(sample: MultiTurnSample, callbacks: Callbacks = None, timeout: Optional[float] = None) -> float
Score a multi-turn conversation sample asynchronously.
May raise asyncio.TimeoutError if the scoring process exceeds the specified timeout.
Source code in src/ragas/metrics/base.py
Ensember
Combine multiple LLM outputs for the same input (n > 1) into a single output.
from_discrete
Simple majority voting for binary values, e.g. [0, 0, 1] -> 0. Input: a list of lists of dicts, each containing the verdict for a single input.
Source code in src/ragas/metrics/base.py
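A rough sketch of what majority voting over repeated LLM verdicts looks like; the module-level ensembler instance and the (inputs, attribute) call signature are assumptions inferred from the description above, not confirmed API.

```python
# Assumed usage: each inner list holds the verdict dicts produced by n > 1 LLM
# calls for one input, and majority voting collapses them to a single verdict.
from ragas.metrics.base import ensembler  # assumed module-level Ensember instance

verdicts = [
    [{"verdict": 1}, {"verdict": 1}, {"verdict": 0}],  # majority -> 1
    [{"verdict": 0}, {"verdict": 0}, {"verdict": 1}],  # majority -> 0
]
combined = ensembler.from_discrete(verdicts, "verdict")  # signature assumed
print(combined)
```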
SimpleBaseMetric
dataclass
Bases: ABC
Base class for simple metrics that return MetricResult objects.
This class provides the foundation for metrics that evaluate inputs and return structured MetricResult objects containing scores and reasoning.
Attributes:
| Name | Type | Description |
|---|---|---|
| name | str | The name of the metric. |
| allowed_values | AllowedValuesType | Allowed values for the metric output. Can be a list of strings for discrete metrics, a tuple of floats for numeric metrics, or an integer for ranking metrics. |
Examples:
>>> from ragas.metrics import discrete_metric
>>>
>>> @discrete_metric(name="sentiment", allowed_values=["positive", "negative"])
>>> def sentiment_metric(user_input: str, response: str) -> str:
... return "positive" if "good" in response else "negative"
>>>
>>> result = sentiment_metric(user_input="How are you?", response="I'm good!")
>>> print(result.value) # "positive"
score
abstractmethod
Synchronously calculate the metric score.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| **kwargs | dict | Input parameters required by the specific metric implementation. | {} |
Returns:
| Type | Description |
|---|---|
| MetricResult | The evaluation result containing the score and reasoning. |
Source code in src/ragas/metrics/base.py
ascore
abstractmethod
async
Asynchronously calculate the metric score.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| **kwargs | dict | Input parameters required by the specific metric implementation. | {} |
Returns:
| Type | Description |
|---|---|
| MetricResult | The evaluation result containing the score and reasoning. |
Source code in src/ragas/metrics/base.py
batch_score
Synchronously calculate scores for a batch of inputs.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| inputs | List[Dict[str, Any]] | List of input dictionaries, each containing parameters for the metric. | required |
Returns:
| Type | Description |
|---|---|
| List[MetricResult] | List of evaluation results, one for each input. |
Source code in src/ragas/metrics/base.py
abatch_score
async
Asynchronously calculate scores for a batch of inputs in parallel.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| inputs | List[Dict[str, Any]] | List of input dictionaries, each containing parameters for the metric. | required |
Returns:
| Type | Description |
|---|---|
| List[MetricResult] | List of evaluation results, one for each input. |
Source code in src/ragas/metrics/base.py
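A minimal sketch of batch scoring, assuming a function-based metric created with the discrete_metric decorator (documented later on this page) inherits batch_score from this base class.

```python
from ragas.metrics import discrete_metric

@discrete_metric(name="sentiment", allowed_values=["positive", "negative"])
def sentiment_metric(user_input: str, response: str) -> str:
    return "positive" if "good" in response else "negative"

# Each dict supplies the keyword arguments the metric expects.
results = sentiment_metric.batch_score(
    inputs=[
        {"user_input": "How are you?", "response": "I'm good!"},
        {"user_input": "How was the service?", "response": "It was awful."},
    ]
)
print([r.value for r in results])  # e.g. ['positive', 'negative']
```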
SimpleLLMMetric
dataclass
SimpleLLMMetric(name: str, allowed_values: AllowedValuesType = (lambda: ['pass', 'fail'])(), prompt: Optional[Union[str, 'Prompt']] = None)
Bases: SimpleBaseMetric
LLM-based metric that uses prompts to generate structured responses.
save
Save the metric configuration to a JSON file.
Parameters:
path : str, optional
    File path to save to. If not provided, saves to "./{metric.name}.json". Use a .gz extension for compression.
Note:
If the metric has a response_model, its schema will be saved for reference but the model itself cannot be serialized. You'll need to provide it when loading.
Examples:
All these work:
metric.save()                      # → ./response_quality.json
metric.save("custom.json")         # → ./custom.json
metric.save("/path/to/metrics/")   # → /path/to/metrics/response_quality.json
metric.save("no_extension")        # → ./no_extension.json
metric.save("compressed.json.gz")  # → ./compressed.json.gz (compressed)
Source code in src/ragas/metrics/base.py
load
classmethod
load(path: str, response_model: Optional[Type['BaseModel']] = None, embedding_model: Optional['EmbeddingModelType'] = None) -> 'SimpleLLMMetric'
Load a metric from a JSON file.
Parameters:
path : str
    File path to load from. Supports .gz compressed files.
response_model : Optional[Type[BaseModel]]
    Pydantic model to use for response validation. Required for custom SimpleLLMMetrics.
embedding_model : Optional[Any]
    Embedding model for DynamicFewShotPrompt. Required if the original used one.
Returns:
SimpleLLMMetric
    Loaded metric instance
Raises:
ValueError
    If file cannot be loaded, is invalid, or missing required models
Source code in src/ragas/metrics/base.py
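A hedged save/load round trip using the DiscreteMetric subclass documented further down this page; the file name, prompt, and allowed values are placeholders.

```python
from ragas.metrics import DiscreteMetric

metric = DiscreteMetric(
    name="response_quality",
    prompt="Evaluate the response to '{user_input}': {response}",  # placeholder prompt
    allowed_values=["pass", "fail"],
)
metric.save("./response_quality.json")  # plain JSON; use a .gz extension to compress

# Later, or in another process:
restored = DiscreteMetric.load("./response_quality.json")
```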
get_correlation
abstractmethod
Calculate the correlation between gold scores and predicted scores. This is a placeholder method and should be implemented based on the specific metric.
Source code in src/ragas/metrics/base.py
align_and_validate
align_and_validate(dataset: 'Dataset', embedding_model: 'EmbeddingModelType', llm: 'BaseRagasLLM', test_size: float = 0.2, random_state: int = 42, **kwargs: Dict[str, Any])
Align the metric with the specified experiments and validate it against a gold standard experiment. This method combines alignment and validation into a single step.
Parameters:
dataset : Dataset
    Experiment to align the metric with.
embedding_model : EmbeddingModelType
    The embedding model used for dynamic few-shot prompting.
llm : BaseRagasLLM
    The LLM instance to use for scoring.
Source code in src/ragas/metrics/base.py
align
Align the metric with the specified experiments by different optimization methods.
Parameters:
train_dataset
    The train dataset to align the metric with.
embedding_model
    The embedding model used for dynamic few-shot prompting.
Source code in src/ragas/metrics/base.py
validate_alignment
Validate the alignment of the metric by comparing its scores against a gold standard experiment. This method computes the Cohen's Kappa score and agreement rate between the gold standard scores and the predicted scores from the metric.
Parameters:
llm
    The LLM instance to use for scoring.
test_dataset
    A Dataset instance containing the gold standard scores.
mapping
    A dictionary mapping variable names expected by the metric to their corresponding names in the gold experiment.
Source code in src/ragas/metrics/base.py
create_auto_response_model
Create a response model and mark it as auto-generated by Ragas.
This function creates a Pydantic model using create_model and marks it with a special attribute to indicate it was auto-generated. This allows the save() method to distinguish between auto-generated models (which are recreated on load) and custom user models.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| name | str | Name for the model class | required |
| **fields |  | Field definitions in create_model format. Each field is specified as: field_name=(type, default_or_field_info) | {} |
Returns:
| Type | Description |
|---|---|
| Type[BaseModel] | Pydantic model class marked as auto-generated |
Examples:
>>> from pydantic import Field
>>> # Simple model with required fields
>>> ResponseModel = create_auto_response_model(
... "ResponseModel",
... value=(str, ...),
... reason=(str, ...)
... )
>>>
>>> # Model with Field validators and descriptions
>>> ResponseModel = create_auto_response_model(
... "ResponseModel",
... value=(str, Field(..., description="The predicted value")),
... reason=(str, Field(..., description="Reasoning for the prediction"))
... )
Source code in src/ragas/metrics/base.py
AnswerCorrectness
dataclass
AnswerCorrectness(_required_columns: Dict[MetricType, Set[str]] = (lambda: {SINGLE_TURN: {'user_input', 'response', 'reference'}})(), name: str = 'answer_correctness', embeddings: Optional[Union[BaseRagasEmbeddings, BaseRagasEmbedding]] = None, llm: Optional[BaseRagasLLM] = None, output_type: Optional[MetricOutputType] = None, correctness_prompt: PydanticPrompt = CorrectnessClassifier(), statement_generator_prompt: PydanticPrompt = StatementGeneratorPrompt(), weights: list[float] = (lambda: [0.75, 0.25])(), beta: float = 1.0, answer_similarity: Optional[AnswerSimilarity] = None, max_retries: int = 1)
Bases: MetricWithLLM, MetricWithEmbeddings, SingleTurnMetric
Measures answer correctness compared to ground truth as a combination of factuality and semantic similarity.
Attributes:
| Name | Type | Description |
|---|---|---|
| name | str | The name of the metric. |
| weights | list[float] | A list of two weights corresponding to factuality and semantic similarity. Defaults to [0.75, 0.25]. |
| answer_similarity | Optional[AnswerSimilarity] | The AnswerSimilarity object. |
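A hedged usage sketch: answer correctness needs both an LLM (for the factuality classification) and embeddings (for semantic similarity); the model choices, weights, and sample content are illustrative.

```python
import asyncio

from langchain_openai import ChatOpenAI, OpenAIEmbeddings

from ragas import SingleTurnSample
from ragas.embeddings import LangchainEmbeddingsWrapper
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import AnswerCorrectness

metric = AnswerCorrectness(
    llm=LangchainLLMWrapper(ChatOpenAI()),
    embeddings=LangchainEmbeddingsWrapper(OpenAIEmbeddings()),
    weights=[0.75, 0.25],  # factuality vs. semantic similarity
)

sample = SingleTurnSample(
    user_input="When was the first Moon landing?",
    response="The first Moon landing happened in 1969.",
    reference="Apollo 11 landed on the Moon on July 20, 1969.",
)
print(asyncio.run(metric.single_turn_ascore(sample)))
```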
ResponseRelevancy
dataclass
ResponseRelevancy(_required_columns: Dict[MetricType, Set[str]] = (lambda: {SINGLE_TURN: {'user_input', 'response'}})(), name: str = 'answer_relevancy', embeddings: Optional[Union[BaseRagasEmbeddings, BaseRagasEmbedding]] = None, llm: Optional[BaseRagasLLM] = None, output_type: Optional[MetricOutputType] = None, question_generation: PydanticPrompt = ResponseRelevancePrompt(), strictness: int = 3)
Bases: MetricWithLLM, MetricWithEmbeddings, SingleTurnMetric
Scores the relevancy of the answer according to the given question. Answers with incomplete, redundant or unnecessary information are penalized. The score ranges from 0 to 1, with 1 being the best.
Attributes:
| Name | Type | Description |
|---|---|---|
| name | str | The name of the metric. |
| strictness | int | The number of questions generated per answer. The ideal range is 3 to 5. |
| embeddings | Embedding | The LangChain wrapper of the embedding object, e.g. HuggingFaceEmbeddings('BAAI/bge-base-en'). |
SemanticSimilarity
dataclass
SemanticSimilarity(_required_columns: Dict[MetricType, Set[str]] = (lambda: {SINGLE_TURN: {'reference', 'response'}})(), name: str = 'semantic_similarity', embeddings: Optional[Union[BaseRagasEmbeddings, BaseRagasEmbedding]] = None, is_cross_encoder: bool = False, threshold: Optional[float] = None)
Bases: MetricWithEmbeddings, SingleTurnMetric
Scores the semantic similarity of the ground truth with the generated answer. A cross-encoder score is used to quantify semantic similarity. SAS paper: https://arxiv.org/pdf/2108.06130.pdf
Attributes:
| Name | Type | Description |
|---|---|---|
| name | str |  |
| model_name |  | The model to be used for calculating semantic similarity. Defaults to OpenAI embeddings; select a cross-encoder model for best results (https://huggingface.co/spaces/mteb/leaderboard). |
| threshold | Optional[float] | The threshold, if given, used to map the output to binary. Default 0.5. |
AspectCritic
AspectCritic(name: str, definition: str, llm: Optional[BaseRagasLLM] = None, required_columns: Optional[Dict[MetricType, Set[str]]] = None, output_type: Optional[MetricOutputType] = BINARY, single_turn_prompt: Optional[PydanticPrompt] = None, multi_turn_prompt: Optional[PydanticPrompt] = None, strictness: int = 1, max_retries: int = 1)
Bases: MetricWithLLM, SingleTurnMetric, MultiTurnMetric
Judges the submission to give binary results using the criteria specified in the metric definition.
Attributes:
| Name | Type | Description |
|---|---|---|
| name | str | The name of the metric. |
| definition | str | Criteria to judge the submission, e.g. "Is the submission spreading fake information?" |
| strictness | int | The number of times self-consistency checks are performed. The final judgement is made using a majority vote. |
Source code in src/ragas/metrics/_aspect_critic.py
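A short sketch of a custom critic; the definition string is only an example of the kind of yes/no criterion this metric expects.

```python
from langchain_openai import ChatOpenAI

from ragas.llms import LangchainLLMWrapper
from ragas.metrics import AspectCritic

harmfulness = AspectCritic(
    name="harmfulness",
    definition="Does the submission cause or risk harm to people or property?",
    llm=LangchainLLMWrapper(ChatOpenAI()),
    strictness=3,  # odd values avoid ties in the majority vote
)
# harmfulness.single_turn_score(sample) then returns a binary 0/1 verdict.
```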
ContextEntityRecall
dataclass
ContextEntityRecall(_required_columns: Dict[MetricType, Set[str]] = (lambda: {SINGLE_TURN: {'reference', 'retrieved_contexts'}})(), name: str = 'context_entity_recall', llm: Optional[BaseRagasLLM] = None, output_type: Optional[MetricOutputType] = None, context_entity_recall_prompt: PydanticPrompt = ExtractEntitiesPrompt(), max_retries: int = 1)
Bases: MetricWithLLM, SingleTurnMetric
Calculates recall based on entities present in the ground truth and the context. Let CN be the set of entities present in the context and GN the set of entities present in the ground truth.
Then we define context entity recall as follows: Context Entity Recall = |CN ∩ GN| / |GN|
If this quantity is 1, the retrieval mechanism has retrieved context that covers all entities present in the ground truth, and is thus a useful retrieval. This metric can therefore be used to evaluate retrieval mechanisms in use cases where entities matter, for example a tourism help chatbot.
Attributes:
| Name | Type | Description |
|---|---|---|
| name | str |  |
| batch_size | int | Batch size for OpenAI completion. |
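A worked example of the formula above; the entity sets are hand-picked for illustration, whereas the metric itself extracts them with an LLM prompt.

```python
context_entities = {"Eiffel Tower", "Paris", "1889"}  # CN, extracted from retrieved_contexts
ground_truth_entities = {"Eiffel Tower", "Paris"}     # GN, extracted from the reference

recall = len(context_entities & ground_truth_entities) / len(ground_truth_entities)
print(recall)  # 1.0 -> the retrieved context covers every ground-truth entity
```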
IDBasedContextPrecision
dataclass
IDBasedContextPrecision(_required_columns: Dict[MetricType, Set[str]] = (lambda: {SINGLE_TURN: {'retrieved_context_ids', 'reference_context_ids'}})(), name: str = 'id_based_context_precision', output_type: MetricOutputType = CONTINUOUS)
Bases: SingleTurnMetric
Calculates context precision by directly comparing retrieved context IDs with reference context IDs. The score represents what proportion of the retrieved context IDs are actually relevant (present in reference).
This metric works with both string and integer IDs.
Attributes:
| Name | Type | Description |
|---|---|---|
| name | str | Name of the metric |
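A minimal sketch, assuming SingleTurnSample carries the retrieved_context_ids and reference_context_ids fields this metric requires; no LLM is needed.

```python
import asyncio

from ragas import SingleTurnSample
from ragas.metrics import IDBasedContextPrecision

sample = SingleTurnSample(
    retrieved_context_ids=["d1", "d2", "d3", "d4"],
    reference_context_ids=["d2", "d4", "d9"],
)
metric = IDBasedContextPrecision()
# 2 of the 4 retrieved IDs appear in the reference -> expected score 0.5
print(asyncio.run(metric.single_turn_ascore(sample)))
```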
LLMContextPrecisionWithReference
dataclass
LLMContextPrecisionWithReference(_required_columns: Dict[MetricType, Set[str]] = (lambda: {SINGLE_TURN: {'user_input', 'retrieved_contexts', 'reference'}})(), name: str = 'llm_context_precision_with_reference', llm: Optional[BaseRagasLLM] = None, output_type: Optional[MetricOutputType] = None, context_precision_prompt: PydanticPrompt = ContextPrecisionPrompt(), max_retries: int = 1)
Bases: MetricWithLLM, SingleTurnMetric
Average Precision evaluates whether all of the relevant items selected by the model are ranked higher than the irrelevant ones.
Attributes:
| Name | Type | Description |
|---|---|---|
| name | str |  |
| evaluation_mode | EvaluationMode |  |
| context_precision_prompt | Prompt |  |
IDBasedContextRecall
dataclass
IDBasedContextRecall(_required_columns: Dict[MetricType, Set[str]] = (lambda: {SINGLE_TURN: {'retrieved_context_ids', 'reference_context_ids'}})(), name: str = 'id_based_context_recall', output_type: MetricOutputType = CONTINUOUS)
Bases: SingleTurnMetric
Calculates context recall by directly comparing retrieved context IDs with reference context IDs. The score represents what proportion of the reference IDs were successfully retrieved.
This metric works with both string and integer IDs.
Attributes:
| Name | Type | Description |
|---|---|---|
| name | str | Name of the metric |
LLMContextRecall
dataclass
LLMContextRecall(_required_columns: Dict[MetricType, Set[str]] = (lambda: {SINGLE_TURN: {'user_input', 'retrieved_contexts', 'reference'}})(), name: str = 'context_recall', llm: Optional[BaseRagasLLM] = None, output_type: Optional[MetricOutputType] = CONTINUOUS, context_recall_prompt: PydanticPrompt = ContextRecallClassificationPrompt(), max_retries: int = 1)
Bases: MetricWithLLM, SingleTurnMetric
Estimates context recall by estimating TP and FN using annotated answer and retrieved context.
Attributes:
| Name | Type | Description |
|---|---|---|
| name | str |  |
FactualCorrectness
dataclass
FactualCorrectness(_required_columns: Dict[MetricType, Set[str]] = (lambda: {SINGLE_TURN: {'response', 'reference'}})(), name: str = 'factual_correctness', llm: Optional[BaseRagasLLM] = None, output_type: Optional[MetricOutputType] = CONTINUOUS, mode: Literal['precision', 'recall', 'f1'] = 'f1', beta: float = 1.0, atomicity: Literal['low', 'high'] = 'low', coverage: Literal['low', 'high'] = 'low', claim_decomposition_prompt: PydanticPrompt = ClaimDecompositionPrompt(), nli_prompt: PydanticPrompt = NLIStatementPrompt(), language: str = 'english')
Bases: MetricWithLLM, SingleTurnMetric
FactualCorrectness is a metric class that evaluates the factual correctness of responses generated by a language model. It uses claim decomposition and natural language inference (NLI) to verify the claims made in the responses against reference texts.
Attributes:
| Name | Type | Description |
|---|---|---|
| name | str | The name of the metric. Default is "factual_correctness". |
| _required_columns | Dict[MetricType, Set[str]] | A dictionary specifying the required columns for each metric type. Default is {"SINGLE_TURN": {"response", "reference"}}. |
| mode | Literal["precision", "recall", "f1"] | The mode of evaluation: "precision", "recall", or "f1". Default is "f1". |
| beta | float | The beta value used for the F1 score calculation. A beta > 1 gives more weight to recall, while beta < 1 favors precision. Default is 1.0. |
| atomicity | Literal["low", "high"] | The level of atomicity for claim decomposition. Default is "low". |
| coverage | Literal["low", "high"] | The level of coverage for claim decomposition. Default is "low". |
| claim_decomposition_prompt | PydanticPrompt | The prompt used for claim decomposition. |
| nli_prompt | PydanticPrompt | The prompt used for natural language inference (NLI). |
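A hedged configuration sketch; the mode, atomicity and coverage values are taken from the attribute table above, and the sample text is illustrative.

```python
import asyncio

from langchain_openai import ChatOpenAI

from ragas import SingleTurnSample
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import FactualCorrectness

metric = FactualCorrectness(
    llm=LangchainLLMWrapper(ChatOpenAI()),
    mode="precision",   # only check that response claims are supported by the reference
    atomicity="high",   # decompose into finer-grained claims
    coverage="high",
)

sample = SingleTurnSample(
    response="Einstein was born in 1879 in Germany.",
    reference="Albert Einstein was born on 14 March 1879 in Ulm, Germany.",
)
print(asyncio.run(metric.single_turn_ascore(sample)))
```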
Faithfulness
dataclass
Faithfulness(_required_columns: Dict[MetricType, Set[str]] = (lambda: {SINGLE_TURN: {'user_input', 'response', 'retrieved_contexts'}})(), name: str = 'faithfulness', llm: Optional[BaseRagasLLM] = None, output_type: Optional[MetricOutputType] = CONTINUOUS, nli_statements_prompt: PydanticPrompt = NLIStatementPrompt(), statement_generator_prompt: PydanticPrompt = StatementGeneratorPrompt(), max_retries: int = 1)
Bases: MetricWithLLM, SingleTurnMetric
FaithfulnesswithHHEM
dataclass
FaithfulnesswithHHEM(_required_columns: Dict[MetricType, Set[str]] = (lambda: {SINGLE_TURN: {'user_input', 'response', 'retrieved_contexts'}})(), name: str = 'faithfulness_with_hhem', llm: Optional[BaseRagasLLM] = None, output_type: Optional[MetricOutputType] = CONTINUOUS, nli_statements_prompt: PydanticPrompt = NLIStatementPrompt(), statement_generator_prompt: PydanticPrompt = StatementGeneratorPrompt(), max_retries: int = 1, device: str = 'cpu', batch_size: int = 10)
Bases: Faithfulness
NoiseSensitivity
dataclass
NoiseSensitivity(_required_columns: Dict[MetricType, Set[str]] = (lambda: {SINGLE_TURN: {'user_input', 'response', 'reference', 'retrieved_contexts'}})(), name: str = 'noise_sensitivity', llm: Optional[BaseRagasLLM] = None, output_type: Optional[MetricOutputType] = CONTINUOUS, mode: Literal['relevant', 'irrelevant'] = 'relevant', nli_statements_prompt: PydanticPrompt = NLIStatementPrompt(), statement_generator_prompt: PydanticPrompt = StatementGeneratorPrompt(), max_retries: int = 1)
Bases: MetricWithLLM, SingleTurnMetric
AnswerAccuracy
dataclass
AnswerAccuracy(_required_columns: Dict[MetricType, Set[str]] = (lambda: {SINGLE_TURN: {'user_input', 'response', 'reference'}})(), name: str = 'nv_accuracy', llm: Optional[BaseRagasLLM] = None, output_type: Optional[MetricOutputType] = None)
Bases: MetricWithLLM, SingleTurnMetric
Measures answer accuracy compared to the ground truth for a given user_input. The metric averages the scores of two distinct judge prompts.
Top-10 zero-shot LLM-as-a-Judge leaderboard:
1. nvidia/Llama-3_3-Nemotron-Super-49B-v1
2. mistralai/mixtral-8x22b-instruct-v0.1
3. mistralai/mixtral-8x7b-instruct-v0.1
4. meta/llama-3.1-70b-instruct
5. meta/llama-3.3-70b-instruct
6. meta/llama-3.1-405b-instruct
7. mistralai/mistral-nemo-12b-instruct
8. nvidia/llama-3.1-nemotron-70b-instruct
9. meta/llama-3.1-8b-instruct
10. google/gemma-2-2b-it

The top-1 leaderboard model has a high correlation with human judges (~0.92).
Attributes:
| Name | Type | Description |
|---|---|---|
| name | str | The name of the metric. |
| answer_accuracy |  | The AnswerAccuracy object. |
ContextRelevance
dataclass
ContextRelevance(_required_columns: Dict[MetricType, Set[str]] = (lambda: {SINGLE_TURN: {'user_input', 'retrieved_contexts'}})(), name: str = 'nv_context_relevance', llm: Optional[BaseRagasLLM] = None, output_type: Optional[MetricOutputType] = None)
Bases: MetricWithLLM, SingleTurnMetric
Scores the relevance of the retrieved contexts based on the user input.
Input: data, a list of dicts with keys user_input and retrieved_contexts.
Output:
- 0.0: retrieved_contexts is not relevant for the user_input
- 0.5: retrieved_contexts is partially relevant for the user_input
- 1.0: retrieved_contexts is fully relevant for the user_input
ResponseGroundedness
dataclass
ResponseGroundedness(_required_columns: Dict[MetricType, Set[str]] = (lambda: {SINGLE_TURN: {'response', 'retrieved_contexts'}})(), name: str = 'nv_response_groundedness', llm: Optional[BaseRagasLLM] = None, output_type: Optional[MetricOutputType] = None)
Bases: MetricWithLLM, SingleTurnMetric
Scores the groundedness of the response based on the retrieved contexts.
Input: data, a list of dicts with keys response and retrieved_contexts.
Output:
- 0.0: the response is not grounded in the retrieved contexts
- 0.5: the response is partially grounded in the retrieved contexts
- 1.0: the response is fully grounded in the retrieved contexts
SimpleCriteriaScore
SimpleCriteriaScore(name: str, definition: str, llm: Optional[BaseRagasLLM] = None, required_columns: Optional[Dict[MetricType, Set[str]]] = None, output_type: Optional[MetricOutputType] = DISCRETE, single_turn_prompt: Optional[PydanticPrompt] = None, multi_turn_prompt: Optional[PydanticPrompt] = None, strictness: int = 1)
Bases: MetricWithLLM, SingleTurnMetric, MultiTurnMetric
Judges the submission against the criteria specified in the metric definition and returns an integer score.
Attributes:
| Name | Type | Description |
|---|---|---|
| name | str | The name of the metric. |
| definition | str | Criteria used to score the submission. |
| strictness | int | The number of times self-consistency checks are performed. The final judgement is made using a majority vote. |
Source code in src/ragas/metrics/_simple_criteria.py
ToolCallAccuracy
dataclass
ToolCallAccuracy(_required_columns: Dict[MetricType, Set[str]] = (lambda: {MULTI_TURN: {'user_input', 'reference_tool_calls'}})(), name: str = 'tool_call_accuracy', strict_order: bool = True, arg_comparison_metric: SingleTurnMetric = (lambda: ExactMatch())())
Bases: MultiTurnMetric
Tool Call Accuracy metric measures how accurately an LLM agent makes tool calls compared to reference tool calls.
The metric supports two evaluation modes:
1. Strict order (default): tool calls must match exactly in sequence.
2. Flexible order: tool calls can be in any order (parallel evaluation).

The metric evaluates two aspects:
1. Sequence alignment: whether predicted and reference tool calls match in the required order.
2. Argument accuracy: how well tool call arguments match between predicted and reference.

Score calculation:
- If sequences don't align: score = 0.
- If sequences align: score = (average argument accuracy) * sequence alignment factor.
- Length mismatches result in warnings and a proportional penalty.

Edge cases:
- No predicted tool calls: returns 0.0.
- Length mismatch: compares only the overlapping portion and applies a coverage penalty.
- Missing arguments: contribute 0 to the argument score for that tool call.

The final score is always between 0.0 and 1.0.
Parameters:
strict_order : bool
    If True (default), tool calls must match exactly in sequence. If False, tool calls can be in any order (parallel evaluation).
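A hedged sketch of scoring an agent trace; the message and ToolCall classes are taken from ragas.messages, and the conversation content is illustrative.

```python
import asyncio

from ragas import MultiTurnSample
from ragas.messages import AIMessage, HumanMessage, ToolCall
from ragas.metrics import ToolCallAccuracy

sample = MultiTurnSample(
    user_input=[
        HumanMessage(content="What's the weather in Paris?"),
        AIMessage(
            content="Let me check.",
            tool_calls=[ToolCall(name="get_weather", args={"city": "Paris"})],
        ),
    ],
    reference_tool_calls=[ToolCall(name="get_weather", args={"city": "Paris"})],
)

metric = ToolCallAccuracy()  # strict_order=True by default
print(asyncio.run(metric.multi_turn_ascore(sample)))  # 1.0 when names and args match
```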
BaseMetric
dataclass
Bases: ABC
Base class for simple metrics that return MetricResult objects.
This class provides the foundation for metrics that evaluate inputs and return structured MetricResult objects containing scores and reasoning.
Attributes:
| Name | Type | Description |
|---|---|---|
| name | str | The name of the metric. |
| allowed_values | AllowedValuesType | Allowed values for the metric output. Can be a list of strings for discrete metrics, a tuple of floats for numeric metrics, or an integer for ranking metrics. |
Examples:
>>> from ragas.metrics import discrete_metric
>>>
>>> @discrete_metric(name="sentiment", allowed_values=["positive", "negative"])
>>> def sentiment_metric(user_input: str, response: str) -> str:
... return "positive" if "good" in response else "negative"
>>>
>>> result = sentiment_metric(user_input="How are you?", response="I'm good!")
>>> print(result.value) # "positive"
score
abstractmethod
Synchronously calculate the metric score.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| **kwargs | dict | Input parameters required by the specific metric implementation. | {} |
Returns:
| Type | Description |
|---|---|
| MetricResult | The evaluation result containing the score and reasoning. |
Source code in src/ragas/metrics/base.py
ascore
abstractmethod
async
Asynchronously calculate the metric score.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| **kwargs | dict | Input parameters required by the specific metric implementation. | {} |
Returns:
| Type | Description |
|---|---|
| MetricResult | The evaluation result containing the score and reasoning. |
Source code in src/ragas/metrics/base.py
batch_score
Synchronously calculate scores for a batch of inputs.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| inputs | List[Dict[str, Any]] | List of input dictionaries, each containing parameters for the metric. | required |
Returns:
| Type | Description |
|---|---|
| List[MetricResult] | List of evaluation results, one for each input. |
Source code in src/ragas/metrics/base.py
abatch_score
async
Asynchronously calculate scores for a batch of inputs in parallel.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| inputs | List[Dict[str, Any]] | List of input dictionaries, each containing parameters for the metric. | required |
Returns:
| Type | Description |
|---|---|
| List[MetricResult] | List of evaluation results, one for each input. |
Source code in src/ragas/metrics/base.py
LLMMetric
dataclass
LLMMetric(name: str, allowed_values: AllowedValuesType = (lambda: ['pass', 'fail'])(), prompt: Optional[Union[str, 'Prompt']] = None)
Bases: SimpleBaseMetric
LLM-based metric that uses prompts to generate structured responses.
save
Save the metric configuration to a JSON file.
Parameters:
path : str, optional
    File path to save to. If not provided, saves to "./{metric.name}.json". Use a .gz extension for compression.
Note:
If the metric has a response_model, its schema will be saved for reference but the model itself cannot be serialized. You'll need to provide it when loading.
Examples:
All these work:
metric.save()                      # → ./response_quality.json
metric.save("custom.json")         # → ./custom.json
metric.save("/path/to/metrics/")   # → /path/to/metrics/response_quality.json
metric.save("no_extension")        # → ./no_extension.json
metric.save("compressed.json.gz")  # → ./compressed.json.gz (compressed)
Source code in src/ragas/metrics/base.py
load
classmethod
load(path: str, response_model: Optional[Type['BaseModel']] = None, embedding_model: Optional['EmbeddingModelType'] = None) -> 'SimpleLLMMetric'
Load a metric from a JSON file.
Parameters:
path : str
    File path to load from. Supports .gz compressed files.
response_model : Optional[Type[BaseModel]]
    Pydantic model to use for response validation. Required for custom SimpleLLMMetrics.
embedding_model : Optional[Any]
    Embedding model for DynamicFewShotPrompt. Required if the original used one.
Returns:
SimpleLLMMetric
    Loaded metric instance
Raises:
ValueError
    If file cannot be loaded, is invalid, or missing required models
Source code in src/ragas/metrics/base.py
get_correlation
abstractmethod
Calculate the correlation between gold scores and predicted scores. This is a placeholder method and should be implemented based on the specific metric.
Source code in src/ragas/metrics/base.py
align_and_validate
align_and_validate(dataset: 'Dataset', embedding_model: 'EmbeddingModelType', llm: 'BaseRagasLLM', test_size: float = 0.2, random_state: int = 42, **kwargs: Dict[str, Any])
Align the metric with the specified experiments and validate it against a gold standard experiment. This method combines alignment and validation into a single step.
Parameters:
dataset : Dataset
    Experiment to align the metric with.
embedding_model : EmbeddingModelType
    The embedding model used for dynamic few-shot prompting.
llm : BaseRagasLLM
    The LLM instance to use for scoring.
Source code in src/ragas/metrics/base.py
align
Align the metric with the specified experiments by different optimization methods.
Parameters:
train_dataset
    The train dataset to align the metric with.
embedding_model
    The embedding model used for dynamic few-shot prompting.
Source code in src/ragas/metrics/base.py
validate_alignment
Validate the alignment of the metric by comparing its scores against a gold standard experiment. This method computes the Cohen's Kappa score and agreement rate between the gold standard scores and the predicted scores from the metric.
Parameters:
llm
    The LLM instance to use for scoring.
test_dataset
    A Dataset instance containing the gold standard scores.
mapping
    A dictionary mapping variable names expected by the metric to their corresponding names in the gold experiment.
Source code in src/ragas/metrics/base.py
DiscreteMetric
dataclass
DiscreteMetric(name: str, allowed_values: List[str] = (lambda: ['pass', 'fail'])(), prompt: Optional[Union[str, 'Prompt']] = None)
Bases: SimpleLLMMetric, DiscreteValidator
Metric for categorical/discrete evaluations with predefined allowed values.
This class is used for metrics that output categorical values like "pass/fail", "good/bad/excellent", or custom discrete categories.
Attributes:
| Name | Type | Description |
|---|---|---|
| allowed_values | List[str] | List of allowed categorical values the metric can output. Default is ["pass", "fail"]. |
Examples:
>>> from ragas.metrics import DiscreteMetric
>>> from ragas.llms import LangchainLLMWrapper
>>> from langchain_openai import ChatOpenAI
>>>
>>> # Create a custom discrete metric
>>> llm = LangchainLLMWrapper(ChatOpenAI())
>>> metric = DiscreteMetric(
... name="quality_check",
... llm=llm,
... allowed_values=["excellent", "good", "poor"]
... )
get_correlation
Calculate the correlation between gold labels and predictions. This is a placeholder method and should be implemented based on the specific metric.
Source code in src/ragas/metrics/discrete.py
load
classmethod
load(path: str, embedding_model: Optional[EmbeddingModelType] = None) -> DiscreteMetric
Load a DiscreteMetric from a JSON file.
Parameters:
path : str
    File path to load from. Supports .gz compressed files.
embedding_model : Optional[Any]
    Embedding model for DynamicFewShotPrompt. Required if the original used one.
Returns:
DiscreteMetric
    Loaded metric instance
Raises:
ValueError
    If file cannot be loaded or is not a DiscreteMetric
Source code in src/ragas/metrics/discrete.py
NumericMetric
dataclass
NumericMetric(name: str, allowed_values: Union[Tuple[float, float], range] = (0.0, 1.0), prompt: Optional[Union[str, 'Prompt']] = None)
Bases: SimpleLLMMetric, NumericValidator
Metric for continuous numeric evaluations within a specified range.
This class is used for metrics that output numeric scores within a defined range, such as 0.0 to 1.0 for similarity scores or 1-10 ratings.
Attributes:
| Name | Type | Description |
|---|---|---|
| allowed_values | Union[Tuple[float, float], range] | The valid range for metric outputs. Can be a tuple of (min, max) floats or a range object. Default is (0.0, 1.0). |
Examples:
>>> from ragas.metrics import NumericMetric
>>> from ragas.llms import LangchainLLMWrapper
>>> from langchain_openai import ChatOpenAI
>>>
>>> # Create a custom numeric metric with 0-10 range
>>> llm = LangchainLLMWrapper(ChatOpenAI())
>>> metric = NumericMetric(
... name="quality_score",
... llm=llm,
... allowed_values=(0.0, 10.0)
... )
get_correlation
Calculate the correlation between gold labels and predictions. This is a placeholder method and should be implemented based on the specific metric.
Source code in src/ragas/metrics/numeric.py
load
classmethod
load(path: str, embedding_model: Optional[EmbeddingModelType] = None) -> NumericMetric
Load a NumericMetric from a JSON file.
Parameters:
path : str
    File path to load from. Supports .gz compressed files.
embedding_model : Optional[Any]
    Embedding model for DynamicFewShotPrompt. Required if the original used one.
Returns:
NumericMetric
    Loaded metric instance
Raises:
ValueError
    If file cannot be loaded or is not a NumericMetric
Source code in src/ragas/metrics/numeric.py
RankingMetric
dataclass
Bases: SimpleLLMMetric, RankingValidator
Metric for evaluations that produce ranked lists of items.
This class is used for metrics that output ordered lists, such as ranking search results, prioritizing features, or ordering responses by relevance.
Attributes:
| Name | Type | Description |
|---|---|---|
| allowed_values | int | Expected number of items in the ranking list. Default is 2. |
Examples:
>>> from ragas.metrics import RankingMetric
>>> from ragas.llms import LangchainLLMWrapper
>>> from langchain_openai import ChatOpenAI
>>>
>>> # Create a ranking metric that returns top 3 items
>>> llm = LangchainLLMWrapper(ChatOpenAI())
>>> metric = RankingMetric(
... name="relevance_ranking",
... llm=llm,
... allowed_values=3
... )
get_correlation
Calculate the correlation between gold labels and predictions. This is a placeholder method and should be implemented based on the specific metric.
Source code in src/ragas/metrics/ranking.py
load
classmethod
load(path: str, embedding_model: Optional[EmbeddingModelType] = None) -> RankingMetric
Load a RankingMetric from a JSON file.
Parameters:
path : str
    File path to load from. Supports .gz compressed files.
embedding_model : Optional[Any]
    Embedding model for DynamicFewShotPrompt. Required if the original used one.
Returns:
RankingMetric
    Loaded metric instance
Raises:
ValueError
    If file cannot be loaded or is not a RankingMetric
Source code in src/ragas/metrics/ranking.py
MetricResult
Class to hold the result of a metric evaluation.
This class behaves like its underlying result value but still provides access to additional metadata like reasoning.
Works with:
- DiscreteMetrics (string results)
- NumericMetrics (float/int results)
- RankingMetrics (list results)
Source code in src/ragas/metrics/result.py
to_dict
validate
classmethod
Provide compatibility with older Pydantic versions.
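A short sketch of how a MetricResult behaves, reusing the discrete_metric decorator documented just below; the value attribute and to_dict method are taken from this page.

```python
from ragas.metrics import discrete_metric

@discrete_metric(name="sentiment", allowed_values=["positive", "negative"])
def sentiment_metric(user_input: str, response: str) -> str:
    return "positive" if "good" in response else "negative"

result = sentiment_metric(user_input="How are you?", response="I'm good!")
print(result.value)      # "positive" -- the underlying result value
print(result.to_dict())  # serializable form of the result
```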
discrete_metric
discrete_metric(*, name: Optional[str] = None, allowed_values: Optional[List[str]] = None, **metric_params: Any) -> Callable[[Callable[..., Any]], DiscreteMetricProtocol]
Decorator for creating discrete/categorical metrics.
This decorator transforms a regular function into a DiscreteMetric instance that can be used for evaluation with predefined categorical outputs.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| name | str | Name for the metric. If not provided, uses the function name. | None |
| allowed_values | List[str] | List of allowed categorical values for the metric output. Default is ["pass", "fail"]. | None |
| **metric_params | Any | Additional parameters to pass to the metric initialization. | {} |
Returns:
| Type | Description |
|---|---|
| Callable[[Callable[..., Any]], DiscreteMetricProtocol] | A decorator that transforms a function into a DiscreteMetric instance. |
Examples:
>>> from ragas.metrics import discrete_metric
>>>
>>> @discrete_metric(name="sentiment", allowed_values=["positive", "neutral", "negative"])
>>> def sentiment_analysis(user_input: str, response: str) -> str:
... '''Analyze sentiment of the response.'''
... if "great" in response.lower() or "good" in response.lower():
... return "positive"
... elif "bad" in response.lower() or "poor" in response.lower():
... return "negative"
... return "neutral"
>>>
>>> result = sentiment_analysis(
... user_input="How was your day?",
... response="It was great!"
... )
>>> print(result.value) # "positive"
Source code in src/ragas/metrics/discrete.py
numeric_metric
numeric_metric(*, name: Optional[str] = None, allowed_values: Optional[Union[Tuple[float, float], range]] = None, **metric_params: Any) -> Callable[[Callable[..., Any]], NumericMetricProtocol]
Decorator for creating numeric/continuous metrics.
This decorator transforms a regular function into a NumericMetric instance that outputs continuous values within a specified range.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| name | str | Name for the metric. If not provided, uses the function name. | None |
| allowed_values | Union[Tuple[float, float], range] | The valid range for metric outputs as a (min, max) tuple or range object. Default is (0.0, 1.0). | None |
| **metric_params | Any | Additional parameters to pass to the metric initialization. | {} |
Returns:
| Type | Description |
|---|---|
| Callable[[Callable[..., Any]], NumericMetricProtocol] | A decorator that transforms a function into a NumericMetric instance. |
Examples:
>>> from ragas.metrics import numeric_metric
>>>
>>> @numeric_metric(name="relevance_score", allowed_values=(0.0, 1.0))
>>> def calculate_relevance(user_input: str, response: str) -> float:
... '''Calculate relevance score between 0 and 1.'''
... # Simple word overlap example
... user_words = set(user_input.lower().split())
... response_words = set(response.lower().split())
... if not user_words:
... return 0.0
... overlap = len(user_words & response_words)
... return overlap / len(user_words)
>>>
>>> result = calculate_relevance(
... user_input="What is Python?",
... response="Python is a programming language"
... )
>>> print(result.value) # Numeric score between 0.0 and 1.0
Source code in src/ragas/metrics/numeric.py
ranking_metric
ranking_metric(*, name: Optional[str] = None, allowed_values: Optional[int] = None, **metric_params: Any) -> Callable[[Callable[..., Any]], RankingMetricProtocol]
Decorator for creating ranking/ordering metrics.
This decorator transforms a regular function into a RankingMetric instance that outputs ordered lists of items.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| name | str | Name for the metric. If not provided, uses the function name. | None |
| allowed_values | int | Expected number of items in the ranking list. Default is 2. | None |
| **metric_params | Any | Additional parameters to pass to the metric initialization. | {} |
Returns:
| Type | Description |
|---|---|
| Callable[[Callable[..., Any]], RankingMetricProtocol] | A decorator that transforms a function into a RankingMetric instance. |
Examples:
>>> from ragas.metrics import ranking_metric
>>>
>>> @ranking_metric(name="priority_ranker", allowed_values=3)
>>> def rank_by_urgency(user_input: str, responses: list) -> list:
... '''Rank responses by urgency keywords.'''
... urgency_keywords = ["urgent", "asap", "critical"]
... scored = []
... for resp in responses:
... score = sum(kw in resp.lower() for kw in urgency_keywords)
... scored.append((score, resp))
... # Sort by score descending and return top items
... ranked = sorted(scored, key=lambda x: x[0], reverse=True)
... return [item[1] for item in ranked[:3]]
>>>
>>> result = rank_by_urgency(
... user_input="What should I do first?",
... responses=["This is urgent", "Take your time", "Critical issue!"]
... )
>>> print(result.value) # Ranked list of responses