Metrics
This module provides metric classes for evaluating the performance of a RAG assistant or a retriever.
- class flexrag.metrics.MetricsBase
- abstract compute(questions=None, responses=None, golden_responses=None, retrieved_contexts=None, golden_contexts=None)
Compute the metric value.
- Parameters:
questions (list[str], optional) – A list of questions. Defaults to None.
responses (list[str], optional) – A list of responses. Defaults to None.
golden_responses (list[list[str]], optional) – A list of golden responses. Defaults to None.
retrieved_contexts (list[list[str | RetrievedContext]], optional) – A list of retrieved contexts. Defaults to None.
golden_contexts (list[list[str]], optional) – A list of golden contexts. Defaults to None.
- Returns:
The metric scores and the metadata of the metric.
- Return type:
tuple[dict[str, float], dict]
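The compute contract above (keyword inputs in, a pair of scores and metadata out) can be illustrated with a minimal sketch. It does not import flexrag: `MetricSketch` and `AverageLength` are hypothetical stand-ins that only mirror the documented signature and return type.

```python
from abc import ABC, abstractmethod
from typing import Dict, Tuple


class MetricSketch(ABC):
    """Hypothetical stand-in mirroring the documented compute() signature."""

    @abstractmethod
    def compute(
        self,
        questions=None,
        responses=None,
        golden_responses=None,
        retrieved_contexts=None,
        golden_contexts=None,
    ) -> Tuple[Dict[str, float], dict]:
        ...


class AverageLength(MetricSketch):
    """Toy metric: mean number of whitespace-separated tokens per response."""

    def compute(
        self,
        questions=None,
        responses=None,
        golden_responses=None,
        retrieved_contexts=None,
        golden_contexts=None,
    ):
        if not responses:
            raise ValueError("responses is required for this metric")
        lengths = [len(r.split()) for r in responses]
        scores = {"avg_length": sum(lengths) / len(lengths)}
        metadata = {"lengths": lengths}  # per-sample details
        return scores, metadata


scores, details = AverageLength().compute(responses=["a b c", "d"])
```

Each metric only reads the arguments it needs, which is presumably why every parameter defaults to None.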
Helper Class
The Evaluator takes a list of metrics and evaluates the performance of a RAG assistant or a retriever.
- class flexrag.metrics.EvaluatorConfig(metrics_type=<factory>, generation_bleu_config=<factory>, generation_chrf_config=<factory>, generation_em_config=<factory>, generation_accuracy_config=<factory>, generation_f1_config=<factory>, generation_recall_config=<factory>, generation_precision_config=<factory>, retrieval_success_rate_config=<factory>, retrieval_recall_config=<factory>, retrieval_precision_config=<factory>, retrieval_map_config=<factory>, retrieval_ndcg_config=<factory>, round=2)
- dump(path)
Dump the dataclass to a YAML file.
- dumps()
Dump the dataclass to a YAML string.
- classmethod load(path)
Load the dataclass from a YAML file.
- classmethod loads(s)
Load the dataclass from a YAML string.
- class flexrag.metrics.Evaluator(cfg)
Bases: object
- evaluate(*, questions=None, responses=None, golden_responses=None, retrieved_contexts=None, golden_contexts=None, log=True)
Evaluate the generated responses against the ground truth responses.
- Parameters:
questions (list[str], optional) – A list of questions. Defaults to None.
responses (list[str], optional) – A list of responses. Defaults to None.
golden_responses (list[list[str]], optional) – A list of golden responses. Defaults to None.
retrieved_contexts (list[list[str | RetrievedContext]], optional) – A list of retrieved contexts. Defaults to None.
golden_contexts (list[list[str]], optional) – A list of golden contexts. Defaults to None.
log (bool, optional) – Whether to log the evaluation results. Defaults to True.
- Returns:
The evaluation results and the evaluation details.
- Return type:
tuple[dict[str, float], dict[str, Any]]
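To make the two-part return value concrete, the following sketch shows one way an evaluator could run several metrics, merge their scores into a single dictionary, and keep per-metric details. It is not flexrag's implementation: `evaluate_sketch`, `non_empty_rate`, and `avg_tokens` are hypothetical, and treating `ndigits` as the counterpart of the `round` field in EvaluatorConfig is an assumption.

```python
def non_empty_rate(responses):
    """Toy metric: fraction of responses that are not blank."""
    flags = [float(bool(r.strip())) for r in responses]
    return {"non_empty_rate": sum(flags) / len(flags)}, {"flags": flags}


def avg_tokens(responses):
    """Toy metric: mean whitespace token count per response."""
    counts = [len(r.split()) for r in responses]
    return {"avg_tokens": sum(counts) / len(counts)}, {"counts": counts}


def evaluate_sketch(metrics, *, responses, ndigits=2):
    """Run every metric, merge rounded scores, and collect details by name."""
    results, details = {}, {}
    for name, metric in metrics.items():
        scores, meta = metric(responses)
        results.update({key: round(value, ndigits) for key, value in scores.items()})
        details[name] = meta
    return results, details


results, details = evaluate_sketch(
    {"non_empty": non_empty_rate, "length": avg_tokens},
    responses=["Paris", "", "It is Rome"],
)
```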
RAG Generation Metrics
- class flexrag.metrics.BLEUConfig(tokenizer='13a')
Configuration for the BLEU metric. The computation of BLEU score is based on sacrebleu.
- Parameters:
tokenizer (str) – The tokenizer to use. Defaults to sacrebleu.BLEU.TOKENIZER_DEFAULT. Available choices: Please refer to sacrebleu.BLEU.TOKENIZERS.
- dump(path)
Dump the dataclass to a YAML file.
- dumps()
Dump the dataclass to a YAML string.
- classmethod load(path)
Load the dataclass from a YAML file.
- classmethod loads(s)
Load the dataclass from a YAML string.
- class flexrag.metrics.BLEU(cfg)
Bases: MetricsBase
The BLEU metric.
- class flexrag.metrics.Rouge
Bases: MetricsBase
The Rouge metric. The computation of Rouge score is based on rouge. This metric will return the average of the Rouge-1, Rouge-2, and Rouge-L F1 scores.
- compute(**kwargs)
Compute the metric value.
- Parameters:
questions (list[str], optional) – A list of questions. Defaults to None.
responses (list[str], optional) – A list of responses. Defaults to None.
golden_responses (list[list[str]], optional) – A list of golden responses. Defaults to None.
retrieved_contexts (list[list[str | RetrievedContext]], optional) – A list of retrieved contexts. Defaults to None.
golden_contexts (list[list[str]], optional) – A list of golden contexts. Defaults to None.
- Returns:
The metric scores and the metadata of the metric.
- Return type:
tuple[dict[str, float], dict]
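The averaged score described for Rouge can be sketched from first principles. This is an illustration only, not the rouge package or flexrag's code: it assumes whitespace tokenization and a single reference per response, and all function names are hypothetical.

```python
from collections import Counter


def ngram_f1(hyp, ref, n):
    """F1 over clipped n-gram overlap between two token lists."""
    hyp_ngrams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    overlap = sum((hyp_ngrams & ref_ngrams).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp_ngrams.values())
    recall = overlap / sum(ref_ngrams.values())
    return 2 * precision * recall / (precision + recall)


def lcs_length(a, b):
    """Length of the longest common subsequence via dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l_f1(hyp, ref):
    """F1 computed from the longest common subsequence length."""
    lcs = lcs_length(hyp, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(hyp), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def rouge_average(response, golden):
    """Mean of the Rouge-1, Rouge-2, and Rouge-L F1 scores."""
    hyp, ref = response.split(), golden.split()
    return (ngram_f1(hyp, ref, 1) + ngram_f1(hyp, ref, 2) + rouge_l_f1(hyp, ref)) / 3


score = rouge_average("the cat sat on the mat", "the cat sat on the mat")
```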
- class flexrag.metrics.chrFConfig(chrf_beta=1.0, chrf_char_order=6, chrf_word_order=0)
Configuration for the chrF metric. The computation of chrF score is based on sacrebleu.
- Parameters:
chrf_beta (float) – The beta value for the F-score. Defaults to 1.0.
chrf_char_order (int) – The order of characters. Defaults to sacrebleu.CHRF.CHAR_ORDER.
chrf_word_order (int) – The order of words. Defaults to sacrebleu.CHRF.WORD_ORDER.
- dump(path)
Dump the dataclass to a YAML file.
- dumps()
Dump the dataclass to a YAML string.
- classmethod load(path)
Load the dataclass from a YAML file.
- classmethod loads(s)
Load the dataclass from a YAML string.
- class flexrag.metrics.chrF(cfg)
Bases: MetricsBase
The chrF metric.
- class flexrag.metrics.F1(cfg)
Bases: MatchingMetrics
F1 metric computes the F1 score of the predicted response against the golden responses.
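A token-level F1 of this kind can be sketched as follows. The details are assumptions rather than flexrag's documented behavior: the sketch lowercases, splits on whitespace, and keeps the best score over the golden responses; `token_f1` and `best_f1` are hypothetical names.

```python
from collections import Counter


def token_f1(response, golden):
    """Token-level F1 between one response and one golden response."""
    pred, gold = response.lower().split(), golden.lower().split()
    common = sum((Counter(pred) & Counter(gold)).values())
    if common == 0:
        return 0.0
    precision, recall = common / len(pred), common / len(gold)
    return 2 * precision * recall / (precision + recall)


def best_f1(response, golden_responses):
    """Score against every golden response and keep the best match."""
    return max(token_f1(response, g) for g in golden_responses)


score = best_f1("the capital is Paris", ["Paris", "Paris France"])
```

Here one of four predicted tokens matches the single-token golden response, so precision is 0.25, recall is 1.0, and the F1 is 0.4.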
- class flexrag.metrics.Accuracy(cfg)
Bases: MatchingMetrics
Accuracy metric computes if any of the golden responses is in the predicted response.
- class flexrag.metrics.ExactMatch(cfg)
Bases: MatchingMetrics
ExactMatch metric computes if any of the golden responses is exactly the same as the predicted response.
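The difference between Accuracy (containment) and ExactMatch (equality) shows up on a single example. The sketch below compares raw strings; any text normalization the library may apply before matching is not modeled, and both function names are hypothetical.

```python
def accuracy_hit(response, golden_responses):
    """True if any golden response occurs as a substring of the response."""
    return any(g in response for g in golden_responses)


def exact_match_hit(response, golden_responses):
    """True if any golden response equals the response exactly."""
    return any(g == response for g in golden_responses)


response = "The answer is Paris"
goldens = ["Paris", "paris, France"]
acc = accuracy_hit(response, goldens)    # "Paris" is contained in the response
em = exact_match_hit(response, goldens)  # no golden response equals the response
```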
Information Retrieval Metrics
- class flexrag.metrics.SuccessRateConfig(eval_field=None, simplify=True)
Configuration for the SuccessRate metric. This metric computes whether the retrieved contexts contain any of the golden responses.
- Parameters:
eval_field (Optional[str]) – The field to evaluate. Defaults to None. If None, only strings are supported as the retrieved_contexts.
simplify (bool) – Whether to simplify the retrieved contexts. Defaults to True.
- dump(path)
Dump the dataclass to a YAML file.
- dumps()
Dump the dataclass to a YAML string.
- classmethod load(path)
Load the dataclass from a YAML file.
- classmethod loads(s)
Load the dataclass from a YAML string.
- class flexrag.metrics.SuccessRate(cfg)
Bases: MetricsBase
The SuccessRate metric computes whether the retrieved contexts contain any of the golden responses.
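A minimal sketch of this check, assuming plain-string contexts (the eval_field=None case) and raw substring matching. The simplify option is not modeled, and `success_rate` is a hypothetical helper, not the library's function.

```python
def success_rate(retrieved_contexts, golden_responses):
    """Fraction of samples where any context contains any golden response."""
    hits = [
        any(g in ctx for ctx in contexts for g in goldens)
        for contexts, goldens in zip(retrieved_contexts, golden_responses)
    ]
    return sum(hits) / len(hits), hits


rate, hits = success_rate(
    retrieved_contexts=[
        ["Paris is the capital of France.", "Lyon is a city."],
        ["Berlin is in Germany."],
    ],
    golden_responses=[["Paris"], ["Madrid"]],
)
```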
- class flexrag.metrics.RetrievalRecallConfig(k_values=<factory>)
Configuration for the RetrievalRecall metric. This metric computes the recall of the retrieved contexts. The computation is based on pytrec_eval.
- Parameters:
k_values (list[int]) – The k values for evaluation. Defaults to [1, 5, 10].
- dump(path)
Dump the dataclass to a YAML file.
- dumps()
Dump the dataclass to a YAML string.
- classmethod load(path)
Load the dataclass from a YAML file.
- classmethod loads(s)
Load the dataclass from a YAML string.
- class flexrag.metrics.RetrievalRecall(cfg)
Bases: MetricsBase
The RetrievalRecall metric computes the recall of the retrieved contexts.
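Recall at a cutoff k can be sketched without pytrec_eval. The document identifiers and the `recall_at_k` helper are hypothetical; how flexrag maps contexts to identifiers before calling pytrec_eval is not covered here.

```python
def recall_at_k(retrieved, relevant, k):
    """Share of the relevant items that appear in the top-k retrieved items."""
    relevant_set = set(relevant)
    if not relevant_set:
        return 0.0
    top_k = set(retrieved[:k])
    return len(top_k & relevant_set) / len(relevant_set)


retrieved = ["d3", "d1", "d7", "d2", "d9"]  # ranked best first
relevant = ["d1", "d2"]
scores = {k: recall_at_k(retrieved, relevant, k) for k in (1, 5, 10)}
```

With the default k values, the top-1 result misses both relevant items, while the top 5 and top 10 cover both.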
- class flexrag.metrics.RetrievalPrecisionConfig(k_values=<factory>)
Configuration for the RetrievalPrecision metric. This metric computes the precision of the retrieved contexts. The computation is based on pytrec_eval.
- Parameters:
k_values (list[int]) – The k values for evaluation. Defaults to [1, 5, 10].
- dump(path)
Dump the dataclass to a YAML file.
- dumps()
Dump the dataclass to a YAML string.
- classmethod load(path)
Load the dataclass from a YAML file.
- classmethod loads(s)
Load the dataclass from a YAML string.
- class flexrag.metrics.RetrievalPrecision(cfg)
Bases: MetricsBase
The RetrievalPrecision metric computes the precision of the retrieved contexts.
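Precision at a cutoff k can be sketched in the same style. The sketch divides by k even when fewer than k items were retrieved, which follows the trec_eval convention of treating missing ranks as non-relevant; `precision_at_k` and the document identifiers are hypothetical.

```python
def precision_at_k(retrieved, relevant, k):
    """Relevant items among the top-k results, divided by k."""
    relevant_set = set(relevant)
    hits = sum(1 for doc in retrieved[:k] if doc in relevant_set)
    return hits / k


retrieved = ["d3", "d1", "d7", "d2", "d9"]  # ranked best first
relevant = ["d1", "d2"]
scores = {k: precision_at_k(retrieved, relevant, k) for k in (1, 5, 10)}
```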
- class flexrag.metrics.RetrievalMAPConfig(k_values=<factory>)
Configuration for the RetrievalMAP metric. This metric computes the MAP of the retrieved contexts. The computation is based on pytrec_eval.
- Parameters:
k_values (list[int]) – The k values for evaluation. Defaults to [1, 5, 10].
- dump(path)
Dump the dataclass to a YAML file.
- dumps()
Dump the dataclass to a YAML string.
- classmethod load(path)
Load the dataclass from a YAML file.
- classmethod loads(s)
Load the dataclass from a YAML string.
- class flexrag.metrics.RetrievalMAP(cfg)
Bases: MetricsBase
The RetrievalMAP metric computes the Mean Average Precision (MAP) of the retrieved contexts.
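Mean Average Precision averages, over queries, the precision measured at each rank where a relevant item appears. The sketch below uses binary relevance and hypothetical helper names; it illustrates the standard definition rather than the pytrec_eval code path.

```python
def average_precision(retrieved, relevant, k=None):
    """Sum of precision at each relevant hit, divided by the relevant count."""
    relevant_set = set(relevant)
    if not relevant_set:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc in enumerate(retrieved[:k], start=1):  # k=None keeps all ranks
        if doc in relevant_set:
            hits += 1
            total += hits / rank
    return total / len(relevant_set)


def mean_average_precision(all_retrieved, all_relevant, k=None):
    """Mean of the per-query average precision values."""
    aps = [average_precision(r, g, k) for r, g in zip(all_retrieved, all_relevant)]
    return sum(aps) / len(aps)


ap = average_precision(["d3", "d1", "d7", "d2"], ["d1", "d2"])
map_score = mean_average_precision(
    [["d3", "d1", "d7", "d2"], ["d5", "d6"]],
    [["d1", "d2"], ["d5"]],
)
```

In the first query the relevant items sit at ranks 2 and 4, giving precisions 1/2 and 2/4, so its average precision is 0.5.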
- class flexrag.metrics.RetrievalNDCGConfig(k_values=<factory>)
Configuration for the RetrievalNDCG metric. This metric computes the nDCG of the retrieved contexts. The computation is based on pytrec_eval.
- Parameters:
k_values (list[int]) – The k values for evaluation. Defaults to [1, 5, 10].
- dump(path)
Dump the dataclass to a YAML file.
- dumps()
Dump the dataclass to a YAML string.
- classmethod load(path)
Load the dataclass from a YAML file.
- classmethod loads(s)
Load the dataclass from a YAML string.
- class flexrag.metrics.RetrievalNDCG(cfg)
Bases: MetricsBase
The RetrievalNDCG metric computes the Normalized Discounted Cumulative Gain (nDCG) of the retrieved contexts.
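nDCG discounts each relevant hit by the logarithm of its rank and normalizes by the best achievable ordering. The sketch below assumes binary relevance and a log2(rank + 1) discount; `ndcg_at_k` is a hypothetical helper, not the pytrec_eval interface.

```python
import math


def ndcg_at_k(retrieved, relevant, k):
    """Binary-relevance nDCG with a log2(rank + 1) discount."""
    relevant_set = set(relevant)
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc in enumerate(retrieved[:k], start=1)
        if doc in relevant_set
    )
    # Ideal ranking: all relevant items first, limited to k positions.
    ideal_hits = min(len(relevant_set), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


score = ndcg_at_k(["d1", "d3", "d2"], ["d1", "d2"], k=3)
perfect = ndcg_at_k(["d1", "d2"], ["d1", "d2"], k=2)
```

The first ranking places a non-relevant item between the two relevant ones, so its score falls below that of the perfect ordering.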