Metrics#

This module provides metric classes for evaluating the performance of a RAG assistant or a retriever.

class flexrag.metrics.MetricsBase[source]#
abstract compute(questions=None, responses=None, golden_responses=None, retrieved_contexts=None, golden_contexts=None)[source]#

Compute the metric value.

Parameters:
  • questions (list[str], optional) – A list of questions. Defaults to None.

  • responses (list[str], optional) – A list of responses. Defaults to None.

  • golden_responses (list[list[str]], optional) – A list of golden responses. Defaults to None.

  • retrieved_contexts (list[list[str | RetrievedContext]], optional) – A list of retrieved contexts. Defaults to None.

  • golden_contexts (list[list[str]], optional) – A list of golden contexts. Defaults to None.

Returns:

The metric scores and the metadata of the metric.

Return type:

tuple[dict[str, float], dict]
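
As an illustration of this interface, the following is a minimal sketch of a metric that follows the documented compute() signature and return shape. MetricsBaseSketch is a hypothetical stand-in defined locally so that the example runs without FlexRAG installed, and ResponseLength is a toy metric invented for this example; neither is part of flexrag.metrics.

```python
from abc import ABC, abstractmethod


class MetricsBaseSketch(ABC):
    """Hypothetical stand-in mirroring the documented compute() signature."""

    @abstractmethod
    def compute(self, questions=None, responses=None, golden_responses=None,
                retrieved_contexts=None, golden_contexts=None):
        """Return (scores, metadata), i.e. tuple[dict[str, float], dict]."""


class ResponseLength(MetricsBaseSketch):
    """Toy metric (not part of FlexRAG): mean response length in words."""

    def compute(self, questions=None, responses=None, golden_responses=None,
                retrieved_contexts=None, golden_contexts=None):
        lengths = [len(r.split()) for r in responses]
        scores = {"response_length": sum(lengths) / len(lengths)}
        metadata = {"lengths": lengths}  # per-sample details
        return scores, metadata


scores, metadata = ResponseLength().compute(
    responses=["Paris", "It is in France"]
)
```

Here scores holds the aggregate value (a mean length of 2.5 words) and metadata carries the per-sample lengths.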

Helper Class#

The Evaluator takes a list of metrics, configured through EvaluatorConfig, and evaluates the performance of a RAG assistant or a retriever.

class flexrag.metrics.EvaluatorConfig(metrics_type=<factory>, generation_bleu_config=<factory>, generation_chrf_config=<factory>, generation_em_config=<factory>, generation_accuracy_config=<factory>, generation_f1_config=<factory>, generation_recall_config=<factory>, generation_precision_config=<factory>, retrieval_success_rate_config=<factory>, retrieval_recall_config=<factory>, retrieval_precision_config=<factory>, retrieval_map_config=<factory>, retrieval_ndcg_config=<factory>, round=2)[source]#
dump(path)#

Dump the dataclass to a YAML file.

dumps()#

Dump the dataclass to a YAML string.

classmethod load(path)#

Load the dataclass from a YAML file.

classmethod loads(s)#

Load the dataclass from a YAML string.

class flexrag.metrics.Evaluator(cfg)[source]#

Bases: object

evaluate(*, questions=None, responses=None, golden_responses=None, retrieved_contexts=None, golden_contexts=None, log=True)[source]#

Evaluate the generated responses and retrieved contexts against the golden responses and golden contexts, using the configured metrics.

Parameters:
  • questions (list[str], optional) – A list of questions. Defaults to None.

  • responses (list[str], optional) – A list of responses. Defaults to None.

  • golden_responses (list[list[str]], optional) – A list of golden responses. Defaults to None.

  • retrieved_contexts (list[list[str | RetrievedContext]], optional) – A list of retrieved contexts. Defaults to None.

  • golden_contexts (list[list[str]], optional) – A list of golden contexts. Defaults to None.

  • log (bool, optional) – Whether to log the evaluation results. Defaults to True.

Returns:

The evaluation results and the evaluation details.

Return type:

tuple[dict[str, float], dict[str, Any]]
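
The sketch below illustrates one way an evaluator could combine several metrics: call each one, merge the score dictionaries, and round the values. It is an assumption about the general pattern, not FlexRAG's implementation. The names evaluate_sketch, toy_em, and round_digits are hypothetical, with round_digits standing in for the round=2 option of EvaluatorConfig (assumed here to mean a number of decimal places).

```python
def evaluate_sketch(metrics, round_digits=2, **inputs):
    """Run each metric callable and merge the results (illustrative only)."""
    results, details = {}, {}
    for name, metric in metrics.items():
        scores, metadata = metric(**inputs)
        # Round every score to the configured number of decimal places.
        results.update({k: round(v, round_digits) for k, v in scores.items()})
        details[name] = metadata
    return results, details


def toy_em(responses, golden_responses, **_):
    """Toy metric: fraction of responses found verbatim among the goldens."""
    hits = [float(r in golds) for r, golds in zip(responses, golden_responses)]
    return {"toy_em": sum(hits) / len(hits)}, {"hits": hits}


results, details = evaluate_sketch(
    {"toy_em": toy_em},
    responses=["Paris", "Rome", "Oslo"],
    golden_responses=[["Paris"], ["Madrid"], ["Oslo"]],
)
```

Two of the three toy responses match, so the merged result is rounded from 0.666... to 0.67, while details keeps the unrounded per-sample hits.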

RAG Generation Metrics#

class flexrag.metrics.BLEUConfig(tokenizer='13a')#

Configuration for the BLEU metric. The BLEU score is computed with sacrebleu.

Parameters:

tokenizer (str) – The tokenizer to use. Defaults to sacrebleu.BLEU.TOKENIZER_DEFAULT ('13a'). For the available choices, refer to sacrebleu.BLEU.TOKENIZERS.

dump(path)#

Dump the dataclass to a YAML file.

dumps()#

Dump the dataclass to a YAML string.

classmethod load(path)#

Load the dataclass from a YAML file.

classmethod loads(s)#

Load the dataclass from a YAML string.

class flexrag.metrics.BLEU(cfg)#

Bases: MetricsBase

The BLEU metric.

class flexrag.metrics.Rouge#

Bases: MetricsBase

The Rouge metric. The Rouge score is computed with rouge. This metric returns the average of the Rouge-1, Rouge-2, and Rouge-L F1 scores.

compute(**kwargs)#

Compute the metric value.

Parameters:
  • questions (list[str], optional) – A list of questions. Defaults to None.

  • responses (list[str], optional) – A list of responses. Defaults to None.

  • golden_responses (list[list[str]], optional) – A list of golden responses. Defaults to None.

  • retrieved_contexts (list[list[str | RetrievedContext]], optional) – A list of retrieved contexts. Defaults to None.

  • golden_contexts (list[list[str]], optional) – A list of golden contexts. Defaults to None.

Returns:

The metric scores and the metadata of the metric.

Return type:

tuple[dict[str, float], dict]
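
The averaging described for Rouge (mean of the Rouge-1, Rouge-2, and Rouge-L F1 scores) can be sketched in plain Python as follows. This is a from-scratch illustration for a single response and a single golden response; the function names, the lowercase whitespace tokenization, and the clipped n-gram counting are assumptions of this sketch, and the rouge package may differ in those details.

```python
from collections import Counter


def _f1(overlap, n_candidate, n_reference):
    """Harmonic mean of precision and recall; 0.0 when nothing overlaps."""
    if overlap == 0:
        return 0.0
    p, r = overlap / n_candidate, overlap / n_reference
    return 2 * p * r / (p + r)


def rouge_n_f1(candidate, reference, n):
    """F1 over n-gram overlap, with counts clipped to the reference."""
    cand = Counter(zip(*(candidate[i:] for i in range(n))))
    ref = Counter(zip(*(reference[i:] for i in range(n))))
    overlap = sum((cand & ref).values())
    return _f1(overlap, sum(cand.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """F1 based on the longest common subsequence of tokens."""
    return _f1(lcs_length(candidate, reference), len(candidate), len(reference))


def mean_rouge_f1(response, golden):
    """Average of the Rouge-1, Rouge-2, and Rouge-L F1 scores."""
    c, r = response.lower().split(), golden.lower().split()
    return (rouge_n_f1(c, r, 1) + rouge_n_f1(c, r, 2) + rouge_l_f1(c, r)) / 3


score = mean_rouge_f1("the cat sat on the mat", "the cat is on the mat")
```

For this pair the three components are 5/6, 3/5, and 5/6, so the averaged score is 34/45, roughly 0.756.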

class flexrag.metrics.chrFConfig(chrf_beta=1.0, chrf_char_order=6, chrf_word_order=0)#

Configuration for the chrF metric. The chrF score is computed with sacrebleu.

Parameters:
  • chrf_beta (float) – The beta value of the F-score, which weights recall relative to precision. Defaults to 1.0.

  • chrf_char_order (int) – The maximum character n-gram order. Defaults to sacrebleu.CHRF.CHAR_ORDER (6).

  • chrf_word_order (int) – The maximum word n-gram order. Defaults to sacrebleu.CHRF.WORD_ORDER (0).

dump(path)#

Dump the dataclass to a YAML file.

dumps()#

Dump the dataclass to a YAML string.

classmethod load(path)#

Load the dataclass from a YAML file.

classmethod loads(s)#

Load the dataclass from a YAML string.

class flexrag.metrics.chrF(cfg)#

Bases: MetricsBase

The chrF metric.
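
For reference, chrF is commonly defined as the F-beta combination of character n-gram precision and recall. In the formula below, chrP and chrR denote precision and recall averaged over the n-gram orders 1 through chrf_char_order; the symbol names are chosen for this note rather than taken from FlexRAG.

```latex
\mathrm{chrF}_{\beta} = (1 + \beta^{2}) \, \frac{\mathrm{chrP} \cdot \mathrm{chrR}}{\beta^{2} \cdot \mathrm{chrP} + \mathrm{chrR}}
```

With chrf_beta = 1.0 this reduces to the harmonic mean of chrP and chrR; larger values weight recall more heavily.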

class flexrag.metrics.F1(cfg)[source]#

Bases: MatchingMetrics

The F1 metric computes the F1 score of the predicted response against the golden responses.
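
A common way to compute such a score is token-level F1, taking the best value over all golden responses. The sketch below follows that convention as an assumption; the function names, the lowercase whitespace tokenization, and the max-over-goldens rule are illustrative and may not match FlexRAG's MatchingMetrics.

```python
from collections import Counter


def token_f1(response, golden):
    """Token-level F1 between one response and one golden response."""
    pred, gold = response.lower().split(), golden.lower().split()
    # Multiset intersection counts each shared token at most as often
    # as it appears in both sequences.
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def best_f1(response, golden_responses):
    """Assumed convention: keep the best score over all golden responses."""
    return max(token_f1(response, g) for g in golden_responses)


score = best_f1("the capital is Paris", ["Paris", "Paris France"])
```

Against "Paris" the precision is 1/4 and the recall is 1, giving F1 = 0.4, which beats the 1/3 obtained against "Paris France".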

class flexrag.metrics.Accuracy(cfg)[source]#

Bases: MatchingMetrics

The Accuracy metric checks whether any of the golden responses appears in the predicted response.
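
The containment check can be sketched as a substring test, averaged over the samples. The function name contains_any and the case folding are assumptions of this sketch, not FlexRAG's exact normalization.

```python
def contains_any(response, golden_responses):
    """True if any golden response occurs as a substring of the response."""
    text = response.lower()  # case folding is an assumption of this sketch
    return any(g.lower() in text for g in golden_responses)


responses = ["The capital of France is Paris.", "I am not sure."]
golden_responses = [["Paris"], ["Berlin", "Bonn"]]

hits = [contains_any(r, g) for r, g in zip(responses, golden_responses)]
accuracy = sum(hits) / len(hits)
```

The first response contains "Paris" and the second contains neither golden answer, so the accuracy over these two samples is 0.5.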

class flexrag.metrics.ExactMatch(cfg)[source]#

Bases: MatchingMetrics

The ExactMatch metric checks whether any of the golden responses is exactly the same as the predicted response.
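
A minimal sketch of such a comparison is shown below. The normalize helper (lowercasing and collapsing whitespace) is a hypothetical choice made for this example; the normalization FlexRAG actually applies may differ.

```python
def normalize(text):
    """Hypothetical normalization: lowercase and collapse whitespace."""
    return " ".join(text.lower().split())


def exact_match(response, golden_responses):
    """True if the normalized response equals any normalized golden response."""
    target = normalize(response)
    return any(target == normalize(g) for g in golden_responses)


short_answer = exact_match("  paris ", ["Paris", "Paris, France"])
verbose_answer = exact_match("The capital is Paris", ["Paris"])
```

The short answer matches after normalization, while the verbose answer does not, even though it contains the golden response.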

class flexrag.metrics.Precision(cfg)[source]#

Bases: MatchingMetrics

The Precision metric computes the precision of the predicted response against the golden responses.

class flexrag.metrics.Recall(cfg)[source]#

Bases: MatchingMetrics

The Recall metric computes the recall of the predicted response against the golden responses.
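
Response-level precision and recall are often computed over tokens: precision is the share of response tokens that also occur in the golden response, and recall is the share of golden tokens that the response covers. The sketch below assumes that token-level definition with lowercase whitespace tokenization; token_precision_recall is a hypothetical name, and FlexRAG's implementation may differ.

```python
from collections import Counter


def token_precision_recall(response, golden):
    """Return (precision, recall) over whitespace tokens."""
    pred, gold = response.lower().split(), golden.lower().split()
    overlap = sum((Counter(pred) & Counter(gold)).values())
    precision = overlap / len(pred) if pred else 0.0
    recall = overlap / len(gold) if gold else 0.0
    return precision, recall


precision, recall = token_precision_recall("Paris is in France", "Paris France")
```

Both golden tokens are covered, so recall is 1.0, while only two of the four response tokens are golden tokens, so precision is 0.5.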

Information Retrieval Metrics#

class flexrag.metrics.SuccessRateConfig(eval_field=None, simplify=True)[source]#

Configuration for the SuccessRate metric. This metric checks whether the retrieved contexts contain any of the golden responses.

Parameters:
  • eval_field (Optional[str]) – The field of each retrieved context to evaluate. Defaults to None, in which case retrieved_contexts must contain plain strings.

  • simplify (bool) – Whether to simplify the retrieved contexts. Defaults to True.

dump(path)#

Dump the dataclass to a YAML file.

dumps()#

Dump the dataclass to a YAML string.

classmethod load(path)#

Load the dataclass from a YAML file.

classmethod loads(s)#

Load the dataclass from a YAML string.

class flexrag.metrics.SuccessRate(cfg)[source]#

Bases: MetricsBase

The SuccessRate metric checks whether the retrieved contexts contain any of the golden responses.
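
As a rough illustration with plain-string contexts (the eval_field=None case), a query counts as a success when at least one retrieved context contains at least one golden response. The function name retrieval_success and the case-insensitive substring test are assumptions of this sketch.

```python
def retrieval_success(retrieved_contexts, golden_responses):
    """True if any retrieved context contains any golden response."""
    return any(
        golden.lower() in context.lower()
        for context in retrieved_contexts
        for golden in golden_responses
    )


retrieved = [
    ["Paris is the capital of France.", "Lyon is a city in France."],
    ["Berlin is the capital of Germany."],
]
goldens = [["Paris"], ["Madrid"]]

successes = [retrieval_success(c, g) for c, g in zip(retrieved, goldens)]
success_rate = sum(successes) / len(successes)
```

Only the first query retrieves a context that mentions its golden response, so the success rate over the two queries is 0.5.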

class flexrag.metrics.RetrievalRecallConfig(k_values=<factory>)[source]#

Configuration for RetrievalRecall metric. This metric computes the recall of the retrieved contexts. The computation is based on pytrec_eval.

Parameters:

k_values (list[int]) – The k values for evaluation. Defaults to [1, 5, 10].

dump(path)#

Dump the dataclass to a YAML file.

dumps()#

Dump the dataclass to a YAML string.

classmethod load(path)#

Load the dataclass from a YAML file.

classmethod loads(s)#

Load the dataclass from a YAML string.

class flexrag.metrics.RetrievalRecall(cfg)[source]#

Bases: MetricsBase

The RetrievalRecall metric computes the recall of the retrieved contexts.
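
Recall at a cutoff k is the fraction of relevant contexts that appear among the top k retrieved ones. The sketch below is a from-scratch illustration with hypothetical document identifiers; it does not call pytrec_eval, whose handling of edge cases may differ.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant items found in the top-k retrieved items."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(relevant.intersection(retrieved[:k])) / len(relevant)


retrieved = ["d3", "d1", "d7", "d2", "d9"]  # ranked, best first
relevant = {"d1", "d2", "d4"}

recalls = {k: recall_at_k(retrieved, relevant, k) for k in (1, 5, 10)}
```

With the default k_values, recall is 0 at k=1 and 2/3 at k=5 and k=10, because d4 is never retrieved.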

class flexrag.metrics.RetrievalPrecisionConfig(k_values=<factory>)[source]#

Configuration for RetrievalPrecision metric. This metric computes the precision of the retrieved contexts. The computation is based on pytrec_eval.

Parameters:

k_values (list[int]) – The k values for evaluation. Defaults to [1, 5, 10].

dump(path)#

Dump the dataclass to a YAML file.

dumps()#

Dump the dataclass to a YAML string.

classmethod load(path)#

Load the dataclass from a YAML file.

classmethod loads(s)#

Load the dataclass from a YAML string.

class flexrag.metrics.RetrievalPrecision(cfg)[source]#

Bases: MetricsBase

The RetrievalPrecision metric computes the precision of the retrieved contexts.
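
Precision at a cutoff k is the number of relevant contexts in the top k divided by k. The sketch below is a from-scratch illustration with hypothetical document identifiers rather than a call into pytrec_eval. It divides by k even when fewer than k contexts were retrieved, which matches the trec_eval convention of treating the missing ranks as non-relevant.

```python
def precision_at_k(retrieved, relevant, k):
    """Relevant items in the top-k, divided by k (missing ranks count as misses)."""
    relevant = set(relevant)
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / k


retrieved = ["d3", "d1", "d7", "d2", "d9"]  # ranked, best first
relevant = {"d1", "d2", "d4"}

precisions = {k: precision_at_k(retrieved, relevant, k) for k in (1, 5, 10)}
```

Two relevant contexts are retrieved in total, so precision is 0 at k=1, 2/5 at k=5, and 2/10 at k=10.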

class flexrag.metrics.RetrievalMAPConfig(k_values=<factory>)[source]#

Configuration for the RetrievalMAP metric. This metric computes the Mean Average Precision (MAP) of the retrieved contexts. The computation is based on pytrec_eval.

Parameters:

k_values (list[int]) – The k values for evaluation. Defaults to [1, 5, 10].

dump(path)#

Dump the dataclass to a YAML file.

dumps()#

Dump the dataclass to a YAML string.

classmethod load(path)#

Load the dataclass from a YAML file.

classmethod loads(s)#

Load the dataclass from a YAML string.

class flexrag.metrics.RetrievalMAP(cfg)[source]#

Bases: MetricsBase

The RetrievalMAP metric computes the Mean Average Precision (MAP) of the retrieved contexts.
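
Average precision for one query sums the precision at each rank where a relevant context appears and divides by the number of relevant contexts; MAP is the mean of that value over queries. The sketch below is a from-scratch illustration with a rank cutoff k and hypothetical document identifiers. It does not call pytrec_eval, and the exact cutoff semantics there may differ.

```python
def average_precision_at_k(retrieved, relevant, k):
    """Sum of precision at each relevant rank (up to k) over the relevant count."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc in enumerate(retrieved[:k], start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / len(relevant)


def mean_average_precision(runs, k):
    """Mean of the per-query average precision values."""
    return sum(average_precision_at_k(r, g, k) for r, g in runs) / len(runs)


runs = [
    (["d3", "d1", "d7", "d2", "d9"], {"d1", "d2"}),  # hits at ranks 2 and 4
    (["a", "b"], {"a"}),                             # hit at rank 1
]
map_at_10 = mean_average_precision(runs, k=10)
```

The first query scores (1/2 + 2/4) / 2 = 0.5 and the second scores 1.0, so MAP at k=10 is 0.75.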

class flexrag.metrics.RetrievalNDCGConfig(k_values=<factory>)[source]#

Configuration for the RetrievalNDCG metric. This metric computes the Normalized Discounted Cumulative Gain (nDCG) of the retrieved contexts. The computation is based on pytrec_eval.

Parameters:

k_values (list[int]) – The k values for evaluation. Defaults to [1, 5, 10].

dump(path)#

Dump the dataclass to a YAML file.

dumps()#

Dump the dataclass to a YAML string.

classmethod load(path)#

Load the dataclass from a YAML file.

classmethod loads(s)#

Load the dataclass from a YAML string.

class flexrag.metrics.RetrievalNDCG(cfg)[source]#

Bases: MetricsBase

The RetrievalNDCG metric computes the Normalized Discounted Cumulative Gain (nDCG) of the retrieved contexts.
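
nDCG divides the discounted cumulative gain of the actual ranking by that of an ideal ranking, so a perfect ranking scores 1. The sketch below assumes binary relevance and the log2(rank + 1) discount; it is a from-scratch illustration with hypothetical document identifiers, not a call into pytrec_eval.

```python
import math


def dcg(gains):
    """Discounted cumulative gain with a log2(rank + 1) discount."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg_at_k(retrieved, relevant, k):
    """Binary-relevance nDCG at cutoff k."""
    relevant = set(relevant)
    gains = [1.0 if doc in relevant else 0.0 for doc in retrieved[:k]]
    # Ideal ranking: all relevant items first, truncated at k.
    ideal_dcg = dcg([1.0] * min(k, len(relevant)))
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


ndcg = ndcg_at_k(["d3", "d1", "d7", "d2", "d9"], {"d1", "d2"}, k=5)
perfect = ndcg_at_k(["d1", "d2", "d3"], {"d1", "d2"}, k=5)
```

In the first ranking the relevant contexts sit at ranks 2 and 4 instead of 1 and 2, which lowers the score to roughly 0.65; the second ranking places them first and scores 1.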