Datasets

This module provides a set of classes and functions for loading and processing datasets.

class flexrag.datasets.IterableDataset

Bases: Iterable[ItemTypeI], Generic[ItemTypeI]

IterableDataset is the base class for datasets that can be iterated over.

Subclasses of IterableDataset should implement the following method:

>>> # return an iterator over the items in the dataset.
>>> def __iter__(self) -> Iterator[ItemTypeI]: ...

The following method is implemented automatically:

>>> # concatenate multiple IterableDatasets.
>>> def __add__(self, other: IterableDataset[ItemTypeI]) -> IterableDataset[ItemTypeI]: ...

For example:

>>> class MyDataset(IterableDataset[int]):
...     def __init__(self, n: int):
...         self.n = n
...         return
...
...     def __iter__(self) -> Iterator[int]:
...         for i in range(self.n):
...             yield i
...
>>> dataset = MyDataset(3)
>>> # Iterate over the dataset.
>>> for item in dataset:
...     print(item)

class flexrag.datasets.MappingDataset

Bases: Generic[ItemTypeM]

MappingDataset is the base class for datasets that can be indexed by integers.

Subclasses of MappingDataset should implement the following methods:

>>> # return the item at the given index.
>>> def __getitem__(self, index: int) -> ItemTypeM: ...
>>> # return the number of items in the dataset.
>>> def __len__(self) -> int: ...

The following methods are implemented automatically:

>>> # concatenate multiple MappingDatasets.
>>> def __add__(self, other: MappingDataset[ItemTypeM]) -> MappingDataset[ItemTypeM]: ...
>>> # return whether the dataset contains the given index.
>>> def __contains__(self, key: int) -> bool: ...
>>> # return an iterator over the items in the dataset.
>>> def __iter__(self) -> Iterator[ItemTypeM]: ...

For example:

>>> class MyDataset(MappingDataset[int]):
...     def __init__(self, n: int):
...         self.n = n
...         return
...
...     def __getitem__(self, index: int) -> int:
...         if 0 <= index < self.n:
...             return index
...         raise IndexError(f"Index {index} out of range.")
...
...     def __len__(self) -> int:
...         return self.n
...
>>> dataset = MyDataset(3)
>>> for i in range(len(dataset)):
...     print(dataset[i])
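
The automatically provided methods can be derived from `__getitem__` and `__len__` alone. The sketch below shows one plausible way to do this in plain Python. `MappingSketch` and `Squares` are hypothetical names used only for illustration, and the actual MappingDataset implementation may differ.

```python
from typing import Iterator


class MappingSketch:
    """Hypothetical base class: subclasses supply __getitem__ and __len__."""

    def __getitem__(self, index: int) -> int:
        raise NotImplementedError

    def __len__(self) -> int:
        raise NotImplementedError

    def __contains__(self, key: int) -> bool:
        # Assumption: a key is "contained" when it is a valid index.
        return 0 <= key < len(self)

    def __iter__(self) -> Iterator[int]:
        # Visit every valid index in order.
        for index in range(len(self)):
            yield self[index]


class Squares(MappingSketch):
    def __init__(self, n: int) -> None:
        self.n = n

    def __getitem__(self, index: int) -> int:
        if 0 <= index < self.n:
            return index * index
        raise IndexError(f"Index {index} out of range.")

    def __len__(self) -> int:
        return self.n


squares = Squares(4)
print(list(squares))               # items produced by the derived __iter__
print(2 in squares, 7 in squares)  # membership is checked against the index range
```

Deriving iteration from indexing keeps subclasses small: they describe random access once and get sequential access for free.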

class flexrag.datasets.ChainDataset(*datasets)

Bases: IterableDataset[ItemTypeChain]

ChainDataset concatenates multiple IterableDatasets.
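
As a rough illustration of what chaining means for iterable datasets, the sketch below uses plain Python and `itertools.chain`. The `Numbers` and `Chained` classes are hypothetical stand-ins and are not taken from the ChainDataset source.

```python
from itertools import chain
from typing import Iterable, Iterator


class Numbers:
    """Hypothetical iterable dataset yielding integers in [start, stop)."""

    def __init__(self, start: int, stop: int) -> None:
        self.start = start
        self.stop = stop

    def __iter__(self) -> Iterator[int]:
        yield from range(self.start, self.stop)


class Chained:
    """Hypothetical stand-in: iterate over several datasets back to back."""

    def __init__(self, *datasets: Iterable[int]) -> None:
        self.datasets = datasets

    def __iter__(self) -> Iterator[int]:
        # chain.from_iterable exhausts each dataset in the order given.
        return chain.from_iterable(self.datasets)


chained = Chained(Numbers(0, 2), Numbers(10, 12))
chained_items = list(chained)
print(chained_items)
```

Because a fresh chain is built on every call to `__iter__`, the combined dataset can be iterated more than once as long as the underlying datasets can.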

class flexrag.datasets.ConcatDataset(*datasets)

Bases: MappingDataset[ItemTypeConcat]

ConcatDataset concatenates multiple MappingDatasets.
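
Concatenating indexable datasets requires translating a global index into a (dataset, local index) pair. The sketch below shows one common way to do this with cumulative lengths and `bisect`. `ConcatSketch` is a hypothetical class written for this page, and the real ConcatDataset may be implemented differently.

```python
from bisect import bisect_right
from itertools import accumulate
from typing import Sequence


class ConcatSketch:
    """Hypothetical stand-in for index-based concatenation."""

    def __init__(self, *datasets: Sequence[str]) -> None:
        self.datasets = datasets
        # Cumulative sizes, e.g. lengths (2, 3) give [2, 5].
        self.offsets = list(accumulate(len(d) for d in datasets))

    def __len__(self) -> int:
        return self.offsets[-1] if self.offsets else 0

    def __getitem__(self, index: int) -> str:
        if not 0 <= index < len(self):
            raise IndexError(f"Index {index} out of range.")
        # Find the first dataset whose cumulative size exceeds the index.
        which = bisect_right(self.offsets, index)
        start = self.offsets[which - 1] if which > 0 else 0
        return self.datasets[which][index - start]


concat = ConcatSketch(["a", "b"], ["c", "d", "e"])
concat_items = [concat[i] for i in range(len(concat))]
print(concat_items)
```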

class flexrag.datasets.HFDatasetConfig(path, name=None, data_dir=None, data_files=<factory>, split=None, cache_dir=None, token=None, trust_remote_code=False)

Bases: object

The configuration for HFDataset. HFDataset is a wrapper class that uses the load_dataset function from the Hugging Face datasets library to load the dataset.

Parameters:

path (str) – Path or name of the dataset.
name (Optional[str]) – The name of the dataset configuration.
data_dir (Optional[str]) – The data_dir of the dataset configuration.
data_files (list[str]) – Paths to the source data files.
split (Optional[str]) – Which split of the data to load.
cache_dir (Optional[str]) – Directory to read/write data.
token (Optional[str]) – Optional string to use as the Bearer token for remote files on the Datasets Hub.
trust_remote_code (bool) – Whether to allow datasets defined on the Hub using a dataset script.

For example, you can load a dataset from the Hugging Face Hub by running the following code:

>>> cfg = HFDatasetConfig(
...     path="mteb/nq",
...     split="test",
... )
>>> dataset = HFDataset(cfg)

You can also load a dataset from local files by setting path to a format loader such as "json" and listing the files in data_files:

>>> cfg = HFDatasetConfig(
...     path="json",
...     data_files=["path/to/local/my_dataset.json"],
... )
>>> dataset = HFDataset(cfg)

For more information about the parameters, please refer to the HuggingFace datasets documentation: https://huggingface.co/docs/datasets/main/en/package_reference/loading_methods#datasets.load_dataset
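
Since the configuration fields mirror the keyword arguments of `load_dataset`, the mapping can be pictured as below. This is only a sketch: `HFConfigSketch` and `to_load_kwargs` are hypothetical names, and treating an empty `data_files` list as "not specified" is an assumption rather than documented behavior.

```python
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class HFConfigSketch:
    """Hypothetical mirror of the fields listed above."""

    path: str
    name: Optional[str] = None
    data_dir: Optional[str] = None
    data_files: list[str] = field(default_factory=list)
    split: Optional[str] = None
    cache_dir: Optional[str] = None
    token: Optional[str] = None
    trust_remote_code: bool = False


def to_load_kwargs(cfg: HFConfigSketch) -> dict:
    """Convert the config into keyword arguments for a loader call."""
    kwargs = asdict(cfg)
    # Assumption: an empty list means that no data files were specified.
    if not kwargs["data_files"]:
        kwargs["data_files"] = None
    return kwargs


kwargs = to_load_kwargs(HFConfigSketch(path="mteb/nq", split="test"))
print(kwargs)
# With the `datasets` package installed, these could be forwarded as
# datasets.load_dataset(**kwargs); accepted arguments vary by version.
```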

dump(path): Dump the dataclass to a YAML file.

dumps(): Dump the dataclass to a YAML string.

classmethod load(path): Load the dataclass from a YAML file.

classmethod loads(s): Load the dataclass from a YAML string.

class flexrag.datasets.HFDataset(cfg)

Bases: MappingDataset

HFDataset is a dataset that wraps the Hugging Face datasets library.

class flexrag.datasets.LineDelimitedDatasetConfig(file_paths, data_ranges=<factory>, encoding='utf-8')

Bases: object

The configuration for LineDelimitedDataset.

Parameters:

file_paths (list[str]) – The paths to the line-delimited files. Unix-style path patterns are supported.
data_ranges (list[list[int, int]]) – The data ranges to load from the files, given as one [start_point, end_point] pair per file. If end_point is -1, the file is read to the end. If not specified, the whole file is read.
encoding (str) – The encoding of the files.

Example 1: Loading specific lines from given files.

>>> cfg = LineDelimitedDatasetConfig(
...     file_paths=["data1.jsonl", "data2.csv"],
...     data_ranges=[[0, 10], [0, 20]],
...     encoding="utf-8",
... )
>>> dataset = LineDelimitedDataset(cfg)
>>> items = [i for i in dataset]

Example 2: Loading multiple files using a Unix-style path pattern.

>>> cfg = LineDelimitedDatasetConfig(
...     file_paths=["data/*.jsonl"],
...     encoding="utf-8",
... )
>>> dataset = LineDelimitedDataset(cfg)
>>> items = [i for i in dataset]
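
To make the range and pattern semantics concrete, the sketch below reads a line range from every file matching a glob pattern using only the standard library. The `read_lines` helper is hypothetical. It assumes that end_point is exclusive and, for simplicity, applies a single range to all matched files, neither of which is confirmed by the documentation above.

```python
import glob
import os
import tempfile
from itertools import islice
from typing import Iterator


def read_lines(pattern: str, start: int = 0, end: int = -1,
               encoding: str = "utf-8") -> Iterator[str]:
    """Yield lines [start, end) from every file matching a glob pattern.

    Assumption: end is exclusive, and -1 means "read to the end of the file".
    """
    stop = None if end == -1 else end
    for path in sorted(glob.glob(pattern)):
        with open(path, encoding=encoding) as handle:
            for line in islice(handle, start, stop):
                yield line.rstrip("\n")


# Build two small files so the sketch can run anywhere.
tmp_dir = tempfile.mkdtemp()
for name in ("a.jsonl", "b.jsonl"):
    with open(os.path.join(tmp_dir, name), "w", encoding="utf-8") as handle:
        handle.write("".join(f'{{"id": "{name}-{i}"}}\n' for i in range(5)))

pattern = os.path.join(tmp_dir, "*.jsonl")
first_two = list(read_lines(pattern, start=0, end=2))  # two lines per file
tail = list(read_lines(pattern, start=3, end=-1))      # line 3 onward per file
print(first_two)
print(tail)
```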

dump(path): Dump the dataclass to a YAML file.

dumps(): Dump the dataclass to a YAML string.

classmethod load(path): Load the dataclass from a YAML file.

classmethod loads(s): Load the dataclass from a YAML string.

class flexrag.datasets.LineDelimitedDataset(cfg)

Bases: IterableDataset

An iterable dataset for loading line-delimited files (csv, tsv, jsonl).

class flexrag.datasets.RAGEvalData(question, golden_contexts=None, golden_answers=None, meta_data=<factory>)

The dataclass for knowledge-intensive QA tasks.

Parameters:

question (str) – The question for evaluation. Required.
golden_contexts (Optional[list[Context]]) – The contexts related to the question. Default: None.
golden_answers (Optional[list[str]]) – The golden answers for the question. Default: None.
meta_data (dict) – The metadata of the evaluation data. Default: {}.
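
The `<factory>` default in the signature means the field is created by a default factory, so each instance gets its own empty dict. The sketch below illustrates this with a hypothetical `EvalRecord` dataclass that mirrors the fields listed above. It is not the RAGEvalData class itself, and the contexts are typed as a plain list for simplicity.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class EvalRecord:
    """Hypothetical mirror of the fields listed above."""

    question: str
    golden_contexts: Optional[list] = None
    golden_answers: Optional[list[str]] = None
    # default_factory gives each instance its own empty dict.
    meta_data: dict = field(default_factory=dict)


first = EvalRecord(
    question="Who wrote Hamlet?",
    golden_answers=["William Shakespeare"],
)
second = EvalRecord(question="What is the capital of France?")

# Mutating one record's metadata does not leak into the other.
first.meta_data["source"] = "demo"
print(first.meta_data, second.meta_data)
```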

dump(path): Dump the dataclass to a YAML file.

dumps(): Dump the dataclass to a YAML string.

classmethod load(path): Load the dataclass from a YAML file.

classmethod loads(s): Load the dataclass from a YAML string.

class flexrag.datasets.RAGEvalDatasetConfig(path='RUC-NLPIR/FlashRAG_datasets', name=None, data_dir=None, data_files=<factory>, split=None, cache_dir=None, token=None, trust_remote_code=False)

Bases: HFDatasetConfig

The configuration for RAGEvalDataset. This dataset helps to load the evaluation dataset collected by FlashRAG. The __iter__ method will yield RAGEvalData objects.

For example, you can load the test set of the NaturalQuestions dataset by running the following code:

from flexrag.datasets import RAGEvalDataset, RAGEvalDatasetConfig

cfg = RAGEvalDatasetConfig(
    name="nq",
    split="test",
)
dataset = RAGEvalDataset(cfg)

You can also load the dataset from a local copy. For example, you can download the dataset by running the following commands:

git lfs install
git clone https://huggingface.co/datasets/RUC-NLPIR/FlashRAG_datasets flashrag

Then you can load the dataset by running the following code. The split is "train" because the Hugging Face json loader places files passed through data_files in a split named train by default:

from flexrag.datasets import RAGEvalDataset, RAGEvalDatasetConfig

cfg = RAGEvalDatasetConfig(
    path="json",
    data_files=["flashrag/nq/test.jsonl"],
    split="train",
)
dataset = RAGEvalDataset(cfg)

Available datasets include:

2wikimultihopqa: dev, train

ambig_qa: dev, train

arc: dev, test, train

asqa: dev, train

ay2: dev, train

bamboogle: test

boolq: dev, train

commonsenseqa: dev, train

curatedtrec: test, train

eli5: dev, train

fermi: dev, test, train

fever: dev, train

hellaswag: dev, train

hotpotqa: dev, train

mmlu: 5_shot, dev, test, train

msmarco-qa: dev, train

musique: dev, train

narrativeqa: dev, test, train

nq: dev, test, train

openbookqa: dev, test, train

piqa: dev, train

popqa: test

quartz: dev, test, train

siqa: dev, train

squad: dev, train

t-rex: dev, train

triviaqa: dev, test, train

truthful_qa: dev

web_questions: test, train

wikiasp: dev, test, train

wikiqa: dev, test, train

wned: dev

wow: dev, train

zero-shot_re: dev, train

dump(path): Dump the dataclass to a YAML file.

dumps(): Dump the dataclass to a YAML string.

classmethod load(path): Load the dataclass from a YAML file.

classmethod loads(s): Load the dataclass from a YAML string.

class flexrag.datasets.RAGEvalDataset(cfg)

Bases: HFDataset

The dataset for loading RAG evaluation data.

class flexrag.datasets.RAGCorpusDatasetConfig(file_paths, data_ranges=<factory>, encoding='utf-8', saving_fields=<factory>, id_field=None, processors=<factory>)

Bases: LineDelimitedDatasetConfig

The configuration for RAGCorpusDataset. This dataset helps to load the pre-processed corpus data for RAG retrieval. The __iter__ method will yield Context objects.

Parameters:

saving_fields (list[str]) – The fields to save in the context. If not specified, all fields will be saved.
id_field (Optional[str]) – The field to use as the context_id. If not specified, the ordinal number will be used.
processors (dict[str, TextProcessPipelineConfig]) – The preprocessors for each field. Default is {}. The key is the field name, and the value is the TextProcessPipelineConfig.
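
The effect of saving_fields and id_field can be pictured with the following standard-library sketch. The `to_contexts` helper and the dict layout with `context_id` and `data` keys are hypothetical, since the real dataset yields Context objects, and starting the ordinal at 0 is an assumption.

```python
import json
from typing import Iterator, Optional


def to_contexts(
    lines: list[str],
    saving_fields: Optional[list[str]] = None,
    id_field: Optional[str] = None,
) -> Iterator[dict]:
    """Turn JSON lines into simple context dicts (hypothetical layout)."""
    for ordinal, line in enumerate(lines):
        record = json.loads(line)
        # Use the chosen field as the ID, else fall back to the ordinal number.
        context_id = str(record[id_field]) if id_field else str(ordinal)
        # An empty or missing saving_fields keeps every field.
        names = saving_fields or list(record)
        data = {name: record[name] for name in names if name in record}
        yield {"context_id": context_id, "data": data}


corpus_lines = [
    '{"doc": "d7", "title": "Paris", "text": "Capital of France.", "url": "u1"}',
    '{"doc": "d9", "title": "Rome", "text": "Capital of Italy.", "url": "u2"}',
]
picked = list(to_contexts(corpus_lines, saving_fields=["title", "text"], id_field="doc"))
fallback = list(to_contexts(corpus_lines))
print(picked[0])
print(fallback[1]["context_id"])
```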

For example, to load the corpus provided by Atlas, you can download the corpus by running the following commands:

wget https://dl.fbaipublicfiles.com/atlas/corpora/wiki/enwiki-dec2021/text-list-100-sec.jsonl
wget https://dl.fbaipublicfiles.com/atlas/corpora/wiki/enwiki-dec2021/infobox.jsonl

Then you can use the following code to load the corpus with a length filter:

from flexrag.datasets import RAGCorpusDataset, RAGCorpusDatasetConfig
from flexrag.text_process import TextProcessPipelineConfig, LengthFilterConfig

cfg = RAGCorpusDatasetConfig(
    file_paths=[
        "infobox.jsonl",
        "text-list-100-sec.jsonl",
    ],
    saving_fields=["title", "text"],
    processors={
        "text": TextProcessPipelineConfig(
            processor_type=["length_filter"],
            length_filter_config=LengthFilterConfig(
                max_chars=4096,
                min_chars=10,
            ),
        )
    },
    encoding="utf-8",
)
dataset = RAGCorpusDataset(cfg)

The above code loads the corpus data from the provided files and preprocesses the text field with a length filter. Any text shorter than 10 characters or longer than 4096 characters is filtered out.
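
The filtering rule itself is simple enough to sketch in a few lines. The `length_filter` function below is a hypothetical standalone equivalent of the rule described above, not the processor shipped in flexrag.text_process.

```python
from typing import Iterable, Iterator


def length_filter(texts: Iterable[str], min_chars: int = 10,
                  max_chars: int = 4096) -> Iterator[str]:
    """Keep only texts with min_chars <= len(text) <= max_chars."""
    for text in texts:
        if min_chars <= len(text) <= max_chars:
            yield text


samples = ["too short", "long enough to keep", "x" * 5000]
kept = list(length_filter(samples))
print(kept)
```

Here the first sample has 9 characters and the last has 5000, so only the middle one survives.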

dump(path): Dump the dataclass to a YAML file.

dumps(): Dump the dataclass to a YAML string.

classmethod load(path): Load the dataclass from a YAML file.

classmethod loads(s): Load the dataclass from a YAML string.

class flexrag.datasets.RAGCorpusDataset(cfg)

Bases: LineDelimitedDataset

The dataset for loading pre-processed corpus data for RAG retrieval.

class flexrag.datasets.IREvalData(question, contexts=None, meta_data=<factory>)

The dataclass for Information Retrieval evaluation data.

Parameters:

question (str) – The question for evaluation. Required.
contexts (Optional[list[Context]]) – The contexts related to the question. Default: None.
meta_data (dict) – The metadata of the evaluation data. Default: {}.

dump(path): Dump the dataclass to a YAML file.

dumps(): Dump the dataclass to a YAML string.

classmethod load(path): Load the dataclass from a YAML file.

classmethod loads(s): Load the dataclass from a YAML string.

class flexrag.datasets.MTEBDatasetConfig(data_path, subset, encoding='utf-8', load_corpus=False)

Bases: object

Configuration for loading MTEB Retrieval Dataset. The __getitem__ method will return IREvalData objects.

For example, to load the NQ dataset, you can download it by running the following commands:

git lfs install
git clone https://huggingface.co/datasets/mteb/nq nq

Then you can use the following code to load the dataset:

>>> config = MTEBDatasetConfig(
...     data_path="nq",
...     subset="test",
...     load_corpus=False,
... )
>>> dataset = MTEBDataset(config)

Parameters:

data_path (str) – Path to the data directory. Required.
subset (str) – Subset of the dataset to load. Required.
encoding (str) – Encoding of the data files. Default is 'utf-8'.
load_corpus (bool) – Whether to load the corpus data. Default is False.

dump(path): Dump the dataclass to a YAML file.

dumps(): Dump the dataclass to a YAML string.

classmethod load(path): Load the dataclass from a YAML file.

classmethod loads(s): Load the dataclass from a YAML string.

class flexrag.datasets.MTEBDataset(config)

Bases: MappingDataset[IREvalData]

Dataset for loading MTEB Retrieval Dataset.
