Datasets
This module provides a set of classes and functions for loading and processing datasets.
- class flexrag.datasets.IterableDataset
Bases: Iterable[ItemTypeI], Generic[ItemTypeI]
IterableDataset is a base class for datasets that can be iterated over.
Subclasses of IterableDataset should implement the following method:
>>> # return an iterator over the items in the dataset.
>>> def __iter__(self) -> Iterator[ItemTypeI]: ...
The following method is implemented automatically:
>>> # concatenate multiple IterableDatasets.
>>> def __add__(self, other: IterableDataset[ItemTypeI]) -> IterableDataset[ItemTypeI]: ...
For example:
>>> from typing import Iterator
>>> from flexrag.datasets import IterableDataset
>>> class MyDataset(IterableDataset[int]):
...     def __init__(self, n: int):
...         self.n = n
...
...     def __iter__(self) -> Iterator[int]:
...         for i in range(self.n):
...             yield i
...
>>> dataset = MyDataset(3)
>>> # Iterate over the dataset.
>>> for item in dataset:
...     print(item)
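The automatically provided `__add__` can be pictured as lazy chaining. The sketch below is a self-contained stand-in and not FlexRAG's implementation: `SketchIterable`, `SketchChain`, and `Count` are hypothetical names, and the only assumption is that adding two datasets gives a dataset that iterates over the first and then the second.

```python
from itertools import chain
from typing import Generic, Iterable, Iterator, TypeVar

T = TypeVar("T")


class SketchIterable(Generic[T]):
    """Hypothetical stand-in for an iterable dataset base class."""

    def __iter__(self) -> Iterator[T]:
        raise NotImplementedError

    def __add__(self, other: "SketchIterable[T]") -> "SketchIterable[T]":
        # Concatenation is lazy: nothing is read until iteration starts.
        return SketchChain(self, other)


class SketchChain(SketchIterable[T]):
    """Iterates over each wrapped dataset in turn."""

    def __init__(self, *datasets: Iterable[T]) -> None:
        self.datasets = datasets

    def __iter__(self) -> Iterator[T]:
        return chain.from_iterable(self.datasets)


class Count(SketchIterable[int]):
    """Yields 0, 1, ..., n - 1."""

    def __init__(self, n: int) -> None:
        self.n = n

    def __iter__(self) -> Iterator[int]:
        return iter(range(self.n))


combined = Count(2) + Count(3)
items = list(combined)  # items from the first dataset, then the second
```

Because the chained object only stores references to its parts, it can be iterated more than once as long as each part can.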
- class flexrag.datasets.MappingDataset
Bases: Generic[ItemTypeM]
MappingDataset is a base class for datasets that can be indexed by integers.
Subclasses of MappingDataset should implement the following methods:
>>> # return the item at the given index.
>>> def __getitem__(self, index: int) -> ItemTypeM: ...
>>> # return the number of items in the dataset.
>>> def __len__(self) -> int: ...
The following methods are implemented automatically:
>>> # concatenate multiple MappingDatasets.
>>> def __add__(self, other: MappingDataset[ItemTypeM]) -> MappingDataset[ItemTypeM]: ...
>>> # return whether the dataset contains the given index.
>>> def __contains__(self, key: int) -> bool: ...
>>> # return an iterator over the items in the dataset.
>>> def __iter__(self) -> Iterator[ItemTypeM]: ...
For example:
>>> from flexrag.datasets import MappingDataset
>>> class MyDataset(MappingDataset[int]):
...     def __init__(self, n: int):
...         self.n = n
...
...     def __getitem__(self, index: int) -> int:
...         if 0 <= index < self.n:
...             return index
...         raise IndexError(f"Index {index} out of range.")
...
...     def __len__(self) -> int:
...         return self.n
...
>>> dataset = MyDataset(3)
>>> for i in range(len(dataset)):
...     print(dataset[i])
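To illustrate how `__contains__` and `__iter__` can follow from `__getitem__` and `__len__` alone, here is a minimal sketch. `SketchMapping` and `Squares` are hypothetical names, and the sketch assumes that only non-negative indices are valid; the real base class may differ in these details.

```python
from typing import Generic, Iterator, TypeVar

T = TypeVar("T")


class SketchMapping(Generic[T]):
    """Hypothetical stand-in: subclasses supply __getitem__ and __len__."""

    def __getitem__(self, index: int) -> T:
        raise NotImplementedError

    def __len__(self) -> int:
        raise NotImplementedError

    def __contains__(self, key: int) -> bool:
        # Assumption: an index is valid when it falls inside [0, len).
        return 0 <= key < len(self)

    def __iter__(self) -> Iterator[T]:
        # Iteration visits every index in order.
        for i in range(len(self)):
            yield self[i]


class Squares(SketchMapping[int]):
    def __init__(self, n: int) -> None:
        self.n = n

    def __getitem__(self, index: int) -> int:
        if 0 <= index < self.n:
            return index * index
        raise IndexError(f"Index {index} out of range.")

    def __len__(self) -> int:
        return self.n


squares = Squares(4)
values = list(squares)    # every item, in index order
has_three = 3 in squares  # 3 is a valid index
has_nine = 9 in squares   # membership tests indices, not values
```

Note that `in` checks indices here, matching the documented behaviour of `__contains__`, so `9 in squares` is false even though 9 is one of the stored values.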
- class flexrag.datasets.ChainDataset(*datasets)
Bases: IterableDataset[ItemTypeChain]
ChainDataset concatenates multiple IterableDatasets.
- class flexrag.datasets.ConcatDataset(*datasets)
Bases: MappingDataset[ItemTypeConcat]
ConcatDataset concatenates multiple MappingDatasets.
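One common way to concatenate indexable datasets is to keep running totals of their lengths and use a binary search to find which dataset owns a global index. The sketch below shows that technique with a hypothetical `SketchConcat` class; it is an assumption about how such a class could work, not a description of ConcatDataset's internals.

```python
from bisect import bisect_right
from itertools import accumulate
from typing import Sequence


class SketchConcat:
    """Hypothetical stand-in that presents several sequences as one."""

    def __init__(self, *datasets: Sequence) -> None:
        self.datasets = datasets
        # Running totals of lengths, e.g. sizes (2, 3) give offsets [2, 5].
        self.offsets = list(accumulate(len(d) for d in datasets))

    def __len__(self) -> int:
        return self.offsets[-1] if self.offsets else 0

    def __getitem__(self, index: int):
        if not 0 <= index < len(self):
            raise IndexError(f"Index {index} out of range.")
        # Find the first dataset whose running total exceeds the index.
        which = bisect_right(self.offsets, index)
        start = self.offsets[which - 1] if which > 0 else 0
        return self.datasets[which][index - start]


merged = SketchConcat(["a", "b"], ["c", "d", "e"])
all_items = [merged[i] for i in range(len(merged))]
```

The lookup costs a logarithmic number of comparisons in the number of datasets, and no items are copied when the datasets are combined.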
- class flexrag.datasets.HFDatasetConfig(path, name=None, data_dir=None, data_files=<factory>, split=None, cache_dir=None, token=None, trust_remote_code=False)
Bases: object
The configuration for HFDataset. HFDataset is a wrapper class that uses the load_dataset function from the HuggingFace datasets library to load the dataset.
- Parameters:
path (str) – Path or name of the dataset.
name (Optional[str]) – The name of the dataset configuration.
data_dir (Optional[str]) – The data_dir of the dataset configuration.
data_files (list[str]) – Paths to source data files.
split (Optional[str]) – Which split of the data to load.
cache_dir (Optional[str]) – Directory to read/write data.
token (Optional[str]) – Optional string or boolean to use as Bearer token for remote files on the Datasets Hub.
trust_remote_code (bool) – Whether or not to allow for datasets defined on the Hub using a dataset script.
For example, you can load a dataset from the HuggingFace Hub by running the following code:
>>> from flexrag.datasets import HFDataset, HFDatasetConfig
>>> cfg = HFDatasetConfig(
...     path="mteb/nq",
...     split="test",
... )
>>> dataset = HFDataset(cfg)
You can also load a dataset from local files by setting path to the file format and listing the files in data_files:
>>> cfg = HFDatasetConfig(
...     path="json",
...     data_files=["path/to/local/my_dataset.json"],
... )
>>> dataset = HFDataset(cfg)
For more information about the parameters, please refer to the HuggingFace datasets documentation: https://huggingface.co/docs/datasets/main/en/package_reference/loading_methods#datasets.load_dataset
- dump(path)
Dump the dataclass to a YAML file.
- dumps()
Dump the dataclass to a YAML string.
- classmethod load(path)
Load the dataclass from a YAML file.
- classmethod loads(s)
Load the dataclass from a YAML string.
- class flexrag.datasets.HFDataset(cfg)
Bases: MappingDataset
HFDataset is a dataset that wraps the HuggingFace datasets library.
- class flexrag.datasets.LineDelimitedDatasetConfig(file_paths, data_ranges=<factory>, encoding='utf-8')
Bases: object
The configuration for LineDelimitedDataset.
- Parameters:
file_paths (list[str]) – The paths to the line-delimited files. Unix-style path patterns are supported.
data_ranges (list[list[int, int]]) – The data ranges to load from the files. The format is a list of [start_point, end_point] for each file. If end_point is -1, it will read to the end of the file. If not specified, it will read the whole file.
encoding (str) – The encoding of the files.
Example 1: Loading specific lines from given files.
>>> from flexrag.datasets import LineDelimitedDataset, LineDelimitedDatasetConfig
>>> cfg = LineDelimitedDatasetConfig(
...     file_paths=["data1.jsonl", "data2.csv"],
...     data_ranges=[[0, 10], [0, 20]],
...     encoding="utf-8",
... )
>>> dataset = LineDelimitedDataset(cfg)
>>> items = [i for i in dataset]
Example 2: Loading multiple files using a Unix-style path pattern.
>>> cfg = LineDelimitedDatasetConfig(
...     file_paths=["data/*.jsonl"],
...     encoding="utf-8",
... )
>>> dataset = LineDelimitedDataset(cfg)
>>> items = [i for i in dataset]
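The interplay of path patterns and data ranges can be sketched with the standard library alone. The helper `read_lines` below is hypothetical and rests on two assumptions that the description above does not settle: the range is treated as half-open, `[start_point, end_point)`, and the same range is applied to every file matched by the pattern.

```python
import glob
import os
import tempfile
from itertools import islice
from typing import Iterator, Optional


def read_lines(
    pattern: str,
    data_range: Optional[tuple[int, int]] = None,
    encoding: str = "utf-8",
) -> Iterator[str]:
    """Yield lines from every file matching a Unix-style pattern."""
    start, end = data_range if data_range is not None else (0, -1)
    stop = None if end == -1 else end  # -1 means "read to the end of the file"
    for path in sorted(glob.glob(pattern)):
        with open(path, encoding=encoding) as handle:
            # Assumption: the range is half-open, [start, end).
            for line in islice(handle, start, stop):
                yield line.rstrip("\n")


with tempfile.TemporaryDirectory() as tmp:
    # Create two small files so the sketch runs without external data.
    for name in ("a.jsonl", "b.jsonl"):
        with open(os.path.join(tmp, name), "w", encoding="utf-8") as handle:
            handle.write("".join(f"{name}:{i}\n" for i in range(5)))
    pattern = os.path.join(tmp, "*.jsonl")
    first_two = list(read_lines(pattern, (0, 2)))
    from_three = list(read_lines(pattern, (3, -1)))
    everything = list(read_lines(pattern))
```

Reading through `islice` keeps the loader streaming: lines outside the requested range are skipped without being stored.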
- dump(path)
Dump the dataclass to a YAML file.
- dumps()
Dump the dataclass to a YAML string.
- classmethod load(path)
Load the dataclass from a YAML file.
- classmethod loads(s)
Load the dataclass from a YAML string.
- class flexrag.datasets.LineDelimitedDataset(cfg)
Bases: IterableDataset
The iterable dataset for loading line-delimited files (csv, tsv, jsonl).
- class flexrag.datasets.RAGEvalData(question, golden_contexts=None, golden_answers=None, meta_data=<factory>)
The dataclass for knowledge-intensive QA tasks.
- Parameters:
question (str) – The question for evaluation. Required.
golden_contexts (Optional[list[Context]]) – The contexts related to the question. Default: None.
golden_answers (Optional[list[str]]) – The golden answers for the question. Default: None.
meta_data (dict) – The metadata of the evaluation data. Default: {}.
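The `<factory>` shown for `meta_data` in the signature is how Python renders a dataclass field declared with `default_factory`. The sketch below uses a hypothetical stand-in, `SketchEvalData`, that mirrors the documented fields; as a simplification, plain strings take the place of Context objects.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SketchEvalData:
    """Hypothetical stand-in mirroring the documented fields."""

    question: str
    # Simplification: plain strings stand in for Context objects.
    golden_contexts: Optional[list[str]] = None
    golden_answers: Optional[list[str]] = None
    # default_factory gives every instance its own dict.
    meta_data: dict = field(default_factory=dict)


first = SketchEvalData(
    question="Who wrote Hamlet?",
    golden_answers=["William Shakespeare"],
)
second = SketchEvalData(question="What is the capital of France?")
first.meta_data["source"] = "example"  # does not affect `second`
```

Using a factory rather than a shared `{}` default means metadata written to one item never leaks into another.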
- dump(path)
Dump the dataclass to a YAML file.
- dumps()
Dump the dataclass to a YAML string.
- classmethod load(path)
Load the dataclass from a YAML file.
- classmethod loads(s)
Load the dataclass from a YAML string.
- class flexrag.datasets.RAGEvalDatasetConfig(path='RUC-NLPIR/FlashRAG_datasets', name=None, data_dir=None, data_files=<factory>, split=None, cache_dir=None, token=None, trust_remote_code=False)
Bases: HFDatasetConfig
The configuration for RAGEvalDataset. This dataset helps to load the evaluation datasets collected by FlashRAG. The __iter__ method will yield RAGEvalData objects.
For example, you can load the test set of the NaturalQuestions dataset by running the following code:
from flexrag.datasets import RAGEvalDataset, RAGEvalDatasetConfig

cfg = RAGEvalDatasetConfig(
    name="nq",
    split="test",
)
dataset = RAGEvalDataset(cfg)
You can also load the dataset from a local copy of the repository by specifying the path. For example, you can download the dataset by running the following shell commands:
$ git lfs install
$ git clone https://huggingface.co/datasets/RUC-NLPIR/FlashRAG_datasets flashrag
Then you can load the dataset by running the following code:
from flexrag.datasets import RAGEvalDataset, RAGEvalDatasetConfig

cfg = RAGEvalDatasetConfig(
    path="json",
    data_files=["flashrag/nq/test.jsonl"],
    split="train",
)
dataset = RAGEvalDataset(cfg)
Available datasets include:
2wikimultihopqa: dev, train
ambig_qa: dev, train
arc: dev, test, train
asqa: dev, train
ay2: dev, train
bamboogle: test
boolq: dev, train
commonsenseqa: dev, train
curatedtrec: test, train
eli5: dev, train
fermi: dev, test, train
fever: dev, train
hellaswag: dev, train
hotpotqa: dev, train
mmlu: 5_shot, dev, test, train
msmarco-qa: dev, train
musique: dev, train
narrativeqa: dev, test, train
nq: dev, test, train
openbookqa: dev, test, train
piqa: dev, train
popqa: test
quartz: dev, test, train
siqa: dev, train
squad: dev, train
t-rex: dev, train
triviaqa: dev, test, train
truthful_qa: dev
web_questions: test, train
wikiasp: dev, test, train
wikiqa: dev, test, train
wned: dev
wow: dev, train
zero-shot_re: dev, train
- dump(path)
Dump the dataclass to a YAML file.
- dumps()
Dump the dataclass to a YAML string.
- classmethod load(path)
Load the dataclass from a YAML file.
- classmethod loads(s)
Load the dataclass from a YAML string.
- class flexrag.datasets.RAGEvalDataset(cfg)
Bases: HFDataset
The dataset for loading RAG evaluation data.
- class flexrag.datasets.RAGCorpusDatasetConfig(file_paths, data_ranges=<factory>, encoding='utf-8', saving_fields=<factory>, id_field=None, processors=<factory>)
Bases: LineDelimitedDatasetConfig
The configuration for RAGCorpusDataset. This dataset helps to load the pre-processed corpus data for RAG retrieval. The __iter__ method will yield Context objects.
- Parameters:
saving_fields (list[str]) – The fields to save in the context. If not specified, all fields will be saved.
id_field (Optional[str]) – The field to use as the context_id. If not specified, the ordinal number will be used.
processors (dict[str, TextProcessPipelineConfig]) – The preprocessors for each field. Default is {}. The key is the field name, and the value is the TextProcessPipelineConfig.
For example, to load the corpus provided by Atlas, you can download it by running the following commands:
wget https://dl.fbaipublicfiles.com/atlas/corpora/wiki/enwiki-dec2021/text-list-100-sec.jsonl
wget https://dl.fbaipublicfiles.com/atlas/corpora/wiki/enwiki-dec2021/infobox.jsonl
Then you can use the following code to load the corpus with a length filter:
from flexrag.datasets import RAGCorpusDataset, RAGCorpusDatasetConfig
from flexrag.text_process import TextProcessPipelineConfig, LengthFilterConfig

cfg = RAGCorpusDatasetConfig(
    # Paths to the two files fetched by the wget commands; adjust to your download location.
    file_paths=[
        "infobox.jsonl",
        "text-list-100-sec.jsonl",
    ],
    saving_fields=["title", "text"],
    processors={
        "text": TextProcessPipelineConfig(
            processor_type=["length_filter"],
            length_filter_config=LengthFilterConfig(
                max_chars=4096,
                min_chars=10,
            ),
        )
    },
    encoding="utf-8",
)
dataset = RAGCorpusDataset(cfg)
The above code loads the corpus data from the provided files and preprocesses the text field with a length filter. Any text shorter than 10 characters or longer than 4096 characters is filtered out.
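The combined effect of `saving_fields`, `id_field`, and the length filter can be sketched on in-memory records. Everything in this block is hypothetical: `build_contexts` is not a FlexRAG function, the output dict layout (`context_id`, `data`) only stands in for Context objects, and the sketch assumes that ordinal numbers count records before filtering.

```python
from typing import Iterable, Iterator, Optional


def build_contexts(
    records: Iterable[dict],
    saving_fields: Optional[list[str]] = None,
    id_field: Optional[str] = None,
    min_chars: int = 10,
    max_chars: int = 4096,
) -> Iterator[dict]:
    """Keep selected fields and drop records whose text fails the length check."""
    for ordinal, record in enumerate(records):
        text = record.get("text", "")
        if not min_chars <= len(text) <= max_chars:
            continue  # removed by the length filter
        # An empty or missing field list keeps every field.
        fields = saving_fields or list(record)
        yield {
            # Fall back to the ordinal number when no id field is given.
            "context_id": str(record[id_field]) if id_field else str(ordinal),
            "data": {name: record[name] for name in fields if name in record},
        }


records = [
    {"id": "x1", "title": "Paris", "text": "Paris is the capital of France.", "url": "u1"},
    {"id": "x2", "title": "Stub", "text": "Too short", "url": "u2"},
]
contexts = list(build_contexts(records, saving_fields=["title", "text"], id_field="id"))
by_ordinal = list(build_contexts(records))
```

Here the second record is dropped because its text has only 9 characters, and the `url` field is omitted from the first result because it is not listed in `saving_fields`.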
- dump(path)
Dump the dataclass to a YAML file.
- dumps()
Dump the dataclass to a YAML string.
- classmethod load(path)
Load the dataclass from a YAML file.
- classmethod loads(s)
Load the dataclass from a YAML string.
- class flexrag.datasets.RAGCorpusDataset(cfg)
Bases: LineDelimitedDataset
The dataset for loading pre-processed corpus data for RAG retrieval.
- class flexrag.datasets.IREvalData(question, contexts=None, meta_data=<factory>)
The dataclass for Information Retrieval evaluation data.
- Parameters:
question (str) – The question for evaluation. Required.
contexts (Optional[list[Context]]) – The contexts related to the question. Default: None.
meta_data (dict) – The metadata of the evaluation data. Default: {}.
- dump(path)
Dump the dataclass to a YAML file.
- dumps()
Dump the dataclass to a YAML string.
- classmethod load(path)
Load the dataclass from a YAML file.
- classmethod loads(s)
Load the dataclass from a YAML string.
- class flexrag.datasets.MTEBDatasetConfig(data_path, subset, encoding='utf-8', load_corpus=False)
Bases: object
Configuration for loading an MTEB retrieval dataset. The __getitem__ method returns IREvalData objects.
For example, to load the NQ dataset, you can download it by running the following shell commands:
$ git lfs install
$ git clone https://huggingface.co/datasets/mteb/nq nq
Then you can use the following code to load the test subset:
>>> from flexrag.datasets import MTEBDataset, MTEBDatasetConfig
>>> config = MTEBDatasetConfig(
...     data_path="nq",
...     subset="test",
...     load_corpus=False,
... )
>>> dataset = MTEBDataset(config)
- Parameters:
data_path (str) – Path to the data directory. Required.
subset (str) – Subset of the dataset to load. Required.
encoding (str) – Encoding of the data files. Default is 'utf-8'.
load_corpus (bool) – Whether to load the corpus data. Default is False.
- dump(path)
Dump the dataclass to a YAML file.
- dumps()
Dump the dataclass to a YAML string.
- classmethod load(path)
Load the dataclass from a YAML file.
- classmethod loads(s)
Load the dataclass from a YAML string.
- class flexrag.datasets.MTEBDataset(config)
Bases: MappingDataset[IREvalData]
Dataset for loading an MTEB retrieval dataset.