Retrievers#

Retrievers are used to retrieve data from the local knowledge base or the web.

The Retriever Interface#

RetrieverBase is the base class for all retrievers, including EditableRetriever, WebRetrieverBase, and their subclasses.

class flexrag.retriever.RetrieverBaseConfig(log_interval=100, top_k=10, batch_size=32, query_preprocess_pipeline=<factory>)[source]#

Base configuration class for all retrievers.

Parameters:

log_interval (int) – The interval of logging. Default: 100.
top_k (int) – The number of retrieved documents. Default: 10.
batch_size (int) – The batch size for retrieval. Default: 32.
query_preprocess_pipeline (TextProcessPipelineConfig) – The text process pipeline for query. Default: TextProcessPipelineConfig.

dump(path)#: Dump the dataclass to a YAML file.

dumps()#: Dump the dataclass to a YAML string.

classmethod load(path)#: Load the dataclass from a YAML file.

classmethod loads(s)#: Load the dataclass from a YAML string.

class flexrag.retriever.RetrieverBase(cfg)[source]#

The base class for all retrievers. The subclasses should implement the search method and the fields property.

async async_search(query, **search_kwargs)[source]#: Search queries asynchronously.

abstract property fields#: The fields of the retrieved data.

abstract search(query, **search_kwargs)[source]#

Search a batch of queries.

Parameters:

query (list[Any] | Any) – Queries to search.
search_kwargs (Any) – Keyword arguments, contains other search arguments.

Returns:

A batch of lists, each containing k RetrievedContext objects.

Return type:

list[list[RetrievedContext]]

test_speed(sample_num=10000, test_times=10, **search_kwargs)[source]#

Test the speed of the retriever.

Parameters:

sample_num (int, optional) – The number of samples to test.
test_times (int, optional) – The number of times to test.

Returns:

The time consumed for retrieval.

Return type:

float
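
The relationship between search and async_search can be illustrated with a small stand-alone sketch. The classes below (ToyRetrieverBase, EchoRetriever) are hypothetical stand-ins rather than FlexRAG classes, and delegating to a worker thread with asyncio.to_thread is only an assumption about one reasonable way to wrap a blocking search; it is not a description of how FlexRAG implements async_search.

```python
import asyncio
from abc import ABC, abstractmethod
from typing import Any


class ToyRetrieverBase(ABC):
    """Hypothetical stand-in that mirrors the shape of the documented interface."""

    @property
    @abstractmethod
    def fields(self) -> list:
        """The fields of the retrieved data."""

    @abstractmethod
    def search(self, query: Any, **search_kwargs: Any) -> list:
        """Return one list of results per query."""

    async def async_search(self, query: Any, **search_kwargs: Any) -> list:
        # Assumption: run the blocking search in a worker thread.
        return await asyncio.to_thread(self.search, query, **search_kwargs)


class EchoRetriever(ToyRetrieverBase):
    """Toy subclass: returns top_k placeholder results for every query."""

    @property
    def fields(self) -> list:
        return ["text"]

    def search(self, query: Any, **search_kwargs: Any) -> list:
        # Accept either a single query or a batch of queries.
        queries = query if isinstance(query, list) else [query]
        top_k = search_kwargs.get("top_k", 2)
        return [[{"text": f"{q} #{i}"} for i in range(top_k)] for q in queries]


results = asyncio.run(EchoRetriever().async_search(["a", "b"], top_k=3))
```

The result keeps the documented batch shape: one inner list per query, each holding top_k items.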

class flexrag.retriever.RetrieverConfig(retriever_type=None, elastic_config=<factory>, flex_config=<factory>, hyde_config=<factory>, typesense_config=<factory>, simple_web_config=<factory>, wikipedia_config=<factory>)#

Configuration class for retriever (name: RetrieverConfig, default: None).

Parameters:

retriever_type (str) – The retriever type to use.
elastic_config (ElasticRetrieverConfig) – The config for ElasticRetriever.
flex_config (FlexRetrieverConfig) – The config for FlexRetriever.
hyde_config (HydeRetrieverConfig) – The config for HydeRetriever.
typesense_config (TypesenseRetrieverConfig) – The config for TypesenseRetriever.
simple_web_config (SimpleWebRetrieverConfig) – The config for SimpleWebRetriever.
wikipedia_config (WikipediaRetrieverConfig) – The config for WikipediaRetriever.

RetrieverConfig is the general configuration for all registered retrievers. You can load any retriever by setting retriever_type in the configuration. For example, to load a pre-built FlexRetriever, you can use the following configuration:

from flexrag.retriever import RetrieverConfig, RETRIEVERS, FlexRetrieverConfig

config = RetrieverConfig(
    retriever_type='flex',
    flex_config=FlexRetrieverConfig(
        retriever_path='<path_to_retriever>',
    )
)
retriever = RETRIEVERS.load(config)

Editable Retriever#

class flexrag.retriever.EditableRetrieverConfig(log_interval=100, top_k=10, batch_size=32, query_preprocess_pipeline=<factory>)[source]#

Configuration class for EditableRetriever.

dump(path)#: Dump the dataclass to a YAML file.

dumps()#: Dump the dataclass to a YAML string.

classmethod load(path)#: Load the dataclass from a YAML file.

classmethod loads(s)#: Load the dataclass from a YAML string.

class flexrag.retriever.EditableRetriever(cfg)[source]#

Bases: RetrieverBase

The base class for all editable retrievers. In FlexRAG, an EditableRetriever is a retriever that provides the add_passages and clear methods, allowing you to build the retriever from your own knowledge base. FlexRAG provides the following editable retrievers: FlexRetriever, ElasticRetriever, TypesenseRetriever, and HydeRetriever. Subclasses should implement the add_passages, clear, and __len__ methods.

abstract add_passages(passages)[source]#

Add passages to the retriever database.

Parameters:: passages (Iterable[Context]) – The passages to add.
Returns:: None

abstract clear()[source]#: Clear the retriever database.
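
To make the editable contract concrete, here is a minimal in-memory sketch. InMemoryRetriever is a hypothetical class that does not inherit from FlexRAG's EditableRetriever; it uses plain dicts in place of Context and RetrievedContext, and word overlap in place of a real ranking function. It only shows how add_passages, clear, __len__, and search fit together.

```python
from typing import Any, Iterable


class InMemoryRetriever:
    """Hypothetical editable retriever: keeps passages in a list, ranks by word overlap."""

    def __init__(self, top_k: int = 10) -> None:
        self.top_k = top_k
        self._passages = []

    def add_passages(self, passages: Iterable[dict]) -> None:
        # Append the new passages to the in-memory store.
        self._passages.extend(passages)

    def clear(self) -> None:
        # Drop every stored passage.
        self._passages.clear()

    def __len__(self) -> int:
        return len(self._passages)

    def search(self, query: Any, **search_kwargs: Any) -> list:
        queries = query if isinstance(query, list) else [query]
        top_k = search_kwargs.get("top_k", self.top_k)
        batch = []
        for q in queries:
            q_words = set(q.lower().split())
            scored = [
                (len(q_words & set(p["text"].lower().split())), p)
                for p in self._passages
            ]
            # Sort by overlap only, so the passage dicts are never compared.
            scored.sort(key=lambda pair: pair[0], reverse=True)
            batch.append([p for score, p in scored[:top_k] if score > 0])
        return batch


retriever = InMemoryRetriever(top_k=1)
retriever.add_passages(
    [
        {"text": "Paris is the capital of France"},
        {"text": "Berlin is in Germany"},
    ]
)
hits = retriever.search("capital of France")
```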

class flexrag.retriever.ElasticRetrieverConfig(log_interval=100, top_k=10, batch_size=32, query_preprocess_pipeline=<factory>, host='http://localhost:9200', api_key=None, index_name=None, custom_properties=None, verbose=False, retry_times=3, retry_delay=0.5)[source]#

Configuration class for ElasticRetriever.

Parameters:

host (str) – Host of the ElasticSearch server. Default: “http://localhost:9200”.
api_key (Optional[str]) – API key for the ElasticSearch server. Default: None.
index_name (str) – Name of the index. Required.
custom_properties (Optional[dict]) – Custom properties for building the index. Default: None.
verbose (bool) – Enable verbose logging mode. Default: False.
retry_times (int) – Number of retry times. Default: 3.
retry_delay (float) – Delay time for retry. Default: 0.5.

dump(path)#: Dump the dataclass to a YAML file.

dumps()#: Dump the dataclass to a YAML string.

classmethod load(path)#: Load the dataclass from a YAML file.

classmethod loads(s)#: Load the dataclass from a YAML string.

class flexrag.retriever.ElasticRetriever(cfg)[source]#

Bases: EditableRetriever

add_passages(**kwargs)#

Add passages to the retriever database.

Parameters:: passages (Iterable[Context]) – The passages to add.
Returns:: None

clear()[source]#: Clear the retriever database.

property fields#: The fields of the retrieved data.

search(**kwargs)#

Search a batch of queries.

Parameters:

query (list[Any] | Any) – Queries to search.
search_kwargs (Any) – Keyword arguments, contains other search arguments.

Returns:

A batch of lists, each containing k RetrievedContext objects.

Return type:

list[list[RetrievedContext]]

class flexrag.retriever.TypesenseRetrieverConfig(log_interval=100, top_k=10, batch_size=32, query_preprocess_pipeline=<factory>, host='localhost', port=8108, protocol='http', api_key=None, index_name=None, timeout=200.0)[source]#

Configuration class for TypesenseRetriever.

Parameters:

host (str) – Host of the Typesense server. Default: “localhost”.
port (int) – Port of the Typesense server. Default: 8108.
protocol (str) – Protocol of the Typesense server. Default: “http”. Available options: “https”, “http”.
api_key (str) – API key for the Typesense server. Required.
index_name (str) – Name of the Typesense collection. Required.
timeout (float) – Timeout for the connection. Default: 200.0.

dump(path)#: Dump the dataclass to a YAML file.

dumps()#: Dump the dataclass to a YAML string.

classmethod load(path)#: Load the dataclass from a YAML file.

classmethod loads(s)#: Load the dataclass from a YAML string.

class flexrag.retriever.TypesenseRetriever(cfg)[source]#

Bases: EditableRetriever

add_passages(**kwargs)#

Add passages to the retriever database.

Parameters:: passages (Iterable[Context]) – The passages to add.
Returns:: None

clear()[source]#: Clear the retriever database.

property fields#: The fields of the retrieved data.

search(**kwargs)#

Search a batch of queries.

Parameters:

query (list[Any] | Any) – Queries to search.
search_kwargs (Any) – Keyword arguments, contains other search arguments.

Returns:

A batch of lists, each containing k RetrievedContext objects.

Return type:

list[list[RetrievedContext]]

class flexrag.retriever.LocalRetrieverConfig(log_interval=100, top_k=10, batch_size=32, query_preprocess_pipeline=<factory>, retriever_path=None)[source]#

The configuration class for LocalRetriever.

Parameters:: retriever_path (Optional[str]) – The path to the local database. Default: None. If specified, all modifications to the retriever are written to disk as they are made. If not specified, the retriever is kept in memory.

dump(path)#: Dump the dataclass to a YAML file.

dumps()#: Dump the dataclass to a YAML string.

classmethod load(path)#: Load the dataclass from a YAML file.

classmethod loads(s)#: Load the dataclass from a YAML string.

class flexrag.retriever.LocalRetriever(cfg)[source]#

Bases: EditableRetriever

The base class for all local retrievers.

In FlexRAG, the LocalRetriever is a concept referring to a retriever that can be saved to the local disk. The subclasses provide the save_to_local and load_from_local methods to save and load the retriever from the local disk, and the save_to_hub and load_from_hub methods to save and load the retriever from the HuggingFace Hub.

FlexRAG provides the following local retrievers: FlexRetriever and HydeRetriever.

For example, to load a retriever hosted on the HuggingFace Hub, you can run the following code:

from flexrag.retriever import LocalRetriever

retriever = LocalRetriever.load_from_hub("flexrag/wiki2021_atlas_bm25s")

To save a retriever to the HuggingFace Hub, you can run the following code:

retriever.save_to_hub("<your-repo-id>", token="<your-token>")

abstract detach()[source]#: Detach the retriever from the local database. After detaching, the retriever will be kept in memory and all modifications will not be applied to the disk.

static load_from_hub(repo_id, revision=None, token=None, cache_dir='/home/docs/.cache/flexrag', **kwargs)[source]#

Load a retriever from the HuggingFace Hub.

Parameters:

repo_id (str) – The repo id of the retriever on the HuggingFace Hub.
revision (str) – The revision of the retriever on the HuggingFace Hub. Default: None.
token (str) – The token to access the HuggingFace Hub. Default: None.
cache_dir (str) – The cache directory to store the retriever. Default: FLEXRAG_CACHE_DIR.
kwargs (Any) – Additional arguments for the retriever.

Returns:

The loaded retriever.

Return type:

LocalRetriever

static load_from_local(repo_path=None, **kwargs)[source]#

Load a retriever from the local disk.

Parameters:: repo_path (str) – The path to the local database. Default: None.
Returns:: The loaded retriever.
Return type:: LocalRetriever

save_to_hub(repo_id, token=None, commit_message='Update FlexRAG retriever', retriever_card=None, private=False, **kwargs)[source]#

Save the retriever to the HuggingFace Hub.

Parameters:

repo_id (str) – The repo id of the retriever on the HuggingFace Hub.
token (str) – The token to access the HuggingFace Hub. Default: None.
commit_message (str) – The commit message for the retriever. Default: “Update FlexRAG retriever”.
retriever_card (str) – The markdown readme file for the retriever. Default: None.
private (bool) – Whether to create a private repo. Default: False.
kwargs (Any) – Additional arguments for uploading the retriever.

Returns:

The repo url of the retriever.

Return type:

str

abstract save_to_local(retriever_path=None)[source]#

Save the retriever to the local disk.

Parameters:: retriever_path (str) – The path to the local database. Default: None.
Returns:: None
Return type:: None

class flexrag.retriever.FlexRetrieverConfig(log_interval=100, top_k=10, batch_size=32, query_preprocess_pipeline=<factory>, retriever_path=None, indexes_merge_method='rrf', indexes_merge_weights=None, used_indexes=None, rrf_base=60)[source]#

Configuration class for FlexRetriever.

Parameters:

indexes_merge_method (str) – Method to merge the scores of multiple indexes. Available choices are “rrf” and “linear”. Default is “rrf”. “rrf” applies Reciprocal Rank Fusion (RRF); “linear” takes a linear combination of the scores.
indexes_merge_weights (Optional[list[float]]) – List of weights for each index. Default is None. If None, all indexes will be treated equally. This option is used in both “rrf” and “linear” methods.
used_indexes (Optional[list[str]]) – List of indexes to use for retrieval. Default is None. If None, all indexes will be used.
rrf_base (int) – Base for the RRF method. Default is 60. This option is only used when indexes_merge_method is “rrf”.
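
The two merge methods can be sketched in a few lines of plain Python. This is an illustration of the general techniques, not FlexRAG's code: the function names are hypothetical, ranks are assumed to start at 1, and the way the weights enter the RRF sum (as a multiplier on each reciprocal-rank term) is an assumption.

```python
from collections import defaultdict


def rrf_merge(rankings, weights=None, rrf_base=60):
    """Reciprocal Rank Fusion: score(d) = sum_i w_i / (rrf_base + rank_i(d))."""
    weights = weights or [1.0] * len(rankings)
    fused = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        # Assumption: ranks are 1-based.
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += weight / (rrf_base + rank)
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


def linear_merge(score_maps, weights=None):
    """Linear combination: score(d) = sum_i w_i * score_i(d)."""
    weights = weights or [1.0] * len(score_maps)
    fused = defaultdict(float)
    for scores, weight in zip(score_maps, weights):
        for doc_id, score in scores.items():
            fused[doc_id] += weight * score
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


# Two hypothetical indexes that rank the same three documents differently.
bm25_ranking = ["d1", "d2", "d3"]
dense_ranking = ["d2", "d3", "d1"]
fused_rrf = rrf_merge([bm25_ranking, dense_ranking])

fused_linear = linear_merge(
    [{"d1": 1.0, "d2": 0.5}, {"d2": 1.0}],
    weights=[0.5, 0.5],
)
```

RRF only looks at ranks, so it is insensitive to the scale of each index's scores; the linear method uses the raw scores and therefore usually needs comparable score ranges across indexes.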

dump(path)#: Dump the dataclass to a YAML file.

dumps()#: Dump the dataclass to a YAML string.

classmethod load(path)#: Load the dataclass from a YAML file.

classmethod loads(s)#: Load the dataclass from a YAML string.

class flexrag.retriever.FlexRetriever(cfg)[source]#

Bases: LocalRetriever

FlexRetriever is a retriever implemented by the FlexRAG team. FlexRetriever supports multi-index and multi-field retrieval.

add_passages(**kwargs)#

Add passages to the retriever database.

Parameters:: passages (Iterable[Context]) – The passages to add.
Returns:: None

clear()[source]#: Clear the retriever database.

detach()[source]#: Detach the retriever from the local disk to memory. This function will not delete the database or the indexes.

property fields#: The fields of the retrieved data.

remove_index(index_name)[source]#

Remove an index from the retriever.

Parameters:: index_name (str) – Name of the index.
Raises:: ValueError – If the index name does not exist.
Returns:: None
Return type:: None

save_to_local(retriever_path=None)[source]#

Save the retriever to the local disk.

Parameters:: retriever_path (str) – The path to the local database. Default: None.
Returns:: None
Return type:: None

search(**kwargs)#

Search a batch of queries.

Parameters:

query (list[Any] | Any) – Queries to search.
search_kwargs (Any) – Keyword arguments, contains other search arguments.

Returns:

A batch of lists, each containing k RetrievedContext objects.

Return type:

list[list[RetrievedContext]]

class flexrag.retriever.HydeRetrieverConfig(generator_type=None, anthropic_config=<factory>, hf_config=<factory>, hf_vlm_config=<factory>, ollama_config=<factory>, openai_config=<factory>, vllm_config=<factory>, log_interval=100, top_k=10, batch_size=32, query_preprocess_pipeline=<factory>, retriever_path=None, indexes_merge_method='rrf', indexes_merge_weights=None, used_indexes=None, rrf_base=60, task='WEB_SEARCH', language='en')#

Configuration class for HydeRetriever.

Parameters:

task (str) – Task for rewriting the query. Default: “WEB_SEARCH”. Available options: “WEB_SEARCH”, “SCIFACT”, “ARGUANA”, “TREC_COVID”, “FIQA”, “DBPEDIA_ENTITY”, “TREC_NEWS”, “MR_TYDI”.
language (str) – Language for rewriting. Default: “en”.

dump(path)#: Dump the dataclass to a YAML file.

dumps()#: Dump the dataclass to a YAML string.

classmethod load(path)#: Load the dataclass from a YAML file.

classmethod loads(s)#: Load the dataclass from a YAML string.

class flexrag.retriever.HydeRetriever(cfg, no_check=False)#

Bases: FlexRetriever

HydeRetriever is a retriever that rewrites the query before searching.

The original paper is available at https://aclanthology.org/2023.acl-long.99/.
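
The query-rewriting idea can be sketched without any model or index. Everything below is hypothetical: fake_generate stands in for an LLM call, keyword_search stands in for the underlying retriever, and keeping the original query inside the rewritten text is an assumption rather than documented HydeRetriever behavior.

```python
CORPUS = [
    "the nile is a river in africa",
    "paris is the capital of france",
]


def fake_generate(query):
    # Stand-in for an LLM that writes a hypothetical answer passage.
    return f"{query} The Eiffel Tower is located in Paris, France."


def keyword_search(text, top_k):
    # Stand-in retriever: rank corpus passages by shared words.
    cleaned = text.lower().replace(".", " ").replace(",", " ").replace("?", " ")
    words = set(cleaned.split())
    ranked = sorted(
        CORPUS,
        key=lambda passage: len(words & set(passage.split())),
        reverse=True,
    )
    return ranked[:top_k]


def hyde_search(query, generate, search, top_k=10):
    """Rewrite the query into a hypothetical passage, then search with it."""
    hypothetical_doc = generate(query)
    return search(hypothetical_doc, top_k)


top_hit = hyde_search(
    "Where is the Eiffel Tower?", fake_generate, keyword_search, top_k=1
)[0]
```

The rewritten text shares more vocabulary with the relevant passage than the short question does, which is the effect the rewriting step is meant to produce.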

search(query, **search_kwargs)#

Search a batch of queries.

Parameters:

query (list[Any] | Any) – Queries to search.
search_kwargs (Any) – Keyword arguments, contains other search arguments.

Returns:

A batch of lists, each containing k RetrievedContext objects.

Return type:

list[list[RetrievedContext]]

Retriever Index#

RetrieverIndex is used in FlexRetriever to index and search passages. FlexRAG provides embedding-based indexes (FaissIndex and ScaNNIndex) and the BM25-based BM25Index.

class flexrag.retriever.index.RetrieverIndexBase[source]#

The base class for all retriever indexes. This class provides the basic interface for building, adding, and searching the index.

The subclass should implement the following methods:

build_index: Build the index from the data.
insert: Add a batch of data to the index.
search: Search for the top_k most similar data indices to the query.
serialize: Serialize the index to the disk.
clear: Clear the index and remove the serialized index files.
__len__: Return the number of data in the index.
is_addable: Return whether the index is addable.

abstract build_index(data)[source]#

Build the index. The index will be serialized automatically if the index_path is set.

Parameters:: data (Iterable[Any]) – The data to build the index.
Returns:: None

abstract clear()[source]#: Reset the index and remove the serialized index files.

abstract property infimum#: Return the infimum of the similarity scores for the index.

abstract insert(data, serialize=True)[source]#

Add a batch of data to the index.

Parameters:

data (list[Any]) – The data to add.
serialize (bool) – Whether to serialize the index after adding data. Defaults to True.

Returns:

None

insert_batch(data, batch_size=None, serialize=True)[source]#

Add data to the index in batches. This method will automatically perform the serialize method if the index_path is set.

Parameters:

data (Iterable[Any]) – The data to add.
batch_size (int) – The batch size to add data to the index. Defaults to self.batch_size.
serialize (bool) – Whether to serialize the index after adding data. Defaults to True.

Returns:

None
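
The split between insert and insert_batch can be sketched as follows. ToyIndex and iter_batches are hypothetical names, and serializing once after the final batch (rather than after every batch) is an assumption about a sensible design, not a statement about FlexRAG's behavior.

```python
from itertools import islice


def iter_batches(data, batch_size):
    """Yield lists of at most batch_size items from any iterable."""
    iterator = iter(data)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


class ToyIndex:
    """Hypothetical index that only records what was inserted."""

    def __init__(self, batch_size=512):
        self.batch_size = batch_size
        self.items = []
        self.serialize_calls = 0

    def insert(self, data, serialize=True):
        self.items.extend(data)
        if serialize:
            self.serialize_calls += 1

    def insert_batch(self, data, batch_size=None, serialize=True):
        batch_size = batch_size or self.batch_size
        for batch in iter_batches(data, batch_size):
            # Defer serialization until every batch has been added.
            self.insert(batch, serialize=False)
        if serialize:
            self.serialize_calls += 1


toy_index = ToyIndex(batch_size=2)
toy_index.insert_batch(iter(range(5)))
```

Because iter_batches consumes the input lazily, insert_batch can accept a generator over a large corpus without materializing it in memory.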

static load_from_local(index_path)[source]#

Load the index from the local path.

Parameters:: index_path (str) – The path to load the index.

abstract save_to_local(index_path=None)[source]#

Serialize the index to self.index_path. If the index_path is given, the index will be serialized to the index_path.

Parameters:: index_path (str, optional) – The path to serialize the index. Defaults to self.index_path.

abstract search(query, top_k, **search_kwargs)[source]#

Search for the top_k most similar data indices to the query.

Parameters:

query (list[Any]) – The query data.
top_k (int, optional) – The number of most similar data indices to return, defaults to 10.
search_kwargs (Any) – Additional search arguments.

Returns:

The indices and scores of the top_k most similar data indices.

Return type:

tuple[np.ndarray, np.ndarray]

abstract property supremum#: Return the supremum of the similarity scores for the index.

class flexrag.retriever.index.RetrieverIndexBaseConfig(log_interval=10000, batch_size=512, index_path=None)[source]#

The configuration for the RetrieverIndexBase.

Parameters:

log_interval (int) – The interval to log the progress. Defaults to 10000.
batch_size (int) – The batch size to add data to the index. Defaults to 512.
index_path (Optional[str]) – The path to save the index. If not specified, the index will be kept in memory. Defaults to None.

dump(path)#: Dump the dataclass to a YAML file.

dumps()#: Dump the dataclass to a YAML string.

classmethod load(path)#: Load the dataclass from a YAML file.

classmethod loads(s)#: Load the dataclass from a YAML string.

class flexrag.retriever.index.RetrieverIndexConfig(index_type='faiss', bm25_config=<factory>, faiss_config=<factory>, scann_config=<factory>)#

Configuration class for index (name: RetrieverIndexConfig, default: faiss).

Parameters:

index_type (str) – The index type to use.
bm25_config (BM25IndexConfig) – The config for BM25Index.
faiss_config (FaissIndexConfig) – The config for FaissIndex.
scann_config (ScaNNIndexConfig) – The config for ScaNNIndex.

RetrieverIndexConfig is the general configuration for all registered RetrieverIndexes. You can load any RetrieverIndex by specifying the index_type in the configuration. For example, to load the BM25Index, you can use the following configuration:

from flexrag.retriever.index import RetrieverIndexConfig, RETRIEVER_INDEX, BM25IndexConfig

config = RetrieverIndexConfig(
    index_type='bm25',
    bm25_config=BM25IndexConfig(
        index_path='<path_to_index>',
    )
)
index = RETRIEVER_INDEX.load(config)

class flexrag.retriever.index.FaissIndexConfig(log_interval=10000, batch_size=512, index_path=None, query_encoder_config=<factory>, passage_encoder_config=<factory>, distance_function='IP', index_type='auto', n_subquantizers=8, n_bits=8, n_list=1000, factory_str=None, index_train_num=-1, n_probe=None, device_id=<factory>, k_factor=10, polysemous_ht=0, efSearch=100)[source]#

The configuration for the FaissIndex.

Parameters:

index_type (str) – Building param: the type of the index. Defaults to “auto”. Available choices are “FLAT”, “IVF”, “PQ”, “IVFPQ”, and “auto”. If set to “auto”, the index will be set to “IVF{n_list},PQ{embedding_size//2}x4fs”.
n_subquantizers (int) – Building param: the number of subquantizers. Defaults to 8. This parameter is only used when the index type is “PQ” or “IVFPQ”.
n_bits (int) – Building param: the number of bits per subquantizer. Defaults to 8. This parameter is only used when the index type is “PQ” or “IVFPQ”.
n_list (int) – Building param: the number of cells. Defaults to 1000. This parameter is only used when the index type is “IVF” or “IVFPQ”.
factory_str (Optional[str]) – Building param: the factory string to build the index. Defaults to None. If set, the index_type will be ignored.
index_train_num (int) – Building param: the number of data used to train the index. Defaults to -1. If set to -1, all data will be used to train the index.
n_probe (Optional[int]) – Inference param: the number of probes. Defaults to None. If not set, the number of probes will be set to n_list // 8. This parameter is only used when the index type is “IVF” or “IVFPQ”.
device_id (list[int]) – Inference param: the device(s) to use. Defaults to []. [] means CPU. If set, the index will be accelerated with GPU.
k_factor (int) – Inference param: the k factor for search. Defaults to 10.
polysemous_ht (int) – Inference param: the Hamming threshold used for polysemous filtering in PQ-based indexes. Defaults to 0.
efSearch (int) – Inference param: the efSearch for HNSW. Defaults to 100.
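
The two derived defaults described above (the “auto” factory string and the fallback for n_probe) amount to simple arithmetic. The helper names below are hypothetical and the functions only reproduce the rules as stated in the parameter descriptions; they do not call faiss or FlexRAG.

```python
def auto_factory_string(n_list, embedding_size):
    """Factory string documented for index_type="auto"."""
    return f"IVF{n_list},PQ{embedding_size // 2}x4fs"


def default_n_probe(n_list, n_probe=None):
    """Documented fallback: n_list // 8 when n_probe is not set."""
    return n_probe if n_probe is not None else n_list // 8


factory = auto_factory_string(n_list=1000, embedding_size=768)
probes = default_n_probe(n_list=1000)
```

With the default n_list of 1000 and a 768-dimensional encoder, this gives the factory string IVF1000,PQ384x4fs and 125 probes.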

dump(path)#: Dump the dataclass to a YAML file.

dumps()#: Dump the dataclass to a YAML string.

classmethod load(path)#: Load the dataclass from a YAML file.

classmethod loads(s)#: Load the dataclass from a YAML string.

class flexrag.retriever.index.FaissIndex(cfg)[source]#

Bases: DenseIndexBase

FaissIndex uses the faiss library to build and search embedding indexes. It runs on CPU and can be accelerated with GPUs, and it supports the FLAT, IVF, PQ, and IVFPQ index types as well as an automatic mode (auto).

add_embeddings(embeddings)[source]#

A helper function that adds embeddings to the index.

Parameters:: embeddings (np.ndarray) – The embeddings to add.
Returns:: None

build_index(data)[source]#

Build the index. The index will be serialized automatically if the index_path is set.

Parameters:: data (Iterable[Any]) – The data to build the index.
Returns:: None

clear()[source]#: Reset the index and remove the serialized index files.

property embedding_size#: Return the embedding size of the index.

save_to_local(index_path=None)[source]#

Serialize the index to self.index_path. If the index_path is given, the index will be serialized to the index_path.

Parameters:: index_path (str, optional) – The path to serialize the index. Defaults to self.index_path.

search(query, top_k, **search_kwargs)[source]#

Search for the top_k most similar data indices to the query.

Parameters:

query (list[Any]) – The query data.
top_k (int, optional) – The number of most similar data indices to return, defaults to 10.
search_kwargs (Any) – Additional search arguments.

Returns:

The indices and scores of the top_k most similar data indices.

Return type:

tuple[np.ndarray, np.ndarray]

class flexrag.retriever.index.ScaNNIndexConfig(log_interval=10000, batch_size=512, index_path=None, query_encoder_config=<factory>, passage_encoder_config=<factory>, distance_function='IP', num_leaves=2000, num_leaves_to_search=500, num_neighbors=10, anisotropic_quantization_threshold=0.2, dimensions_per_block=2, threads=0, index_train_num=0)[source]#

The configuration for the ScaNNIndex.

Parameters:

num_leaves (int) – The number of leaves in the tree. Defaults to 2000.
num_leaves_to_search (int) – The number of leaves to search. Defaults to 500.
num_neighbors (int) – The number of neighbors to search. Defaults to 10.
anisotropic_quantization_threshold (float) – The anisotropic quantization threshold. Defaults to 0.2.
dimensions_per_block (int) – The number of dimensions per block. Defaults to 2.
threads (int) – The number of threads to use. Defaults to 0 (auto).
index_train_num (int) – The number of samples to train the index. Defaults to 0 (all).

dump(path)#: Dump the dataclass to a YAML file.

dumps()#: Dump the dataclass to a YAML string.

classmethod load(path)#: Load the dataclass from a YAML file.

classmethod loads(s)#: Load the dataclass from a YAML string.

class flexrag.retriever.index.ScaNNIndex(cfg)[source]#

Bases: DenseIndexBase

ScaNNIndex is a wrapper for the ScaNN library.

ScaNNIndex runs on CPUs with both high speed and accuracy. However, it requires more memory than FaissIndex.

add_embeddings(embeddings)[source]#

A helper function that adds embeddings to the index.

Parameters:: embeddings (np.ndarray) – The embeddings to add.
Returns:: None

build_index(data)[source]#

Build the index. The index will be serialized automatically if the index_path is set.

Parameters:: data (Iterable[Any]) – The data to build the index.
Returns:: None

clear()[source]#: Reset the index and remove the serialized index files.

property embedding_size#: Return the embedding size of the index.

save_to_local(index_path=None)[source]#

Serialize the index to self.index_path. If the index_path is given, the index will be serialized to the index_path.

Parameters:: index_path (str, optional) – The path to serialize the index. Defaults to self.index_path.

search(query, top_k, **search_kwargs)[source]#

Search for the top_k most similar data indices to the query.

Parameters:

query (list[Any]) – The query data.
top_k (int, optional) – The number of most similar data indices to return, defaults to 10.
search_kwargs (Any) – Additional search arguments.

Returns:

The indices and scores of the top_k most similar data indices.

Return type:

tuple[np.ndarray, np.ndarray]

class flexrag.retriever.index.BM25IndexConfig(log_interval=10000, batch_size=512, index_path=None, method='lucene', idf_method=None, backend='auto', k1=1.5, b=0.75, delta=0.5, lang='english')[source]#

Configuration class for BM25Index.

Parameters:

method (str) – BM25S method. Default: “lucene”. Available options: “atire”, “bm25l”, “bm25+”, “lucene”, “robertson”.
idf_method (Optional[str]) – IDF method. Default: None. Available options: “atire”, “bm25l”, “bm25+”, “lucene”, “robertson”.
backend (str) – Backend for BM25S. Default: “auto”. Available options: “numpy”, “numba”, “auto”.
k1 (float) – BM25S parameter k1. Default: 1.5.
b (float) – BM25S parameter b. Default: 0.75.
delta (float) – BM25S parameter delta. Default: 0.5.
lang (str) – Language for Tokenization. Default: “english”.
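
To show what k1 and b control, here is a small pure-Python scorer using a Lucene-style BM25 formulation. It is a sketch of the general algorithm, not the bm25s implementation: the function name is hypothetical, tokenization is assumed to have been done already, and delta (used by the “bm25l” and “bm25+” variants) is left out.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Score every document in corpus_tokens against query_tokens."""
    n_docs = len(corpus_tokens)
    avg_len = sum(len(doc) for doc in corpus_tokens) / n_docs
    # Number of documents that contain each term.
    doc_freq = Counter(term for doc in corpus_tokens for term in set(doc))
    scores = []
    for doc in corpus_tokens:
        tf = Counter(doc)
        # Length normalization: b=0 disables it, b=1 applies it fully.
        norm = k1 * (1 - b + b * len(doc) / avg_len)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            df = doc_freq[term]
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            # k1 controls how quickly repeated terms saturate.
            score += idf * tf[term] / (tf[term] + norm)
        scores.append(score)
    return scores


corpus = [
    ["cat", "sat", "mat"],
    ["dog", "sat", "log"],
    ["cat", "cat", "hat"],
]
scores = bm25_scores(["cat"], corpus)
```

The third document mentions the query term twice and scores highest, the second does not contain it and scores zero.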

dump(path)#: Dump the dataclass to a YAML file.

dumps()#: Dump the dataclass to a YAML string.

classmethod load(path)#: Load the dataclass from a YAML file.

classmethod loads(s)#: Load the dataclass from a YAML string.

class flexrag.retriever.index.BM25Index(cfg)[source]#

Bases: RetrieverIndexBase

BM25Index is an index that retrieves passages using the BM25 algorithm. The implementation is based on the bm25s project.

build_index(data)[source]#

Build the index. The index will be serialized automatically if the index_path is set.

Parameters:: data (Iterable[Any]) – The data to build the index.
Returns:: None

clear()[source]#: Reset the index and remove the serialized index files.

property infimum#: Return the infimum of the similarity scores for the index.

insert(data)[source]#

Add a batch of data to the index.

Parameters:

data (list[Any]) – The data to add.
serialize (bool) – Whether to serialize the index after adding data. Defaults to True.

Returns:

None

save_to_local(index_path=None)[source]#

Serialize the index to self.index_path. If the index_path is given, the index will be serialized to the index_path.

Parameters:: index_path (str, optional) – The path to serialize the index. Defaults to self.index_path.

search(query, top_k, **search_kwargs)[source]#

Search for the top_k most similar data indices to the query.

Parameters:

query (list[Any]) – The query data.
top_k (int, optional) – The number of most similar data indices to return, defaults to 10.
search_kwargs (Any) – Additional search arguments.

Returns:

The indices and scores of the top_k most similar data indices.

Return type:

tuple[np.ndarray, np.ndarray]

property supremum#: Return the supremum of the similarity scores for the index.

class flexrag.retriever.index.MultiFieldIndexConfig(indexed_fields, merge_method='max')[source]#

Configuration for MultiFieldIndex.

Parameters:

indexed_fields (list[str]) – Fields to be indexed. If more than one field is specified, each field will be processed separately and pointed to the same id.
merge_method (str) – The method to merge the scores of the same context id. Available options are “max”, “sum”, “mean”, and “concat”. “max” will take the maximum score of the same context id. “sum” will take the sum of the scores of the same context id. “mean” will take the average of the scores of the same context id. “concat” will concatenate the texts of each field and index them together. Note that “concat” is only available if all indexed fields are of type str. If only one field is specified, this argument will be ignored. Defaults to “max”.
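
The score-merging options can be illustrated with a short sketch. The function name and the (context_id, score) input format are hypothetical; “concat” is not covered because it changes what is indexed rather than how scores are combined.

```python
from collections import defaultdict
from statistics import mean


def merge_field_scores(hits, merge_method="max"):
    """Combine several (context_id, score) hits per context id into one score."""
    reducers = {"max": max, "sum": sum, "mean": mean}
    if merge_method not in reducers:
        raise ValueError(f"Unsupported merge_method: {merge_method}")
    grouped = defaultdict(list)
    for context_id, score in hits:
        grouped[context_id].append(score)
    combine = reducers[merge_method]
    merged = {cid: combine(field_scores) for cid, field_scores in grouped.items()}
    return sorted(merged.items(), key=lambda item: item[1], reverse=True)


# doc1 has one strong field; doc2 has two moderately strong fields.
field_hits = [("doc1", 0.9), ("doc1", 0.2), ("doc2", 0.6), ("doc2", 0.6)]
best_by_max = merge_field_scores(field_hits, "max")[0][0]
best_by_sum = merge_field_scores(field_hits, "sum")[0][0]
best_by_mean = merge_field_scores(field_hits, "mean")[0][0]
```

Here “max” favors doc1 because of its single strong field, while “sum” and “mean” favor doc2, whose fields match more evenly.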

dump(path)#: Dump the dataclass to a YAML file.

dumps()#: Dump the dataclass to a YAML string.

classmethod load(path)#: Load the dataclass from a YAML file.

classmethod loads(s)#: Load the dataclass from a YAML string.

class flexrag.retriever.index.MultiFieldIndex(cfg, index)[source]#

Bases: object

A wrapper index for multiple field contexts.

build_index(context_ids, data)[source]#

Build the index. The index will be serialized automatically if the index_path is set.

Parameters:

context_ids (Iterable[str]) – The context ids of the data.
data (Iterable[dict[str, Any]]) – The data to build the index.

Returns:

None

clear()[source]#: Clear the index.

insert(context_ids, data, serialize=True)[source]#

Add a batch of data to the index.

Parameters:

context_ids (list[str]) – The context ids of the data.
data (list[dict[str, Any]]) – The data to add.
serialize (bool) – Whether to serialize the index after adding data. Defaults to True.

Returns:

None

insert_batch(context_ids, data, batch_size=None, serialize=True)[source]#

Add data to the index in batches. This method will automatically perform the serialize method if the index_path is set.

Parameters:

context_ids (Iterable[str]) – The context ids of the data.
data (Iterable[dict[str, Any]]) – The data to add.
batch_size (int) – The batch size to add data to the index. Defaults to self.batch_size.
serialize (bool) – Whether to serialize the index after adding data. Defaults to True.

Returns:

None

property is_addable#: Check if the index is addable.

save_to_local(index_path=None)[source]#

Serialize the index to the given path.

Parameters:: index_path (str) – The path to save the index. If None, the index will be saved to self.index.cfg.index_path.
Returns:: None

search(query, top_k, **search_kwargs)[source]#

Search for the top_k most similar data indices to the query.

Parameters:

query (list[Any]) – The query data.
top_k (int, optional) – The number of most similar data indices to return, defaults to 10.
search_kwargs (Any) – Additional search arguments.

Returns:

The indices and scores of the top_k most similar data indices.

Return type:

tuple[list[list[str]], np.ndarray]

search_batch(query, top_k, **search_kwargs)[source]#

Search for the top_k most similar data indices to the query. This method will search the index in batches.

Parameters:

query (list[Any]) – The query data.
top_k (int, optional) – The number of most similar data indices to return, defaults to 10.
batch_size (Optional[int]) – The batch size to search. Defaults to self.batch_size.
search_kwargs (Any) – Additional search arguments.

Returns:

The indices and scores of the top_k most similar data indices.

Return type:

tuple[list[list[str]], np.ndarray]

Web Retriever#

WebRetriever is used to retrieve data from the web. Unlike EditableRetriever, web retrievers can be used without building a knowledge base, because they retrieve data through web search engines.

class flexrag.retriever.web_retrievers.WebRetrieverBaseConfig(log_interval=100, top_k=10, batch_size=32, query_preprocess_pipeline=<factory>, retry_times=3, retry_delay=0.5)[source]#

The configuration for the WebRetrieverBase.

Parameters:

retry_times (int) – The number of times to retry. Default is 3.
retry_delay (float) – The delay between retries. Default is 0.5.

dump(path)#: Dump the dataclass to a YAML file.

dumps()#: Dump the dataclass to a YAML string.

classmethod load(path)#: Load the dataclass from a YAML file.

classmethod loads(s)#: Load the dataclass from a YAML string.
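How retry_times and retry_delay might interact can be sketched with a small retry helper. This is an assumption-laden illustration: call_with_retry is a hypothetical name, and treating retry_times as the total number of attempts (rather than retries after the first failure) is a guess, not documented behavior.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(fn: Callable[[], T], retry_times: int = 3, retry_delay: float = 0.5) -> T:
    """Call fn up to retry_times times, sleeping retry_delay seconds between failed attempts."""
    last_error = None
    for attempt in range(retry_times):
        try:
            return fn()
        except Exception as err:  # a real retriever would likely catch narrower errors
            last_error = err
            if attempt < retry_times - 1:
                time.sleep(retry_delay)
    raise RuntimeError(f"failed after {retry_times} attempts") from last_error


calls = {"n": 0}


def flaky() -> str:
    # Fails twice with a transient error, then succeeds.
    calls["n"] += 1
    if calls["n"] < 3:
        raise ValueError("transient")
    return "ok"


result = call_with_retry(flaky, retry_times=3, retry_delay=0.0)
```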

class flexrag.retriever.web_retrievers.WebRetrieverBase(cfg)[source]#

The base class for the WebRetriever.

The WebRetriever is used to retrieve relevant information from the web. The subclasses should implement the search_item method.

async async_search(query, **search_kwargs)#: Search queries asynchronously.

abstract property fields#: The fields of the retrieved data.

search(**kwargs)#

Search a batch of queries.

Parameters:

query (list[Any] | Any) – Queries to search.
search_kwargs (Any) – Keyword arguments, contains other search arguments.

Returns:

A batch of lists, each containing k RetrievedContext objects.

Return type:

list[list[RetrievedContext]]

abstract search_item(query, top_k, **search_kwargs)[source]#

Search the query from the web.

Parameters:

query (str) – The query to search.
top_k (int) – The number of documents to return.

Returns:

The retrieved contexts.

Return type:

list[RetrievedContext]

test_speed(sample_num=10000, test_times=10, **search_kwargs)#

Test the speed of the retriever.

Parameters:

sample_num (int, optional) – The number of samples to test.
test_times (int, optional) – The number of times to test.

Returns:

The time consumed for retrieval.

Return type:

float

class flexrag.retriever.web_retrievers.WebResource(url, query=None, metadata=<factory>, data=None)[source]#

The web resource dataclass. WebResource is the fundamental component for information transmission in the web_retrievers module of FlexRAG. The WebSeeker retrieves the corresponding WebResource based on the user’s query, while the WebDownloader downloads the resource based on the URL in the WebResource and stores it in the data field of the WebResource. The WebReader then converts the data field of the WebResource into an LLM-friendly format and returns the RetrievedContext.

Parameters:

url (str) – The URL of the resource.
query (Optional[str]) – The query for the resource. Default is None.
metadata (dict) – The metadata of the resource, often provided by the WebSeeker. Default is {}.
data (Any) – The content of the resource, often filled by the WebDownloader. Default is None.

dump(path)#: Dump the dataclass to a YAML file.

dumps()#: Dump the dataclass to a YAML string.

classmethod load(path)#: Load the dataclass from a YAML file.

classmethod loads(s)#: Load the dataclass from a YAML string.
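The seek, download, and read flow described for WebResource can be sketched with stand-in types. Everything below is hypothetical: Resource only mirrors the four documented fields, the three functions are placeholders for a WebSeeker, a WebDownloader, and a WebReader, and plain dicts stand in for RetrievedContext.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class Resource:
    """Stand-in with the same four fields as WebResource."""

    url: str
    query: Optional[str] = None
    metadata: dict = field(default_factory=dict)
    data: Any = None


def seek(query: str) -> list[Resource]:
    # Seeker step: produce resources (URL plus metadata) for the query.
    return [Resource(url="https://example.com/a", query=query, metadata={"title": "A"})]


def download(resources: list[Resource]) -> list[Resource]:
    # Downloader step: fill the data field. A real downloader would fetch the URL.
    for res in resources:
        res.data = f"<html><body>content of {res.url}</body></html>"
    return resources


def read(resources: list[Resource]) -> list[dict]:
    # Reader step: convert the data field into plain text for the LLM.
    contexts = []
    for res in resources:
        text = res.data.replace("<html><body>", "").replace("</body></html>", "")
        contexts.append({"source": res.url, "query": res.query, "text": text})
    return contexts


contexts = read(download(seek("what is RAG?")))
```

Each stage only touches its own part of the resource, which is why the stages can be swapped independently.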

FlexRAG provides two simple web retrievers, SimpleWebRetriever and WikipediaRetriever.

class flexrag.retriever.SimpleWebRetrieverConfig(search_engine_type=None, bing_config=<factory>, ddg_config=<factory>, google_config=<factory>, serpapi_config=<factory>, web_reader_type=None, jina_readerlm_config=<factory>, jina_reader_config=<factory>, screenshot_config=<factory>, log_interval=100, top_k=10, batch_size=32, query_preprocess_pipeline=<factory>, retry_times=3, retry_delay=0.5)[source]#

The configuration for the SimpleWebRetriever.

dump(path)#: Dump the dataclass to a YAML file.

dumps()#: Dump the dataclass to a YAML string.

classmethod load(path)#: Load the dataclass from a YAML file.

classmethod loads(s)#: Load the dataclass from a YAML string.

class flexrag.retriever.SimpleWebRetriever(cfg)[source]#

Bases: WebRetrieverBase

SimpleWebRetriever seeks the most relevant web pages using an existing search engine and reads their content using the WebReader.

property fields#: The fields of the retrieved data.

search_item(query, top_k=10, **search_kwargs)[source]#

Search the query from the web.

Parameters:

query (str) – The query to search.
top_k (int) – The number of documents to return.

Returns:

The retrieved contexts.

Return type:

list[RetrievedContext]

class flexrag.retriever.WikipediaRetrieverConfig(log_interval=100, top_k=10, batch_size=32, query_preprocess_pipeline=<factory>, search_url='https://en.wikipedia.org/w/index.php?search=', proxy=None)[source]#

The configuration for the WikipediaRetriever.

Parameters:

search_url (str) – The search URL for Wikipedia. Default is “https://en.wikipedia.org/w/index.php?search=”.
proxy (Optional[str]) – The proxy to use. Default is None.

dump(path)#: Dump the dataclass to a YAML file.

dumps()#: Dump the dataclass to a YAML string.

classmethod load(path)#: Load the dataclass from a YAML file.

classmethod loads(s)#: Load the dataclass from a YAML string.
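Since search_url ends with "search=", a request URL can presumably be formed by appending the encoded query. The sketch below shows one way to do that with the standard library; build_search_url is a hypothetical helper, and how WikipediaRetriever actually encodes the query is not documented here.

```python
from urllib.parse import quote_plus

SEARCH_URL = "https://en.wikipedia.org/w/index.php?search="


def build_search_url(query: str, search_url: str = SEARCH_URL) -> str:
    """Append the URL-encoded query to the configured search URL."""
    return search_url + quote_plus(query)


url = build_search_url("Alan Turing")
```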

class flexrag.retriever.WikipediaRetriever(cfg)[source]#

Bases: RetrieverBase

WikipediaRetriever retrieves information from Wikipedia directly. Adapted from ysymyth/ReAct.

property fields#: The fields of the retrieved data.

search(query, delay=0.1, **search_kwargs)[source]#

Search a batch of queries.

Parameters:

query (list[Any] | Any) – Queries to search.
search_kwargs (Any) – Keyword arguments, contains other search arguments.

Returns:

A batch of lists, each containing k RetrievedContext objects.

Return type:

list[list[RetrievedContext]]

Web Seeker#

WebSeeker is used to search the web for resources that match a given query. The web resources could be sought by walking through a set of given web pages, by using a search engine, etc. FlexRAG provides several web seekers using existing search engines.

class flexrag.retriever.web_retrievers.WebSeekerBase[source]#

The base class for the WebSeeker. The WebSeeker is used to seek the web resources for a given query. The web resources could be sought by walking through a set of given web pages, by using a search engine, etc.

The subclasses should implement the seek method.

abstract seek(query, top_k=10, **kwargs)[source]#

Seek the web resources.

Parameters:

query (str) – The query to seek.
top_k (int) – The number of resources to seek. Default is 10.
kwargs – The additional keyword arguments.

Returns:

The web resources.

Return type:

list[WebResource]
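As an illustration of the "walking through a set of given web pages" strategy, the sketch below ranks a fixed set of pages by word overlap with the query. It does not subclass WebSeekerBase so that it stays self-contained: FixedPagesSeeker and Resource are hypothetical stand-ins, and the "snippet" metadata key is an assumption.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class Resource:
    """Stand-in with the same four fields as WebResource."""

    url: str
    query: Optional[str] = None
    metadata: dict = field(default_factory=dict)
    data: Any = None


class FixedPagesSeeker:
    """Toy seeker that ranks a fixed set of pages by word overlap with the query."""

    def __init__(self, pages: dict[str, str]) -> None:
        self.pages = pages  # maps URL to a short description

    def seek(self, query: str, top_k: int = 10, **kwargs) -> list[Resource]:
        terms = set(query.lower().split())
        scored = []
        for url, description in self.pages.items():
            overlap = len(terms & set(description.lower().split()))
            if overlap > 0:
                scored.append((overlap, url, description))
        # Highest overlap first; ties keep their insertion order.
        scored.sort(key=lambda item: item[0], reverse=True)
        return [
            Resource(url=url, query=query, metadata={"snippet": description})
            for _, url, description in scored[:top_k]
        ]


seeker = FixedPagesSeeker(
    {
        "https://example.com/rag": "retrieval augmented generation overview",
        "https://example.com/cats": "pictures of cats",
        "https://example.com/dense": "dense retrieval models",
    }
)
results = seeker.seek("dense retrieval", top_k=2)
```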

class flexrag.retriever.web_retrievers.WebSeekerConfig(web_seeker_type=None, bing_config=<factory>, ddg_config=<factory>, google_config=<factory>, serpapi_config=<factory>)#

Configuration class for web_seeker (name: WebSeekerConfig, default: None).

Parameters:

web_seeker_type (str) – The web_seeker type to use.
bing_config (BingEngineConfig) – The config for BingEngine.
ddg_config (DuckDuckGoEngineConfig) – The config for DuckDuckGoEngine.
google_config (GoogleEngineConfig) – The config for GoogleEngine.
serpapi_config (SerpApiConfig) – The config for SerpApi.

WebSeekerConfig is the general configuration for all registered WebSeekers. You can load any WebSeeker by specifying the web_seeker_type in the configuration. For example, to load the DuckDuckGoEngine, you can use the following configuration:

from flexrag.retriever.web_retrievers import WebSeekerConfig, WEB_SEEKERS

config = WebSeekerConfig(
    web_seeker_type='ddg',
)
seeker = WEB_SEEKERS.load(config)
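The type-keyed loading pattern used above can be pictured with a miniature registry. This is only a sketch of the general idea, not FlexRAG's registry: DdgConfig, DdgSeeker, SeekerConfig, REGISTRY, and load are all hypothetical names, and returning None when no type is set is an assumption.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DdgConfig:
    proxy: Optional[str] = None


class DdgSeeker:
    def __init__(self, cfg: DdgConfig) -> None:
        self.cfg = cfg


@dataclass
class SeekerConfig:
    # One type selector plus one sub-config per registered seeker.
    web_seeker_type: Optional[str] = None
    ddg_config: DdgConfig = field(default_factory=DdgConfig)


# Maps a type name to (class, name of the matching sub-config field).
REGISTRY = {"ddg": (DdgSeeker, "ddg_config")}


def load(cfg: SeekerConfig):
    """Instantiate the seeker selected by web_seeker_type with its own sub-config."""
    if cfg.web_seeker_type is None:
        return None  # assumption: nothing is loaded when no type is selected
    cls, cfg_field = REGISTRY[cfg.web_seeker_type]
    return cls(getattr(cfg, cfg_field))


seeker = load(SeekerConfig(web_seeker_type="ddg", ddg_config=DdgConfig(proxy="http://127.0.0.1:7890")))
```

Only the sub-config that matches the selected type is used; the others keep their defaults and are ignored.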

class flexrag.retriever.web_retrievers.SearchEngineConfig(search_engine_type=None, bing_config=<factory>, ddg_config=<factory>, google_config=<factory>, serpapi_config=<factory>)#

Configuration class for search_engine (name: SearchEngineConfig, default: None).

Parameters:

search_engine_type (str) – The search_engine type to use.
bing_config (BingEngineConfig) – The config for BingEngine.
ddg_config (DuckDuckGoEngineConfig) – The config for DuckDuckGoEngine.
google_config (GoogleEngineConfig) – The config for GoogleEngine.
serpapi_config (SerpApiConfig) – The config for SerpApi.

SearchEngine is a type of WebSeeker that searches for web resources by leveraging existing search engines. SearchEngineConfig is the general configuration for all registered SearchEngines. You can load any SearchEngine by specifying the search_engine_type in the configuration. For example, to load the DuckDuckGoEngine, you can use the following configuration:

from flexrag.retriever.web_retrievers import SearchEngineConfig, SEARCH_ENGINES

config = SearchEngineConfig(
    search_engine_type='ddg',
)
seeker = SEARCH_ENGINES.load(config)

class flexrag.retriever.web_retrievers.BingEngineConfig(subscription_key='EMPTY', base_url='https://api.bing.microsoft.com/v7.0/search', timeout=3.0, market='en-US', lang='en', freshness=None)[source]#

The configuration for the BingEngine.

Parameters:

subscription_key (str) – The subscription key for the Bing Search API. Default is os.environ.get(“BING_SEARCH_KEY”, “EMPTY”).
base_url (str) – The base_url for the Bing Search API. Default is “https://api.bing.microsoft.com/v7.0/search”.
timeout (float) – The timeout for the requests. Default is 3.0.
market (str) – The market to use (see https://learn.microsoft.com/en-us/bing/search-apis/bing-web-search/reference/market-codes for more information). Default is “en-US”.
lang (str) – The language to use. Default is “en”.
freshness (Optional[str]) – To get articles discovered by Bing during a specific timeframe, specify a date range in the form, YYYY-MM-DD..YYYY-MM-DD. For example, &freshness=2019-02-01..2019-05-30. Default is None.

dump(path)#: Dump the dataclass to a YAML file.

dumps()#: Dump the dataclass to a YAML string.

classmethod load(path)#: Load the dataclass from a YAML file.

classmethod loads(s)#: Load the dataclass from a YAML string.
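The freshness date range has the form YYYY-MM-DD..YYYY-MM-DD, which can be built from date objects instead of being typed by hand. The helper below is a hypothetical convenience function, not part of FlexRAG; its result would be passed as the freshness value.

```python
from datetime import date


def freshness_range(start: date, end: date) -> str:
    """Format a date range as YYYY-MM-DD..YYYY-MM-DD."""
    if start > end:
        raise ValueError("start must not be after end")
    return f"{start.isoformat()}..{end.isoformat()}"


freshness = freshness_range(date(2019, 2, 1), date(2019, 5, 30))
```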

class flexrag.retriever.web_retrievers.BingEngine(cfg)[source]#

Bases: WebSeekerBase

The BingEngine retrieves the web pages using the Bing Search API.

seek(query, top_k=10, **search_kwargs)[source]#

Seek the web resources.

Parameters:

query (str) – The query to seek.
top_k (int) – The number of resources to seek. Default is 10.
search_kwargs – The additional keyword arguments.

Returns:

The web resources.

Return type:

list[WebResource]

class flexrag.retriever.web_retrievers.DuckDuckGoEngineConfig(proxy=None)[source]#

The configuration for the DuckDuckGoEngine.

Parameters: proxy (Optional[str]) – The proxy to use. Default is None.

dump(path)#: Dump the dataclass to a YAML file.

dumps()#: Dump the dataclass to a YAML string.

classmethod load(path)#: Load the dataclass from a YAML file.

classmethod loads(s)#: Load the dataclass from a YAML string.

class flexrag.retriever.web_retrievers.DuckDuckGoEngine(cfg)[source]#

Bases: WebSeekerBase

The DuckDuckGoEngine retrieves the web pages using the DuckDuckGo Search API.

seek(query, top_k=10, **search_kwargs)[source]#

Seek the web resources.

Parameters:

query (str) – The query to seek.
top_k (int) – The number of resources to seek. Default is 10.
search_kwargs – The additional keyword arguments.

Returns:

The web resources.

Return type:

list[WebResource]

class flexrag.retriever.web_retrievers.GoogleEngineConfig(subscription_key=None, search_engine_id=None, endpoint='https://customsearch.googleapis.com/customsearch/v1', proxy=None, timeout=3.0)[source]#

The configuration for the GoogleEngine.

Parameters:

subscription_key (str) – The subscription key for the Google Search API. If not provided, it will use the environment variable GOOGLE_SEARCH_KEY. Defaults to None.
search_engine_id (str) – The search engine id for the Google Search API. If not provided, it will use the environment variable GOOGLE_SEARCH_ENGINE_ID. Defaults to None.
endpoint (str) – The endpoint for the Google Search API. Default is “https://customsearch.googleapis.com/customsearch/v1”.
proxy (Optional[str]) – The proxy to use. Default is None.
timeout (float) – The timeout for the requests. Default is 3.0.

dump(path)#: Dump the dataclass to a YAML file.

dumps()#: Dump the dataclass to a YAML string.

classmethod load(path)#: Load the dataclass from a YAML file.

classmethod loads(s)#: Load the dataclass from a YAML string.
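The "explicit value, else environment variable" fallback described for subscription_key and search_engine_id can be sketched as below. resolve_key is a hypothetical helper, and raising an error when neither source provides a value is an assumption about reasonable behavior, not documented FlexRAG behavior.

```python
import os
from typing import Optional


def resolve_key(explicit: Optional[str], env_name: str) -> str:
    """Prefer the explicitly configured value and fall back to the environment variable."""
    key = explicit if explicit is not None else os.environ.get(env_name)
    if not key:
        raise ValueError(f"no key given and {env_name} is not set")
    return key


os.environ["GOOGLE_SEARCH_KEY"] = "from-env"  # set here for demonstration only
from_env = resolve_key(None, "GOOGLE_SEARCH_KEY")
from_cfg = resolve_key("explicit", "GOOGLE_SEARCH_KEY")
```

Keeping keys in environment variables avoids writing secrets into YAML files produced by dump.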

class flexrag.retriever.web_retrievers.GoogleEngine(cfg)[source]#

Bases: WebSeekerBase

The GoogleEngine retrieves the web pages using the Google Custom Search API.

seek(query, top_k=10, **search_kwargs)[source]#

Seek the web resources.

Parameters:

query (str) – The query to seek.
top_k (int) – The number of resources to seek. Default is 10.
search_kwargs – The additional keyword arguments.

Returns:

The web resources.

Return type:

list[WebResource]

class flexrag.retriever.web_retrievers.SerpApiConfig(api_key=None, engine='google', country='us', language='en')[source]#

The configuration for the SerpApi.

Parameters:

api_key (str) – The API key for the SerpApi. If not provided, it will use the environment variable SERPAPI_API_KEY. Defaults to None.
engine (str) – The search engine to use. Default is “google”. Available choices are “google”, “bing”, “baidu”, “yandex”, “yahoo”, “google_scholar”, “duckduckgo”.
country (str) – The country to search. Default is “us”.
language (str) – The language to search. Default is “en”.

dump(path)#: Dump the dataclass to a YAML file.

dumps()#: Dump the dataclass to a YAML string.

classmethod load(path)#: Load the dataclass from a YAML file.

classmethod loads(s)#: Load the dataclass from a YAML string.

class flexrag.retriever.web_retrievers.SerpApi(cfg)[source]#

Bases: WebSeekerBase

The SerpApi retrieves the web pages using SerpApi (https://serpapi.com/).

seek(query, top_k=10, **search_kwargs)[source]#

Seek the web resources.

Parameters:

query (str) – The query to seek.
top_k (int) – The number of resources to seek. Default is 10.
search_kwargs – The additional keyword arguments.

Returns:

The web resources.

Return type:

list[WebResource]

Web Downloader#

Web downloader is used to download data from the web.

class flexrag.retriever.web_retrievers.WebDownloaderBaseConfig(allow_parallel=True)[source]#

The configuration for the WebDownloaderBase.

Parameters: allow_parallel (bool) – Whether to allow parallel downloading. Default is True.

dump(path)#: Dump the dataclass to a YAML file.

dumps()#: Dump the dataclass to a YAML string.

classmethod load(path)#: Load the dataclass from a YAML file.

classmethod loads(s)#: Load the dataclass from a YAML string.

class flexrag.retriever.web_retrievers.WebDownloaderBase(cfg)[source]#

The base class for the WebDownloader.

async async_download(resources)[source]#: Download the web resources asynchronously.

download(resources)[source]#

Download the web resources.

Parameters: resources (WebResource | list[WebResource]) – The resources to download.
Returns: The downloaded web resources.
Return type: list[WebResource]
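The relationship between download, async_download, and allow_parallel might look like the following sketch. It is an illustration under assumptions: fetch is a placeholder for a real HTTP request, the function names are hypothetical, and whether FlexRAG uses asyncio.gather internally is not stated in this reference.

```python
import asyncio


async def fetch(url: str) -> str:
    # Placeholder for a real HTTP request.
    await asyncio.sleep(0)
    return f"body of {url}"


async def async_download_all(urls: list[str], allow_parallel: bool = True) -> list[str]:
    if allow_parallel:
        # Run all requests concurrently; gather keeps the input order.
        return list(await asyncio.gather(*(fetch(url) for url in urls)))
    # Otherwise fetch one URL at a time.
    return [await fetch(url) for url in urls]


def download_all(urls: list[str], allow_parallel: bool = True) -> list[str]:
    """Synchronous wrapper around the asynchronous implementation."""
    return asyncio.run(async_download_all(urls, allow_parallel))


parallel = download_all(["a", "b"], allow_parallel=True)
sequential = download_all(["a", "b"], allow_parallel=False)
```

Both modes return results in the same order as the input, so callers do not need to care which one was used.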

class flexrag.retriever.web_retrievers.SimpleWebDownloaderConfig(allow_parallel=True, proxy=None, timeout=3.0, headers=None)[source]#

The configuration for the SimpleWebDownloader.

Parameters:

proxy (Optional[str]) – The proxy to use. Default is None.
timeout (float) – The timeout for the requests. Default is 3.0.
headers (Optional[dict]) – The headers to use. Default is None.

dump(path)#: Dump the dataclass to a YAML file.

dumps()#: Dump the dataclass to a YAML string.

classmethod load(path)#: Load the dataclass from a YAML file.

classmethod loads(s)#: Load the dataclass from a YAML string.

class flexrag.retriever.web_retrievers.SimpleWebDownloader(cfg)[source]#

Bases: WebDownloaderBase

Downloads the HTML content using httpx.

class flexrag.retriever.web_retrievers.PlaywrightWebDownloaderConfig(allow_parallel=True, headless=True, browser='chromium', device='Desktop Chrome', page_width=None, page_height=None, proxy=None, return_screenshot=False)[source]#

The configuration for the PlaywrightWebDownloader.

Parameters:

headless (bool) – Whether to run the browser in headless mode. Default is True.
browser (str) – The browser to use. Default is chromium. Available choices are chromium, firefox, webkit, and msedge.
device (str) – The device to emulate. Default is Desktop Chrome.
page_width (Optional[int]) – The width of the emulated device. Default is None.
page_height (Optional[int]) – The height of the emulated device. Default is None.
proxy (Optional[str]) – The proxy to use. Default is None.
return_screenshot (bool) – Whether to return the screenshot. Default is False.

dump(path)#: Dump the dataclass to a YAML file.

dumps()#: Dump the dataclass to a YAML string.

classmethod load(path)#: Load the dataclass from a YAML file.

classmethod loads(s)#: Load the dataclass from a YAML string.

class flexrag.retriever.web_retrievers.PlaywrightWebDownloader(cfg)[source]#

Bases: WebDownloaderBase

Download the web resources using playwright.

async async_download(resources)[source]#: Download the web resources asynchronously.

download(resources)[source]#

Download the web resources.

Parameters: resources (WebResource | list[WebResource]) – The resources to download.
Returns: The downloaded web resources.
Return type: list[WebResource]

Web Reader#

Web reader is used to convert web data into an LLM-friendly format.

class flexrag.retriever.web_retrievers.WebReaderBase[source]#

The base class for the WebReader. The WebReader is used to parse the web resources into a format that can be fed into the LLM.

abstract property fields#: The fields that the reader will return.

abstract read(resources)[source]#

Parse the retrieved contexts into LLM readable format.

Parameters: resources (list[WebResource]) – Resources sought from the web.
Returns: Contexts that can be fed into the LLM.
Return type: list[RetrievedContext]

class flexrag.retriever.web_retrievers.WebReaderConfig(web_reader_type=None, jina_readerlm_config=<factory>, jina_reader_config=<factory>, screenshot_config=<factory>)#

Configuration class for web_reader (name: WebReaderConfig, default: None).

Parameters:

web_reader_type (str) – The web_reader type to use.
jina_readerlm_config (JinaReaderLMConfig) – The config for JinaReaderLM.
jina_reader_config (JinaReaderConfig) – The config for JinaReader.
screenshot_config (ScreenshotWebReaderConfig) – The config for ScreenshotWebReader.

WebReaderConfig is the general configuration for all registered WebReaders. You can load any WebReader by specifying the web_reader_type in the configuration. For example, to load the JinaReader, you can use the following configuration:

from flexrag.retriever.web_retrievers import WebReaderConfig, WEB_READERS

config = WebReaderConfig(
    web_reader_type='jina_reader',
)
reader = WEB_READERS.load(config)

class flexrag.retriever.web_retrievers.JinaReaderConfig(base_url='https://r.jina.ai', api_key=None, proxy=None)[source]#

The configuration for the JinaReader.

Parameters:

base_url (str) – The base URL of the Jina Reader API. Default is “https://r.jina.ai”.
api_key (str) – The API key for the Jina Reader API. If not provided, it will use the environment variable JINA_API_KEY. Defaults to None.
proxy (Optional[str]) – The proxy to use. Defaults to None.

dump(path)#: Dump the dataclass to a YAML file.

dumps()#: Dump the dataclass to a YAML string.

classmethod load(path)#: Load the dataclass from a YAML file.

classmethod loads(s)#: Load the dataclass from a YAML string.

class flexrag.retriever.web_retrievers.JinaReader(cfg)[source]#

Bases: WebReaderBase

The JinaReader parses the web pages using the Jina Reader API.

property fields#: The JinaReader will return the processed_content field.

read(resources)[source]#

Parse the retrieved contexts into LLM readable format.

Parameters: resources (list[WebResource]) – Resources sought from the web.
Returns: Contexts that can be fed into the LLM.
Return type: list[RetrievedContext]

class flexrag.retriever.web_retrievers.JinaReaderLMConfig(do_sample=True, sample_num=1, temperature=1.0, max_new_tokens=512, top_p=0.9, top_k=50, eos_token_id=None, stop_str=<factory>, web_downloader_type=None, simple_config=<factory>, playwright_config=<factory>, generator_type=None, anthropic_config=<factory>, hf_config=<factory>, hf_vlm_config=<factory>, ollama_config=<factory>, openai_config=<factory>, vllm_config=<factory>, use_v2_prompt=False, pre_clean_html=False, clean_svg=False, clean_base64=False)[source]#

The configuration for the JinaReaderLM.

Parameters:

use_v2_prompt (bool) – Whether to use the jinaai/ReaderLM-v2 prompt. Default is False.
pre_clean_html (bool) – Whether to pre-clean the HTML content. Default is False.
clean_svg (bool) – Whether to clean the SVG content. Default is False.
clean_base64 (bool) – Whether to clean the base64 images. Default is False.

dump(path)#: Dump the dataclass to a YAML file.

dumps()#: Dump the dataclass to a YAML string.

classmethod load(path)#: Load the dataclass from a YAML file.

classmethod loads(s)#: Load the dataclass from a YAML string.
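What clean_svg and clean_base64 could mean in practice is sketched below with regular expressions. The patterns and the clean_html function are illustrative assumptions, not the cleaning rules JinaReaderLM actually applies, and regex-based HTML cleaning is only approximate.

```python
import re

# Inline <svg>...</svg> elements, including their children.
SVG_RE = re.compile(r"<svg\b[^>]*>.*?</svg>", re.IGNORECASE | re.DOTALL)
# <img> tags whose src attribute is an embedded base64 data URI.
BASE64_IMG_RE = re.compile(
    r'<img\b[^>]*\bsrc="data:image/[^;"]+;base64,[^"]*"[^>]*>', re.IGNORECASE
)


def clean_html(html: str, clean_svg: bool = False, clean_base64: bool = False) -> str:
    """Remove bulky markup that carries little text before passing HTML to a model."""
    if clean_svg:
        html = SVG_RE.sub("", html)
    if clean_base64:
        html = BASE64_IMG_RE.sub("", html)
    return html


raw = (
    '<p>hi</p><svg width="1"><path d="M0"/></svg>'
    '<img src="data:image/png;base64,AAAA"><img src="a.png">'
)
cleaned = clean_html(raw, clean_svg=True, clean_base64=True)
```

Removing such elements shortens the input, which matters because max_new_tokens and the model context are limited.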

class flexrag.retriever.web_retrievers.JinaReaderLM(cfg)[source]#

Bases: WebReaderBase

The JinaReaderLM downloads and parses the HTML content using the Jina ReaderLM model.

property fields#: The JinaReaderLM will return the raw_content and processed_content fields.

read(resources)[source]#

Parse the retrieved contexts into LLM readable format.

Parameters: resources (list[WebResource]) – Resources sought from the web.
Returns: Contexts that can be fed into the LLM.
Return type: list[RetrievedContext]

class flexrag.retriever.web_retrievers.ScreenshotWebReader(cfg)[source]#

Bases: WebReaderBase

The ScreenshotWebReader reads the web pages by taking screenshots.

property fields#: The ScreenshotWebReader will return the screenshot field.

read(resources)[source]#

Parse the retrieved contexts into LLM readable format.

Parameters: resources (list[WebResource]) – Resources sought from the web.
Returns: Contexts that can be fed into the LLM.
Return type: list[RetrievedContext]

class flexrag.retriever.web_retrievers.ScreenshotWebReaderConfig(allow_parallel=True, headless=True, browser='chromium', device='Desktop Chrome', page_width=None, page_height=None, proxy=None, return_screenshot=True)[source]#

The configuration for the ScreenshotWebReader.

dump(path)#: Dump the dataclass to a YAML file.

dumps()#: Dump the dataclass to a YAML string.

classmethod load(path)#: Load the dataclass from a YAML file.

classmethod loads(s)#: Load the dataclass from a YAML string.

class flexrag.retriever.web_retrievers.SnippetWebReader[source]#

The SnippetWebReader will return the snippet of the resource directly.

This is useful if the resources are retrieved by the SearchEngine, and the snippets are sufficient for the LLM to generate the response.

property fields#: The SnippetWebReader will return the snippet field.

read(resources)[source]#

Parse the retrieved contexts into LLM readable format.

Parameters: resources (list[WebResource]) – Resources sought from the web.
Returns: Contexts that can be fed into the LLM.
Return type: list[RetrievedContext]
