Retrievers

Retrievers are used to retrieve data from the local knowledge base or the web.

The Retriever Interface#

RetrieverBase is the base class for all retrievers, including the subclasses of EditableRetriever and WebRetrieverBase.

class flexrag.retriever.RetrieverBaseConfig(log_interval=100, top_k=10, batch_size=32, query_preprocess_pipeline=<factory>)[source]#

Base configuration class for all retrievers.

Parameters:

log_interval (int) -- The logging interval. Default: 100.
top_k (int) -- The number of documents to retrieve per query. Default: 10.
batch_size (int) -- The batch size for retrieval. Default: 32.
query_preprocess_pipeline (TextProcessPipelineConfig) -- The text processing pipeline applied to queries. Default: TextProcessPipelineConfig.

dump(path)#: Dump the dataclass to a YAML file.

dumps()#: Dump the dataclass to a YAML string.

classmethod load(path)#: Load the dataclass from a YAML file.

classmethod loads(s)#: Load the dataclass from a YAML string.

class flexrag.retriever.RetrieverBase(cfg)[source]#

The base class for all retrievers. Subclasses should implement the search method and the fields property.

async async_search(query, **search_kwargs)[source]#: Search queries asynchronously.

abstract property fields#: The fields of the retrieved data.

abstract search(query, **search_kwargs)[source]#

Search a batch of queries.

Parameters:

query (list[Any] | Any) -- Queries to search.
search_kwargs (Any) -- Additional keyword arguments for the search.

Returns:

A batch of lists, one per query, each containing k RetrievedContext items.

Return type:

list[list[RetrievedContext]]
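
As a rough illustration of the search contract described above, the sketch below uses a hypothetical ToyRetriever that is not part of FlexRAG. It accepts either a single query or a list of queries and always returns one list of hits per query. The dict-based hits and the word-overlap scoring are assumptions made for the example; they stand in for RetrievedContext and for a real ranking function.

```python
from typing import Any


class ToyRetriever:
    """Hypothetical stand-in for a RetrieverBase subclass (not FlexRAG code)."""

    def __init__(self, passages: list[str], top_k: int = 10) -> None:
        self.passages = passages
        self.top_k = top_k

    @property
    def fields(self) -> list[str]:
        # The fields carried by each retrieved item.
        return ["text"]

    def search(self, query: Any, **search_kwargs: Any) -> list[list[dict]]:
        # Accept a single query or a batch, and always return a batch.
        queries = query if isinstance(query, list) else [query]
        top_k = search_kwargs.get("top_k", self.top_k)
        results = []
        for q in queries:
            terms = set(str(q).lower().split())
            # Score each passage by the number of query terms it shares.
            scored = [(len(terms & set(p.lower().split())), p) for p in self.passages]
            scored = [item for item in scored if item[0] > 0]
            scored.sort(key=lambda item: item[0], reverse=True)
            results.append([{"text": p, "score": float(s)} for s, p in scored[:top_k]])
        return results


retriever = ToyRetriever(["the cat sat", "dogs bark loudly", "the cat and the dog"])
hits = retriever.search("the cat", top_k=2)
```

Here hits holds a single inner list because one query was passed; passing a list of two queries would yield two inner lists.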

test_speed(sample_num=10000, test_times=10, **search_kwargs)[source]#

Test the speed of the retriever.

Parameters:

sample_num (int, optional) -- The number of samples to test.
test_times (int, optional) -- The number of times to test.

Returns:

The time consumed for retrieval.

Return type:

float

class flexrag.retriever.RetrieverConfig(retriever_type=None, elastic_config=<factory>, flex_config=<factory>, hyde_config=<factory>, typesense_config=<factory>, simple_web_config=<factory>, wikipedia_config=<factory>)#

Configuration class for retriever (name: RetrieverConfig, default: None).

Parameters:

retriever_type (str) -- The retriever type to use.
elastic_config (ElasticRetrieverConfig) -- The config for ElasticRetriever.
flex_config (FlexRetrieverConfig) -- The config for FlexRetriever.
hyde_config (HydeRetrieverConfig) -- The config for HydeRetriever.
typesense_config (TypesenseRetrieverConfig) -- The config for TypesenseRetriever.
simple_web_config (SimpleWebRetrieverConfig) -- The config for SimpleWebRetriever.
wikipedia_config (WikipediaRetrieverConfig) -- The config for WikipediaRetriever.

RetrieverConfig is the general configuration for all registered retrievers. You can load any retriever by setting retriever_type to its registered name. For example, to load a pre-built FlexRetriever, you can use the following configuration:

from flexrag.retriever import RetrieverConfig, RETRIEVERS, FlexRetrieverConfig

config = RetrieverConfig(
    retriever_type='flex',
    flex_config=FlexRetrieverConfig(
        retriever_path='<path_to_retriever>',
    )
)
retriever = RETRIEVERS.load(config)

Editable Retriever#

class flexrag.retriever.EditableRetrieverConfig(log_interval=100, top_k=10, batch_size=32, query_preprocess_pipeline=<factory>)[source]#

Configuration class for EditableRetriever.

dump(path)#: Dump the dataclass to a YAML file.

dumps()#: Dump the dataclass to a YAML string.

classmethod load(path)#: Load the dataclass from a YAML file.

classmethod loads(s)#: Load the dataclass from a YAML string.

class flexrag.retriever.EditableRetriever(cfg)[source]#

Bases: RetrieverBase

The base class for all editable retrievers. In FlexRAG, an EditableRetriever is a retriever that provides the add_passages and clear methods, allowing you to build the retriever from your own knowledge base. FlexRAG provides the following editable retrievers: FlexRetriever, ElasticRetriever, TypesenseRetriever, and HydeRetriever. Subclasses should implement the add_passages, clear, and __len__ methods.

abstract add_passages(passages)[source]#

Add passages to the retriever database.

Parameters: passages (Iterable[Context]) -- The passages to add.
Returns: None

abstract clear()[source]#: Clear the retriever database.
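
To make the three required methods concrete, here is a minimal sketch of an editable store. InMemoryEditableRetriever is a hypothetical class written for this example; it does not inherit from FlexRAG's EditableRetriever, and plain dicts stand in for Context objects.

```python
from typing import Iterable


class InMemoryEditableRetriever:
    """Hypothetical editable retriever that keeps passages in a list."""

    def __init__(self) -> None:
        self._passages: list[dict] = []

    def add_passages(self, passages: Iterable[dict]) -> None:
        # Consume the iterable item by item so generators work as well as lists.
        for passage in passages:
            self._passages.append(passage)

    def clear(self) -> None:
        # Drop every stored passage.
        self._passages.clear()

    def __len__(self) -> int:
        # The number of passages currently stored.
        return len(self._passages)


store = InMemoryEditableRetriever()
store.add_passages({"text": f"passage {i}"} for i in range(3))
size_after_add = len(store)
store.clear()
size_after_clear = len(store)
```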

class flexrag.retriever.ElasticRetrieverConfig(log_interval=100, top_k=10, batch_size=32, query_preprocess_pipeline=<factory>, host='http://localhost:9200', api_key=None, index_name=None, custom_properties=None, verbose=False, retry_times=3, retry_delay=0.5)[source]#

Configuration class for ElasticRetriever.

Parameters:

host (str) -- Host of the ElasticSearch server. Default: "http://localhost:9200".
api_key (Optional[str]) -- API key for the ElasticSearch server. Default: None.
index_name (str) -- Name of the index. Required.
custom_properties (Optional[dict]) -- Custom properties for building the index. Default: None.
verbose (bool) -- Enable verbose logging mode. Default: False.
retry_times (int) -- Number of retry times. Default: 3.
retry_delay (float) -- Delay time for retry. Default: 0.5.

dump(path)#: Dump the dataclass to a YAML file.

dumps()#: Dump the dataclass to a YAML string.

classmethod load(path)#: Load the dataclass from a YAML file.

classmethod loads(s)#: Load the dataclass from a YAML string.

class flexrag.retriever.ElasticRetriever(cfg)[source]#

Bases: EditableRetriever

add_passages(**kwargs)#

Add passages to the retriever database.

Parameters: passages (Iterable[Context]) -- The passages to add.
Returns: None

clear()[source]#: Clear the retriever database.

property fields#: The fields of the retrieved data.

search(**kwargs)#

Search a batch of queries.

Parameters:

query (list[Any] | Any) -- Queries to search.
search_kwargs (Any) -- Additional keyword arguments for the search.

Returns:

A batch of lists, one per query, each containing k RetrievedContext items.

Return type:

list[list[RetrievedContext]]

class flexrag.retriever.TypesenseRetrieverConfig(log_interval=100, top_k=10, batch_size=32, query_preprocess_pipeline=<factory>, host='localhost', port=8108, protocol='http', api_key=None, index_name=None, timeout=200.0)[source]#

Configuration class for TypesenseRetriever.

Parameters:

host (str) -- Host of the Typesense server. Default: "localhost".
port (int) -- Port of the Typesense server. Default: 8108.
protocol (str) -- Protocol of the Typesense server. Default: "http". Available options: "https", "http".
api_key (str) -- API key for the Typesense server. Required.
index_name (str) -- Name of the Typesense collection. Required.
timeout (float) -- Timeout for the connection. Default: 200.0.

dump(path)#: Dump the dataclass to a YAML file.

dumps()#: Dump the dataclass to a YAML string.

classmethod load(path)#: Load the dataclass from a YAML file.

classmethod loads(s)#: Load the dataclass from a YAML string.

class flexrag.retriever.TypesenseRetriever(cfg)[source]#

Bases: EditableRetriever

add_passages(**kwargs)#

Add passages to the retriever database.

Parameters: passages (Iterable[Context]) -- The passages to add.
Returns: None

clear()[source]#: Clear the retriever database.

property fields#: The fields of the retrieved data.

search(**kwargs)#

Search a batch of queries.

Parameters:

query (list[Any] | Any) -- Queries to search.
search_kwargs (Any) -- Additional keyword arguments for the search.

Returns:

A batch of lists, one per query, each containing k RetrievedContext items.

Return type:

list[list[RetrievedContext]]

class flexrag.retriever.LocalRetrieverConfig(log_interval=100, top_k=10, batch_size=32, query_preprocess_pipeline=<factory>, retriever_path=None)[source]#

The configuration class for LocalRetriever.

Parameters: retriever_path (Optional[str]) -- The path to the local database. Default: None. If specified, every modification to the retriever is also written to disk. If not specified, the retriever is kept in memory.

dump(path)#: Dump the dataclass to a YAML file.

dumps()#: Dump the dataclass to a YAML string.

classmethod load(path)#: Load the dataclass from a YAML file.

classmethod loads(s)#: Load the dataclass from a YAML string.

class flexrag.retriever.LocalRetriever(cfg)[source]#

Bases: EditableRetriever

The base class for all local retrievers.

In FlexRAG, a LocalRetriever is a retriever that can be saved to the local disk. Subclasses provide the save_to_local and load_from_local methods to save the retriever to and load it from the local disk, and the save_to_hub and load_from_hub methods to save it to and load it from the HuggingFace Hub.

FlexRAG provides the following local retrievers: FlexRetriever and HydeRetriever.

For example, to load a retriever hosted on the HuggingFace Hub, you can run the following code:

from flexrag.retriever import LocalRetriever

retriever = LocalRetriever.load_from_hub("flexrag/wiki2021_atlas_bm25s")

To save a retriever to the HuggingFace Hub, you can run the following code:

retriever.save_to_hub("<your-repo-id>", token="<your-token>")

abstract detach()[source]#: Detach the retriever from the local database. After detaching, the retriever is kept in memory and further modifications are not written to disk.

static load_from_hub(repo_id, revision=None, token=None, cache_dir='/home/docs/.cache/flexrag', **kwargs)[source]#

Load a retriever from the HuggingFace Hub.

Parameters:

repo_id (str) -- The repo id of the retriever on the HuggingFace Hub.
revision (str) -- The revision of the retriever on the HuggingFace Hub. Default: None.
token (str) -- The token to access the HuggingFace Hub. Default: None.
cache_dir (str) -- The cache directory to store the retriever. Default: FLEXRAG_CACHE_DIR.
kwargs (Any) -- Additional arguments for the retriever.

Returns:

The loaded retriever.

Return type:

LocalRetriever

static load_from_local(repo_path=None, **kwargs)[source]#

Load a retriever from the local disk.

Parameters: repo_path (str) -- The path to the local database. Default: None.
Returns: The loaded retriever.
Return type: LocalRetriever

save_to_hub(repo_id, token=None, commit_message='Update FlexRAG retriever', retriever_card=None, private=False, **kwargs)[source]#

Save the retriever to the HuggingFace Hub.

Parameters:

repo_id (str) -- The repo id of the retriever on the HuggingFace Hub.
token (str) -- The token to access the HuggingFace Hub. Default: None.
commit_message (str) -- The commit message for the retriever. Default: "Update FlexRAG retriever".
retriever_card (str) -- The markdown readme file for the retriever. Default: None.
private (bool) -- Whether to create a private repo. Default: False.
kwargs (Any) -- Additional arguments for uploading the retriever.

Returns:

The repo url of the retriever.

Return type:

str

abstract save_to_local(retriever_path=None)[source]#

Save the retriever to the local disk.

Parameters: retriever_path (str) -- The path to the local database. Default: None.
Returns: None
Return type: None

class flexrag.retriever.FlexRetrieverConfig(log_interval=100, top_k=10, batch_size=32, query_preprocess_pipeline=<factory>, retriever_path=None, indexes_merge_method='rrf', indexes_merge_weights=None, used_indexes=None, rrf_base=60)[source]#

Configuration class for FlexRetriever.

Parameters:

indexes_merge_method (str) -- Method to merge the scores of multiple indexes. Available choices are "rrf" and "linear". Default is "rrf".
* "rrf": Reciprocal Rank Fusion (RRF) method.
* "linear": Linear combination of the scores.
indexes_merge_weights (Optional[list[float]]) -- List of weights for each index. Default is None. If None, all indexes will be treated equally. This option is used in both "rrf" and "linear" methods.
used_indexes (Optional[list[str]]) -- List of indexes to use for retrieval. Default is None. If None, all indexes will be used.
rrf_base (int) -- Base for the RRF method. Default is 60. This option is only used when indexes_merge_method is "rrf".
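
The two merge methods can be sketched in a few lines of plain Python. This is the textbook form of weighted Reciprocal Rank Fusion and of a weighted linear combination, not FlexRAG's implementation: the function names rrf_merge and linear_merge are hypothetical, and the choices to start ranks at 1 and to multiply each term by its weight are assumptions.

```python
def rrf_merge(rankings, weights=None, rrf_base=60):
    """Fuse ranked id lists: score(d) = sum_i weight_i / (rrf_base + rank_i(d))."""
    if weights is None:
        weights = [1.0] * len(rankings)  # treat all indexes equally
    fused = {}
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):  # assumed: ranks start at 1
            fused[doc_id] = fused.get(doc_id, 0.0) + weight / (rrf_base + rank)
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


def linear_merge(score_maps, weights=None):
    """Fuse per-index score dicts: score(d) = sum_i weight_i * score_i(d)."""
    if weights is None:
        weights = [1.0] * len(score_maps)
    fused = {}
    for scores, weight in zip(score_maps, weights):
        for doc_id, score in scores.items():
            fused[doc_id] = fused.get(doc_id, 0.0) + weight * score
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


# Two indexes that rank the same three passages differently.
sparse_ranking = ["a", "b", "c"]
dense_ranking = ["b", "c", "a"]
fused_ranking = rrf_merge([sparse_ranking, dense_ranking])
```

RRF only looks at ranks, so it is insensitive to the score scales of the individual indexes, whereas a linear combination is only meaningful when the scores of the indexes are comparable.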

dump(path)#: Dump the dataclass to a YAML file.

dumps()#: Dump the dataclass to a YAML string.

classmethod load(path)#: Load the dataclass from a YAML file.

classmethod loads(s)#: Load the dataclass from a YAML string.

class flexrag.retriever.FlexRetriever(cfg)[source]#

Bases: LocalRetriever

FlexRetriever is a retriever implemented by the FlexRAG team. It supports multi-index and multi-field retrieval.

add_passages(**kwargs)#

Add passages to the retriever database.

Parameters: passages (Iterable[Context]) -- The passages to add.
Returns: None

clear()[source]#: Clear the retriever database.

detach()[source]#: Detach the retriever from the local disk to memory. This function will not delete the database or the indexes.

property fields#: The fields of the retrieved data.

remove_index(index_name)[source]#

Remove an index from the retriever.

Parameters: index_name (str) -- Name of the index.
Raises: ValueError -- If the index name does not exist.
Returns: None
Return type: None

save_to_local(retriever_path=None)[source]#

Save the retriever to the local disk.

Parameters: retriever_path (str) -- The path to the local database. Default: None.
Returns: None
Return type: None

search(**kwargs)#

Search a batch of queries.

Parameters:

query (list[Any] | Any) -- Queries to search.
search_kwargs (Any) -- Additional keyword arguments for the search.

Returns:

A batch of lists, one per query, each containing k RetrievedContext items.

Return type:

list[list[RetrievedContext]]

class flexrag.retriever.HydeRetrieverConfig(generator_type=None, anthropic_config=<factory>, hf_config=<factory>, hf_vlm_config=<factory>, ollama_config=<factory>, openai_config=<factory>, vllm_config=<factory>, log_interval=100, top_k=10, batch_size=32, query_preprocess_pipeline=<factory>, retriever_path=None, indexes_merge_method='rrf', indexes_merge_weights=None, used_indexes=None, rrf_base=60, task='WEB_SEARCH', language='en')#

Configuration class for HydeRetriever.

Parameters:

task (str) -- Task for rewriting the query. Default: "WEB_SEARCH". Available options: "WEB_SEARCH", "SCIFACT", "ARGUANA", "TREC_COVID", "FIQA", "DBPEDIA_ENTITY", "TREC_NEWS", "MR_TYDI".
language (str) -- Language for rewriting. Default: "en".

dump(path)#: Dump the dataclass to a YAML file.

dumps()#: Dump the dataclass to a YAML string.

classmethod load(path)#: Load the dataclass from a YAML file.

classmethod loads(s)#: Load the dataclass from a YAML string.

class flexrag.retriever.HydeRetriever(cfg, no_check=False)#

Bases: FlexRetriever

HydeRetriever is a retriever that rewrites the query before searching.

The original paper is available at https://aclanthology.org/2023.acl-long.99/.

search(query, **search_kwargs)#

Search a batch of queries.

Parameters:

query (list[Any] | Any) -- Queries to search.
search_kwargs (Any) -- Additional keyword arguments for the search.

Returns:

A batch of lists, one per query, each containing k RetrievedContext items.

Return type:

list[list[RetrievedContext]]

Retriever Index#

RetrieverIndex is used in FlexRetriever to index the knowledge base and search it, either with dense embeddings (FaissIndex, ScaNNIndex) or with BM25 scoring (BM25Index).

class flexrag.retriever.index.RetrieverIndexBase[source]#

The base class for all retriever indexes. This class provides the basic interface for building, adding, and searching the index.

The subclass should implement the following methods:

- build_index: Build the index from the data.
- insert: Add a batch of data to the index.
- search: Search for the top_k most similar data indices to the query.
- serialize: Serialize the index to the disk.
- clear: Clear the index and remove the serialized index files.
- __len__: Return the number of data in the index.
- is_addable: Return whether the index is addable.

abstract build_index(data)[source]#

Build the index. The index will be serialized automatically if the index_path is set.

Parameters: data (Iterable[Any]) -- The data to build the index.
Returns: None

abstract clear()[source]#: Reset the index and remove the serialized index files.

abstract property infimum#: Return the infimum of the similarity scores for the index.

abstract insert(data, serialize=True)[source]#

Add a batch of data to the index.

Parameters:

data (list[Any]) -- The data to add.
serialize (bool) -- Whether to serialize the index after adding data. Defaults to True.

Returns:

None

insert_batch(data, batch_size=None, serialize=True)[source]#

Add data to the index in batches. The index is serialized automatically afterwards if the index_path is set.

Parameters:

data (Iterable[Any]) -- The data to add.
batch_size (int) -- The batch size to add data to the index. Defaults to self.batch_size.
serialize (bool) -- Whether to serialize the index after adding data. Defaults to True.

Returns:

None
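
The relationship between insert and insert_batch can be sketched as follows. ListIndex and iter_batches are hypothetical names used only for this example, and serializing once after the last batch (rather than after every batch) is an assumption about a sensible implementation, not a statement about FlexRAG's code.

```python
from itertools import islice
from typing import Any, Iterable, Iterator


def iter_batches(data: Iterable[Any], batch_size: int) -> Iterator[list[Any]]:
    """Yield consecutive lists of at most batch_size items from any iterable."""
    iterator = iter(data)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


class ListIndex:
    """Hypothetical index that records inserted items and serialization calls."""

    def __init__(self, batch_size: int = 512) -> None:
        self.batch_size = batch_size
        self.items: list[Any] = []
        self.serialize_calls = 0

    def save_to_local(self) -> None:
        # A real index would write its files to index_path here.
        self.serialize_calls += 1

    def insert(self, data: list[Any], serialize: bool = True) -> None:
        self.items.extend(data)
        if serialize:
            self.save_to_local()

    def insert_batch(self, data: Iterable[Any], batch_size=None, serialize: bool = True) -> None:
        batch_size = batch_size or self.batch_size
        for batch in iter_batches(data, batch_size):
            # Defer serialization until every batch has been added.
            self.insert(batch, serialize=False)
        if serialize:
            self.save_to_local()


index = ListIndex(batch_size=4)
index.insert_batch(range(10))
```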

static load_from_local(index_path)[source]#

Load the index from the local path.

Parameters: index_path (str) -- The path to load the index.

abstract save_to_local(index_path=None)[source]#

Serialize the index to self.index_path. If the index_path is given, the index will be serialized to the index_path.

Parameters: index_path (str, optional) -- The path to serialize the index. Defaults to self.index_path.

abstract search(query, top_k, **search_kwargs)[source]#

Search for the top_k most similar data indices to the query.

Parameters:

query (list[Any]) -- The query data.
top_k (int, optional) -- The number of most similar data indices to return, defaults to 10.
search_kwargs (Any) -- Additional search arguments.

Returns:

The indices and scores of the top_k most similar data indices.

Return type:

tuple[np.ndarray, np.ndarray]
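
The shape of the search result, one row of indices and one row of scores per query, can be illustrated with an exhaustive inner-product search. brute_force_search is a hypothetical helper written for this example; it returns nested Python lists where the documented interface returns NumPy arrays, and it is far slower than a real index.

```python
import heapq


def brute_force_search(queries, vectors, top_k=10):
    """Return (indices, scores) of the top_k vectors by inner product, per query."""
    all_indices, all_scores = [], []
    for query in queries:
        # Inner product between the query and every stored vector.
        scores = [sum(a * b for a, b in zip(query, vector)) for vector in vectors]
        # Positions of the top_k highest scores, best first.
        best = heapq.nlargest(top_k, range(len(vectors)), key=scores.__getitem__)
        all_indices.append(best)
        all_scores.append([scores[i] for i in best])
    return all_indices, all_scores


vectors = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
indices, scores = brute_force_search([[1.0, 0.2]], vectors, top_k=2)
```

For the single query above, the first and third vectors score highest, so indices is [[0, 2]].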

abstract property supremum#: Return the supremum of the similarity scores for the index.

class flexrag.retriever.index.RetrieverIndexBaseConfig(log_interval=10000, batch_size=512, index_path=None)[source]#

The configuration for the RetrieverIndexBase.

Parameters:

log_interval (int) -- The interval to log the progress. Defaults to 10000.
batch_size (int) -- The batch size to add data to the index. Defaults to 512.
index_path (Optional[str]) -- The path to save the index. If not specified, the index will be kept in memory. Defaults to None.

dump(path)#: Dump the dataclass to a YAML file.

dumps()#: Dump the dataclass to a YAML string.

classmethod load(path)#: Load the dataclass from a YAML file.

classmethod loads(s)#: Load the dataclass from a YAML string.

class flexrag.retriever.index.RetrieverIndexConfig(index_type='faiss', bm25_config=<factory>, faiss_config=<factory>, scann_config=<factory>)#

Configuration class for index (name: RetrieverIndexConfig, default: faiss).

Parameters:

index_type (str) -- The index type to use.
bm25_config (BM25IndexConfig) -- The config for BM25Index.
faiss_config (FaissIndexConfig) -- The config for FaissIndex.
scann_config (ScaNNIndexConfig) -- The config for ScaNNIndex.

RetrieverIndexConfig is the general configuration for all registered RetrieverIndexes. You can load any RetrieverIndex by specifying the index_type in the configuration. For example, to load the BM25Index, you can use the following configuration:

from flexrag.retriever.index import RetrieverIndexConfig, RETRIEVER_INDEX, BM25IndexConfig

config = RetrieverIndexConfig(
    index_type='bm25',
    bm25_config=BM25IndexConfig(
        index_path='<path_to_index>',
    )
)
index = RETRIEVER_INDEX.load(config)

class flexrag.retriever.index.FaissIndexConfig(log_interval=10000, batch_size=512, index_path=None, query_encoder_config=<factory>, passage_encoder_config=<factory>, distance_function='IP', index_type='auto', n_subquantizers=8, n_bits=8, n_list=1000, factory_str=None, index_train_num=-1, n_probe=None, device_id=<factory>, k_factor=10, polysemous_ht=0, efSearch=100)[source]#

The configuration for the FaissIndex.

Parameters:

index_type (str) -- Building param: the type of the index. Defaults to "auto". Available choices are "FLAT", "IVF", "PQ", "IVFPQ", and "auto". If set to "auto", the index will be set to "IVF{n_list},PQ{embedding_size//2}x4fs".
n_subquantizers (int) -- Building param: the number of subquantizers. Defaults to 8. This parameter is only used when the index type is "PQ" or "IVFPQ".
n_bits (int) -- Building param: the number of bits per subquantizer. Defaults to 8. This parameter is only used when the index type is "PQ" or "IVFPQ".
n_list (int) -- Building param: the number of cells. Defaults to 1000. This parameter is only used when the index type is "IVF" or "IVFPQ".
factory_str (Optional[str]) -- Building param: the factory string to build the index. Defaults to None. If set, the index_type will be ignored.
index_train_num (int) -- Building param: the number of data used to train the index. Defaults to -1. If set to -1, all data will be used to train the index.
n_probe (Optional[int]) -- Inference param: the number of probes. Defaults to None. If not set, the number of probes will be set to n_list // 8. This parameter is only used when the index type is "IVF" or "IVFPQ".
device_id (list[int]) -- Inference param: the device(s) to use. Defaults to []. [] means CPU. If set, the index will be accelerated with GPU.
k_factor (int) -- Inference param: the k factor for search. Defaults to 10.
polysemous_ht (int) -- Inference param: the polysemous hash table. Defaults to 0.
efSearch (int) -- Inference param: the efSearch for HNSW. Defaults to 100.
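
Two of the defaults above are derived from other values, which a short sketch can make explicit. The helper names auto_factory_string and default_n_probe are hypothetical; they only reproduce the rules stated in the parameter descriptions and do not call faiss.

```python
def auto_factory_string(n_list: int, embedding_size: int) -> str:
    """Build the factory string documented for index_type="auto"."""
    return f"IVF{n_list},PQ{embedding_size // 2}x4fs"


def default_n_probe(n_list: int, n_probe=None) -> int:
    """Apply the documented fallback: n_list // 8 when n_probe is not set."""
    return n_probe if n_probe is not None else n_list // 8


factory = auto_factory_string(n_list=1000, embedding_size=768)  # "IVF1000,PQ384x4fs"
probes = default_n_probe(n_list=1000)  # 125
```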

dump(path)#: Dump the dataclass to a YAML file.

dumps()#: Dump the dataclass to a YAML string.

classmethod load(path)#: Load the dataclass from a YAML file.

classmethod loads(s)#: Load the dataclass from a YAML string.

class flexrag.retriever.index.FaissIndex(cfg)[source]#

Bases: DenseIndexBase

FaissIndex uses the faiss library to build and search indexes over embeddings. It runs on CPU and can be accelerated with GPUs. Supported index types are FLAT, IVF, PQ, IVFPQ, and auto.

add_embeddings(embeddings)[source]#

A helper function that adds embeddings to the index.

Parameters: embeddings (np.ndarray) -- The embeddings to add.
Returns: None

build_index(data)[source]#

Build the index. The index will be serialized automatically if the index_path is set.

Parameters: data (Iterable[Any]) -- The data to build the index.
Returns: None

clear()[source]#: Reset the index and remove the serialized index files.

property embedding_size#: Return the embedding size of the index.

save_to_local(index_path=None)[source]#

Serialize the index to self.index_path. If the index_path is given, the index will be serialized to the index_path.

Parameters: index_path (str, optional) -- The path to serialize the index. Defaults to self.index_path.

search(query, top_k, **search_kwargs)[source]#

Search for the top_k most similar data indices to the query.

Parameters:

query (list[Any]) -- The query data.
top_k (int, optional) -- The number of most similar data indices to return, defaults to 10.
search_kwargs (Any) -- Additional search arguments.

Returns:

The indices and scores of the top_k most similar data indices.

Return type:

tuple[np.ndarray, np.ndarray]

class flexrag.retriever.index.ScaNNIndexConfig(log_interval=10000, batch_size=512, index_path=None, query_encoder_config=<factory>, passage_encoder_config=<factory>, distance_function='IP', num_leaves=2000, num_leaves_to_search=500, num_neighbors=10, anisotropic_quantization_threshold=0.2, dimensions_per_block=2, threads=0, index_train_num=0)[source]#

The configuration for the ScaNNIndex.

Parameters:

num_leaves (int) -- The number of leaves in the tree. Defaults to 2000.
num_leaves_to_search (int) -- The number of leaves to search. Defaults to 500.
num_neighbors (int) -- The number of neighbors to search. Defaults to 10.
anisotropic_quantization_threshold (float) -- The anisotropic quantization threshold. Defaults to 0.2.
dimensions_per_block (int) -- The number of dimensions per block. Defaults to 2.
threads (int) -- The number of threads to use. Defaults to 0 (auto).
index_train_num (int) -- The number of samples to train the index. Defaults to 0 (all).

dump(path)#: Dump the dataclass to a YAML file.

dumps()#: Dump the dataclass to a YAML string.

classmethod load(path)#: Load the dataclass from a YAML file.

classmethod loads(s)#: Load the dataclass from a YAML string.

class flexrag.retriever.index.ScaNNIndex(cfg)[source]#

Bases: DenseIndexBase

ScaNNIndex is a wrapper for the ScaNN library.

ScaNNIndex runs on CPUs with both high speed and accuracy. However, it requires more memory than FaissIndex.

add_embeddings(embeddings)[source]#

A helper function that adds embeddings to the index.

Parameters: embeddings (np.ndarray) -- The embeddings to add.
Returns: None

build_index(data)[source]#

Build the index. The index will be serialized automatically if the index_path is set.

Parameters: data (Iterable[Any]) -- The data to build the index.
Returns: None

clear()[source]#: Reset the index and remove the serialized index files.

property embedding_size#: Return the embedding size of the index.

save_to_local(index_path=None)[source]#

Serialize the index to self.index_path. If the index_path is given, the index will be serialized to the index_path.

Parameters: index_path (str, optional) -- The path to serialize the index. Defaults to self.index_path.

search(query, top_k, **search_kwargs)[source]#

Search for the top_k most similar data indices to the query.

Parameters:

query (list[Any]) -- The query data.
top_k (int, optional) -- The number of most similar data indices to return, defaults to 10.
search_kwargs (Any) -- Additional search arguments.

Returns:

The indices and scores of the top_k most similar data indices.

Return type:

tuple[np.ndarray, np.ndarray]

class flexrag.retriever.index.BM25IndexConfig(log_interval=10000, batch_size=512, index_path=None, method='lucene', idf_method=None, backend='auto', k1=1.5, b=0.75, delta=0.5, lang='english')[source]#

Configuration class for BM25Index.

Parameters:

method (str) -- BM25S method. Default: "lucene". Available options: "atire", "bm25l", "bm25+", "lucene", "robertson".
idf_method (Optional[str]) -- IDF method. Default: None. Available options: "atire", "bm25l", "bm25+", "lucene", "robertson".
backend (str) -- Backend for BM25S. Default: "auto". Available options: "numpy", "numba", "auto".
k1 (float) -- BM25S parameter k1. Default: 1.5.
b (float) -- BM25S parameter b. Default: 0.75.
delta (float) -- BM25S parameter delta. Default: 0.5.
lang (str) -- Language for Tokenization. Default: "english".
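
To show what k1 and b control, the sketch below scores documents with a textbook Lucene-style BM25 formula in plain Python. It is not the bm25s implementation: bm25_scores is a hypothetical function, tokenization is a simple whitespace split, and delta is left out because this variant of the formula does not use it.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with a Lucene-style BM25 formula."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Number of documents that contain each term at least once.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # b controls length normalization; k1 controls term-frequency saturation.
        norm = k1 * (1 - b + b * len(tokens) / avg_len)
        score = 0.0
        for term in query.lower().split():
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            score += idf * tf[term] / (tf[term] + norm)
        scores.append(score)
    return scores


docs = ["the cat sat on the mat", "dogs chase cats", "the quick brown fox"]
scores = bm25_scores("cat mat", docs)
```

Only the first document contains the query terms, so it is the only one with a non-zero score; "cats" does not match "cat" because no stemming is applied in this sketch.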

dump(path)#: Dump the dataclass to a YAML file.

dumps()#: Dump the dataclass to a YAML string.

classmethod load(path)#: Load the dataclass from a YAML file.

classmethod loads(s)#: Load the dataclass from a YAML string.

class flexrag.retriever.index.BM25Index(cfg)[source]#

Bases: RetrieverIndexBase

BM25Index is an index that retrieves passages using the BM25 algorithm. The implementation is based on the bm25s project.

build_index(data)[source]#

Build the index. The index will be serialized automatically if the index_path is set.

Parameters: data (Iterable[Any]) -- The data to build the index.
Returns: None

clear()[source]#: Reset the index and remove the serialized index files.

property infimum#: Return the infimum of the similarity scores for the index.

insert(data)[source]#

Add a batch of data to the index.

Parameters:

data (list[Any]) -- The data to add.
serialize (bool) -- Whether to serialize the index after adding data. Defaults to True.

Returns:

None

save_to_local(index_path=None)[source]#

Serialize the index to self.index_path. If the index_path is given, the index will be serialized to the index_path.

Parameters: index_path (str, optional) -- The path to serialize the index. Defaults to self.index_path.

search(query, top_k, **search_kwargs)[source]#

Search for the top_k most similar data indices to the query.

Parameters:

query (list[Any]) -- The query data.
top_k (int, optional) -- The number of most similar data indices to return, defaults to 10.
search_kwargs (Any) -- Additional search arguments.

Returns:

The indices and scores of the top_k most similar data indices.

Return type:

tuple[np.ndarray, np.ndarray]

property supremum#: Return the supremum of the similarity scores for the index.

class flexrag.retriever.index.MultiFieldIndexConfig(indexed_fields, merge_method='max')[source]#

Configuration for MultiFieldIndex.

Parameters:

indexed_fields (list[str]) -- Fields to be indexed. If more than one field is specified, each field will be processed separately and pointed to the same id.
merge_method (str) -- The method to merge the scores of the same context id. Defaults to "max". If only one field is specified, this argument will be ignored. Available options:
* "max": take the maximum score of the same context id.
* "sum": take the sum of the scores of the same context id.
* "mean": take the average of the scores of the same context id.
* "concat": concatenate the texts of each field and index them together. Only available if all indexed fields are of type str.
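
The score-level merge methods can be sketched as a grouping step followed by a reduction. merge_field_scores is a hypothetical function written for this example, with hits given as (context_id, score) pairs. "concat" is not covered because, as described above, it combines the field texts before indexing rather than merging scores afterwards.

```python
from collections import defaultdict


def merge_field_scores(hits, merge_method="max"):
    """Merge (context_id, score) hits that share a context id, best first."""
    grouped = defaultdict(list)
    for context_id, score in hits:
        grouped[context_id].append(score)
    if merge_method == "max":
        merged = {cid: max(scores) for cid, scores in grouped.items()}
    elif merge_method == "sum":
        merged = {cid: sum(scores) for cid, scores in grouped.items()}
    elif merge_method == "mean":
        merged = {cid: sum(scores) / len(scores) for cid, scores in grouped.items()}
    else:
        raise ValueError(f"Unsupported merge method for scores: {merge_method}")
    return sorted(merged.items(), key=lambda item: item[1], reverse=True)


# Each context was matched through two fields, for example title and text.
hits = [("doc1", 0.9), ("doc2", 0.6), ("doc1", 0.2), ("doc2", 0.7)]
by_max = merge_field_scores(hits, "max")
by_sum = merge_field_scores(hits, "sum")
```

With these hits, "max" ranks doc1 first because of its single strong match, while "sum" ranks doc2 first because both of its fields match reasonably well.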

dump(path)#: Dump the dataclass to a YAML file.

dumps()#: Dump the dataclass to a YAML string.

classmethod load(path)#: Load the dataclass from a YAML file.

classmethod loads(s)#: Load the dataclass from a YAML string.

class flexrag.retriever.index.MultiFieldIndex(cfg, index)[source]#

Bases: object

A wrapper index for multiple field contexts.

build_index(context_ids, data)[source]#

Build the index. The index will be serialized automatically if the index_path is set.

Parameters:

context_ids (Iterable[str]) -- The context ids of the data.
data (Iterable[dict[str, Any]]) -- The data to build the index.

Returns:

None

clear()[source]#: Clear the index.

insert(context_ids, data, serialize=True)[source]#

Add a batch of data to the index.

Parameters:

context_ids (list[str]) -- The context ids of the data.
data (list[dict[str, Any]]) -- The data to add.
serialize (bool) -- Whether to serialize the index after adding data. Defaults to True.

Returns:

None

insert_batch(context_ids, data, batch_size=None, serialize=True)[source]#

Add data to the index in batches. The index is serialized automatically afterwards if the index_path is set.

Parameters:

context_ids (Iterable[str]) -- The context ids of the data.
data (Iterable[dict[str, Any]]) -- The data to add.
batch_size (int) -- The batch size to add data to the index. Defaults to self.batch_size.
serialize (bool) -- Whether to serialize the index after adding data. Defaults to True.

Returns:

None

property is_addable#: Check if the index is addable.

save_to_local(index_path=None)[source]#

Serialize the index to the given path.

Parameters: index_path (str) -- The path to save the index. If None, the index will be saved to self.index.cfg.index_path.
Returns: None

search(query, top_k, **search_kwargs)[source]#

Search for the top_k most similar data indices to the query.

Parameters:

query (list[Any]) -- The query data.
top_k (int, optional) -- The number of most similar data indices to return, defaults to 10.
search_kwargs (Any) -- Additional search arguments.

Returns:

The indices and scores of the top_k most similar data indices.

Return type:

tuple[list[list[str]], np.ndarray]

search_batch(query, top_k, **search_kwargs)[source]#

Search for the top_k most similar data indices to the query. This method will search the index in batches.

Parameters:

query (list[Any]) -- The query data.
top_k (int, optional) -- The number of most similar data indices to return, defaults to 10.
batch_size (Optional[int]) -- The batch size to search. Defaults to self.batch_size.
search_kwargs (Any) -- Additional search arguments.

Returns:

The indices and scores of the top_k most similar data indices.

Return type:

tuple[list[list[str]], np.ndarray]

Web Retriever#

WebRetriever is used to retrieve data from the web. Unlike an EditableRetriever, a web retriever can be used without building a knowledge base, because it retrieves data through web search engines.

class flexrag.retriever.web_retrievers.WebRetrieverBaseConfig(log_interval=100, top_k=10, batch_size=32, query_preprocess_pipeline=<factory>, retry_times=3, retry_delay=0.5)[source]#

The configuration for the WebRetrieverBase.

Parameters:

retry_times (int) -- The number of times to retry. Default is 3.
retry_delay (float) -- The delay between retries. Default is 0.5.
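
A retry policy driven by these two options might look like the sketch below. call_with_retry is a hypothetical helper, not a FlexRAG function. Treating retry_times as the number of extra attempts after the first failure, and retry_delay as a fixed delay in seconds, are both assumptions.

```python
import time


def call_with_retry(func, retry_times=3, retry_delay=0.5):
    """Call func(); on failure, wait retry_delay seconds and retry up to retry_times times."""
    last_error = None
    for attempt in range(retry_times + 1):
        try:
            return func()
        except Exception as error:  # a real client would catch narrower errors
            last_error = error
            if attempt < retry_times:
                time.sleep(retry_delay)
    raise last_error


calls = {"count": 0}


def flaky_search():
    # Fails twice, then succeeds, to imitate a temporarily unavailable service.
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return ["result"]


outcome = call_with_retry(flaky_search, retry_times=3, retry_delay=0.01)
```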

dump(path)#: Dump the dataclass to a YAML file.

dumps()#: Dump the dataclass to a YAML string.

classmethod load(path)#: Load the dataclass from a YAML file.

classmethod loads(s)#: Load the dataclass from a YAML string.

class flexrag.retriever.web_retrievers.WebRetrieverBase(cfg)[源代码]#

The base class for the WebRetriever.

The WebRetriever is used to retrieve relevant information from the web. The subclasses should implement the search_item method.

async async_search(query, **search_kwargs)#: Search queries asynchronously.

abstract property fields#: The fields of the retrieved data.

search(**kwargs)#

Search a batch of queries.

Parameters:

query (list[Any] | Any) -- Queries to search.
search_kwargs (Any) -- Additional keyword arguments for the search.

Returns:

A batch of lists, each containing k RetrievedContext objects.

Return type:

list[list[RetrievedContext]]

abstract search_item(query, top_k, **search_kwargs)[source]#

Search the web for the query.

Parameters:

query (str) -- The query to search.
top_k (int) -- The number of documents to return.

Returns:

The retrieved contexts.

Return type:

list[RetrievedContext]
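The split between search (batch entry point) and search_item (per-query hook that subclasses implement) can be sketched with stand-in classes. ToyWebRetriever and EchoRetriever below are hypothetical and do not import FlexRAG; plain dicts stand in for RetrievedContext, and the way top_k is read from search_kwargs is an assumption.

```python
from abc import ABC, abstractmethod
from typing import Any


class ToyWebRetriever(ABC):
    """Hypothetical stand-in for a web retriever base class."""

    def __init__(self, top_k: int = 10) -> None:
        self.top_k = top_k

    def search(self, query: Any, **search_kwargs: Any) -> list:
        # Accept a single query or a batch, mirroring `list[Any] | Any`.
        queries = query if isinstance(query, list) else [query]
        top_k = search_kwargs.pop("top_k", self.top_k)
        # One inner list of contexts per query.
        return [self.search_item(q, top_k, **search_kwargs) for q in queries]

    @abstractmethod
    def search_item(self, query: str, top_k: int, **search_kwargs: Any) -> list:
        """Return up to top_k contexts for one query."""


class EchoRetriever(ToyWebRetriever):
    def search_item(self, query: str, top_k: int, **search_kwargs: Any) -> list:
        # Fabricated results; a real subclass would call a search engine here.
        return [{"query": query, "rank": rank} for rank in range(top_k)]


results = EchoRetriever(top_k=2).search("what is RAG?")
batch_results = EchoRetriever(top_k=1).search(["q1", "q2"])
```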

test_speed(sample_num=10000, test_times=10, **search_kwargs)#

Test the speed of the retriever.

Parameters:

sample_num (int, optional) -- The number of samples to test.
test_times (int, optional) -- The number of times to test.

Returns:

The time consumed for retrieval.

Return type:

float

class flexrag.retriever.web_retrievers.WebResource(url, query=None, metadata=<factory>, data=None)[source]#

The web resource dataclass. WebResource is the fundamental unit of information exchange in the web_retrievers module of FlexRAG. The WebSeeker retrieves the WebResource objects that match the user's query, the WebDownloader downloads each resource from the URL in the WebResource and stores it in the data field, and the WebReader then converts the data field into an LLM-friendly format and returns the RetrievedContext.

Parameters:

url (str) -- The URL of the resource.
query (Optional[str]) -- The query for the resource. Default is None.
metadata (dict) -- The metadata of the resource, often provided by the WebSeeker. Default is {}.
data (Any) -- The content of the resource, often filled by the WebDownloader. Default is None.

dump(path)#: Dump the dataclass to a YAML file.

dumps()#: Dump the dataclass to a YAML string.

classmethod load(path)#: Load the dataclass from a YAML file.

classmethod loads(s)#: Load the dataclass from a YAML string.
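The seek, download, read hand-off described for WebResource can be sketched end to end with stand-ins. This is an illustrative assumption, not FlexRAG code: ToyResource only mirrors the four documented fields, the three functions are hypothetical, and the "download" fabricates HTML instead of making an HTTP request.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class ToyResource:
    """Hypothetical mirror of the four documented WebResource fields."""

    url: str
    query: Optional[str] = None
    metadata: dict = field(default_factory=dict)
    data: Any = None


def seek(query: str, top_k: int = 2) -> list:
    # Seeker stage: produce resources with a URL and seeker-provided metadata.
    return [
        ToyResource(url=f"https://example.com/{i}", query=query, metadata={"rank": i})
        for i in range(top_k)
    ]


def download(resources: list) -> list:
    # Downloader stage: fill the data field (fabricated HTML, no network call).
    for res in resources:
        res.data = f"<html><body>Page for {res.url}</body></html>"
    return resources


def read(resources: list) -> list:
    # Reader stage: turn the data field into plain text for the LLM.
    contexts = []
    for res in resources:
        text = res.data.replace("<html><body>", "").replace("</body></html>", "")
        contexts.append({"source": res.url, "text": text})
    return contexts


contexts = read(download(seek("flexrag", top_k=2)))
```

Because every stage consumes and returns the same resource type, seekers, downloaders, and readers can be swapped independently.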

FlexRAG provides two simple web retrievers, SimpleWebRetriever and WikipediaRetriever.

class flexrag.retriever.SimpleWebRetrieverConfig(search_engine_type=None, bing_config=<factory>, ddg_config=<factory>, google_config=<factory>, serpapi_config=<factory>, web_reader_type=None, jina_readerlm_config=<factory>, jina_reader_config=<factory>, screenshot_config=<factory>, log_interval=100, top_k=10, batch_size=32, query_preprocess_pipeline=<factory>, retry_times=3, retry_delay=0.5)[source]#

The configuration for the SimpleWebRetriever.

dump(path)#: Dump the dataclass to a YAML file.

dumps()#: Dump the dataclass to a YAML string.

classmethod load(path)#: Load the dataclass from a YAML file.

classmethod loads(s)#: Load the dataclass from a YAML string.

class flexrag.retriever.SimpleWebRetriever(cfg)[source]#

Bases: WebRetrieverBase

SimpleWebRetriever seeks the most relevant web pages using an existing search engine and reads their content using the WebReader.

property fields#: The fields of the retrieved data.

search_item(query, top_k=10, **search_kwargs)[source]#

Search the web for the query.

Parameters:

query (str) -- The query to search.
top_k (int) -- The number of documents to return.

Returns:

The retrieved contexts.

Return type:

list[RetrievedContext]

class flexrag.retriever.WikipediaRetrieverConfig(log_interval=100, top_k=10, batch_size=32, query_preprocess_pipeline=<factory>, search_url='https://en.wikipedia.org/w/index.php?search=', proxy=None)[source]#

The configuration for the WikipediaRetriever.

Parameters:

search_url (str) -- The search URL for Wikipedia. Default is "https://en.wikipedia.org/w/index.php?search=".
proxy (Optional[str]) -- The proxy to use. Default is None.

dump(path)#: Dump the dataclass to a YAML file.

dumps()#: Dump the dataclass to a YAML string.

classmethod load(path)#: Load the dataclass from a YAML file.

classmethod loads(s)#: Load the dataclass from a YAML string.

class flexrag.retriever.WikipediaRetriever(cfg)[source]#

Bases: RetrieverBase

WikipediaRetriever retrieves information from Wikipedia directly. Adapted from ysymyth/ReAct.

property fields#: The fields of the retrieved data.

search(query, delay=0.1, **search_kwargs)[source]#

Search a batch of queries.

Parameters:

query (list[Any] | Any) -- Queries to search.
search_kwargs (Any) -- Additional keyword arguments for the search.

Returns:

A batch of lists, each containing k RetrievedContext objects.

Return type:

list[list[RetrievedContext]]

Web Seeker#

WebSeeker is used to search the resources from the web for the given query. The web resources could be sought by walking through a set of given web pages, by using a search engine, etc. FlexRAG provides several web seekers using existing search engines.

class flexrag.retriever.web_retrievers.WebSeekerBase[source]#

The base class for the WebSeeker. The WebSeeker is used to seek the web resources for a given query. The web resources could be sought by walking through a set of given web pages, by using a search engine, etc.

The subclasses should implement the seek method.

abstract seek(query, top_k=10, **kwargs)[source]#

Seek the web resources.

Parameters:

query (str) -- The query to seek.
top_k (int) -- The number of resources to seek. Default is 10.
kwargs -- The additional keyword arguments.

Returns:

The web resources.

Return type:

list[WebResource]

class flexrag.retriever.web_retrievers.WebSeekerConfig(web_seeker_type=None, bing_config=<factory>, ddg_config=<factory>, google_config=<factory>, serpapi_config=<factory>)#

Configuration class for web_seeker (name: WebSeekerConfig, default: None).

Parameters:

web_seeker_type (str) -- The web_seeker type to use.
bing_config (BingEngineConfig) -- The config for BingEngine.
ddg_config (DuckDuckGoEngineConfig) -- The config for DuckDuckGoEngine.
google_config (GoogleEngineConfig) -- The config for GoogleEngine.
serpapi_config (SerpApiConfig) -- The config for SerpApi.

WebSeekerConfig is the general configuration for all registered WebSeekers. You can load any WebSeeker by specifying the web_seeker_type in the configuration. For example, to load the DuckDuckGoEngine, you can use the following configuration:

from flexrag.retriever.web_retrievers import WebSeekerConfig, WEB_SEEKERS

config = WebSeekerConfig(
    web_seeker_type='ddg',
)
seeker = WEB_SEEKERS.load(config)

class flexrag.retriever.web_retrievers.SearchEngineConfig(search_engine_type=None, bing_config=<factory>, ddg_config=<factory>, google_config=<factory>, serpapi_config=<factory>)#

Configuration class for search_engine (name: SearchEngineConfig, default: None).

Parameters:

search_engine_type (str) -- The search_engine type to use.
bing_config (BingEngineConfig) -- The config for BingEngine.
ddg_config (DuckDuckGoEngineConfig) -- The config for DuckDuckGoEngine.
google_config (GoogleEngineConfig) -- The config for GoogleEngine.
serpapi_config (SerpApiConfig) -- The config for SerpApi.

SearchEngine is a type of WebSeeker that searches for web resources by leveraging existing search engines. SearchEngineConfig is the general configuration for all registered SearchEngines. You can load any SearchEngine by specifying the search_engine_type in the configuration. For example, to load the DuckDuckGoEngine, you can use the following configuration:

from flexrag.retriever.web_retrievers import SearchEngineConfig, SEARCH_ENGINES

config = SearchEngineConfig(
    search_engine_type='ddg',
)
seeker = SEARCH_ENGINES.load(config)

class flexrag.retriever.web_retrievers.BingEngineConfig(subscription_key='EMPTY', base_url='https://api.bing.microsoft.com/v7.0/search', timeout=3.0, market='en-US', lang='en', freshness=None)[source]#

The configuration for the BingEngine.

Parameters:

subscription_key (str) -- The subscription key for the Bing Search API. Default is os.environ.get("BING_SEARCH_KEY", "EMPTY").
base_url (str) -- The base_url for the Bing Search API. Default is "https://api.bing.microsoft.com/v7.0/search".
timeout (float) -- The timeout for the requests. Default is 3.0.
market (str) -- The market to use (see https://learn.microsoft.com/en-us/bing/search-apis/bing-web-search/reference/market-codes for more information). Default is en-US.
lang (str) -- The language to use. Default is "en".
freshness (Optional[str]) -- To get articles discovered by Bing during a specific timeframe, specify a date range in the form YYYY-MM-DD..YYYY-MM-DD. For example, &freshness=2019-02-01..2019-05-30. Default is None.

dump(path)#: Dump the dataclass to a YAML file.

dumps()#: Dump the dataclass to a YAML string.

classmethod load(path)#: Load the dataclass from a YAML file.

classmethod loads(s)#: Load the dataclass from a YAML string.
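The YYYY-MM-DD..YYYY-MM-DD range expected by freshness can be built from date objects rather than typed by hand. The helper name bing_freshness is hypothetical and is not part of FlexRAG.

```python
from datetime import date


def bing_freshness(start: date, end: date) -> str:
    """Format a date range as YYYY-MM-DD..YYYY-MM-DD."""
    if start > end:
        raise ValueError("start must not be after end")
    return f"{start.isoformat()}..{end.isoformat()}"


# Reproduces the range used in the parameter description above.
freshness = bing_freshness(date(2019, 2, 1), date(2019, 5, 30))
```

The resulting string can then be passed as the freshness argument of BingEngineConfig.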

class flexrag.retriever.web_retrievers.BingEngine(cfg)[source]#

Bases: WebSeekerBase

The BingEngine retrieves web pages using the Bing Search API.

seek(query, top_k=10, **search_kwargs)[source]#

Seek the web resources.

Parameters:

query (str) -- The query to seek.
top_k (int) -- The number of resources to seek. Default is 10.
search_kwargs -- The additional keyword arguments.

Returns:

The web resources.

Return type:

list[WebResource]

class flexrag.retriever.web_retrievers.DuckDuckGoEngineConfig(proxy=None)[source]#

The configuration for the DuckDuckGoEngine.

Parameters: proxy (Optional[str]) -- The proxy to use. Default is None.

dump(path)#: Dump the dataclass to a YAML file.

dumps()#: Dump the dataclass to a YAML string.

classmethod load(path)#: Load the dataclass from a YAML file.

classmethod loads(s)#: Load the dataclass from a YAML string.

class flexrag.retriever.web_retrievers.DuckDuckGoEngine(cfg)[source]#

Bases: WebSeekerBase

The DuckDuckGoEngine retrieves web pages using the DuckDuckGo Search API.

seek(query, top_k=10, **search_kwargs)[source]#

Seek the web resources.

Parameters:

query (str) -- The query to seek.
top_k (int) -- The number of resources to seek. Default is 10.
search_kwargs -- The additional keyword arguments.

Returns:

The web resources.

Return type:

list[WebResource]

class flexrag.retriever.web_retrievers.GoogleEngineConfig(subscription_key=None, search_engine_id=None, endpoint='https://customsearch.googleapis.com/customsearch/v1', proxy=None, timeout=3.0)[source]#

The configuration for the GoogleEngine.

Parameters:

subscription_key (str) -- The subscription key for the Google Search API. If not provided, it will use the environment variable GOOGLE_SEARCH_KEY. Defaults to None.
search_engine_id (str) -- The search engine id for the Google Search API. If not provided, it will use the environment variable GOOGLE_SEARCH_ENGINE_ID. Defaults to None.
endpoint (str) -- The endpoint for the Google Search API. Default is "https://customsearch.googleapis.com/customsearch/v1".
proxy (Optional[str]) -- The proxy to use. Default is None.
timeout (float) -- The timeout for the requests. Default is 3.0.

dump(path)#: Dump the dataclass to a YAML file.

dumps()#: Dump the dataclass to a YAML string.

classmethod load(path)#: Load the dataclass from a YAML file.

classmethod loads(s)#: Load the dataclass from a YAML string.
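Several configurations in this section fall back to an environment variable when a credential is left as None (for example GOOGLE_SEARCH_KEY and GOOGLE_SEARCH_ENGINE_ID here). The sketch below shows one way such a fallback can work. resolve_key is a hypothetical helper, and raising on a missing value is an assumption; how FlexRAG itself reacts to a missing key is not stated above.

```python
import os
from typing import Optional


def resolve_key(explicit: Optional[str], env_name: str) -> str:
    """Prefer the explicit value, then fall back to the environment variable."""
    value = explicit if explicit is not None else os.environ.get(env_name)
    if not value:
        raise ValueError(f"Set {env_name} or pass the value explicitly")
    return value


# Demo only: a placeholder value, never hard-code real credentials.
os.environ["GOOGLE_SEARCH_KEY"] = "demo-key"
from_env = resolve_key(None, "GOOGLE_SEARCH_KEY")
from_arg = resolve_key("explicit-key", "GOOGLE_SEARCH_KEY")
```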

class flexrag.retriever.web_retrievers.GoogleEngine(cfg)[source]#

Bases: WebSeekerBase

The GoogleEngine retrieves web pages using the Google Custom Search API.

seek(query, top_k=10, **search_kwargs)[source]#

Seek the web resources.

Parameters:

query (str) -- The query to seek.
top_k (int) -- The number of resources to seek. Default is 10.
search_kwargs -- The additional keyword arguments.

Returns:

The web resources.

Return type:

list[WebResource]

class flexrag.retriever.web_retrievers.SerpApiConfig(api_key=None, engine='google', country='us', language='en')[source]#

The configuration for the SerpApi.

Parameters:

api_key (str) -- The API key for the SerpApi. If not provided, it will use the environment variable SERPAPI_API_KEY. Defaults to None.
engine (str) -- The search engine to use. Default is "google". Available choices are "google", "bing", "baidu", "yandex", "yahoo", "google_scholar", "duckduckgo".
country (str) -- The country to search. Default is "us".
language (str) -- The language to search. Default is "en".

dump(path)#: Dump the dataclass to a YAML file.

dumps()#: Dump the dataclass to a YAML string.

classmethod load(path)#: Load the dataclass from a YAML file.

classmethod loads(s)#: Load the dataclass from a YAML string.

class flexrag.retriever.web_retrievers.SerpApi(cfg)[source]#

Bases: WebSeekerBase

The SerpApi retrieves web pages using SerpApi (https://serpapi.com/).

seek(query, top_k=10, **search_kwargs)[source]#

Seek the web resources.

Parameters:

query (str) -- The query to seek.
top_k (int) -- The number of resources to seek. Default is 10.
search_kwargs -- The additional keyword arguments.

Returns:

The web resources.

Return type:

list[WebResource]

Web Downloader#

Web downloader is used to download data from the web.

class flexrag.retriever.web_retrievers.WebDownloaderBaseConfig(allow_parallel=True)[source]#

The configuration for the WebDownloaderBase.

Parameters: allow_parallel (bool) -- Whether to allow parallel downloading. Default is True.

dump(path)#: Dump the dataclass to a YAML file.

dumps()#: Dump the dataclass to a YAML string.

classmethod load(path)#: Load the dataclass from a YAML file.

classmethod loads(s)#: Load the dataclass from a YAML string.

class flexrag.retriever.web_retrievers.WebDownloaderBase(cfg)[source]#

The base class for the WebDownloader.

async async_download(resources)[source]#: Download the web resources asynchronously.

download(resources)[source]#

Download the web resources.

Parameters: resources (WebResource | list[WebResource]) -- The resources to download.
Returns: The downloaded web resources.
Return type: list[WebResource]
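The effect of an allow_parallel switch on an asynchronous downloader can be sketched as follows. This is an assumption about the general pattern rather than FlexRAG's implementation: fetch and download_all are hypothetical, and fetch only simulates a request with a short sleep.

```python
import asyncio


async def fetch(url: str) -> str:
    # Hypothetical stand-in for an HTTP request.
    await asyncio.sleep(0.01)
    return f"content of {url}"


async def download_all(urls: list, allow_parallel: bool = True) -> list:
    if allow_parallel:
        # gather() runs the coroutines concurrently and preserves input order.
        return list(await asyncio.gather(*(fetch(u) for u in urls)))
    # Sequential fallback: one request at a time.
    return [await fetch(u) for u in urls]


urls = ["https://example.com/a", "https://example.com/b"]
parallel = asyncio.run(download_all(urls, allow_parallel=True))
sequential = asyncio.run(download_all(urls, allow_parallel=False))
```

Both modes return the same results in the same order; only the wall-clock time differs.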

class flexrag.retriever.web_retrievers.SimpleWebDownloaderConfig(allow_parallel=True, proxy=None, timeout=3.0, headers=None)[source]#

The configuration for the SimpleWebDownloader.

Parameters:

proxy (Optional[str]) -- The proxy to use. Default is None.
timeout (float) -- The timeout for the requests. Default is 3.0.
headers (Optional[dict]) -- The headers to use. Default is None.

dump(path)#: Dump the dataclass to a YAML file.

dumps()#: Dump the dataclass to a YAML string.

classmethod load(path)#: Load the dataclass from a YAML file.

classmethod loads(s)#: Load the dataclass from a YAML string.

class flexrag.retriever.web_retrievers.SimpleWebDownloader(cfg)[source]#

Bases: WebDownloaderBase

Downloads the HTML content using httpx.

class flexrag.retriever.web_retrievers.PlaywrightWebDownloaderConfig(allow_parallel=True, headless=True, browser='chromium', device='Desktop Chrome', page_width=None, page_height=None, proxy=None, return_screenshot=False)[source]#

The configuration for the PlaywrightWebDownloader.

Parameters:

headless (bool) -- Whether to run the browser in headless mode. Default is True.
browser (str) -- The browser to use. Default is chromium. Available choices are chromium, firefox, webkit, and msedge.
device (str) -- The device to emulate. Default is Desktop Chrome.
page_width (Optional[int]) -- The width of the emulated device. Default is None.
page_height (Optional[int]) -- The height of the emulated device. Default is None.
proxy (Optional[str]) -- The proxy to use. Default is None.
return_screenshot (bool) -- Whether to return the screenshot. Default is False.

dump(path)#: Dump the dataclass to a YAML file.

dumps()#: Dump the dataclass to a YAML string.

classmethod load(path)#: Load the dataclass from a YAML file.

classmethod loads(s)#: Load the dataclass from a YAML string.

class flexrag.retriever.web_retrievers.PlaywrightWebDownloader(cfg)[source]#

Bases: WebDownloaderBase

Downloads the web resources using Playwright.

async async_download(resources)[source]#: Download the web resources asynchronously.

download(resources)[source]#

Download the web resources.

Parameters: resources (WebResource | list[WebResource]) -- The resources to download.
Returns: The downloaded web resources.
Return type: list[WebResource]

Web Reader#

Web reader is used to convert web data into an LLM-friendly format.

class flexrag.retriever.web_retrievers.WebReaderBase[source]#

The base class for the WebReader. The WebReader is used to parse the web resources into a format that can be fed into the LLM.

abstract property fields#: The fields that the reader will return.

abstract read(resources)[source]#

Parse the retrieved resources into an LLM-readable format.

Parameters: resources (list[WebResource]) -- Resources sought from the web.
Returns: Contexts that can be fed into the LLM.
Return type: list[RetrievedContext]

class flexrag.retriever.web_retrievers.WebReaderConfig(web_reader_type=None, jina_readerlm_config=<factory>, jina_reader_config=<factory>, screenshot_config=<factory>)#

Configuration class for web_reader (name: WebReaderConfig, default: None).

Parameters:

web_reader_type (str) -- The web_reader type to use.
jina_readerlm_config (JinaReaderLMConfig) -- The config for JinaReaderLM.
jina_reader_config (JinaReaderConfig) -- The config for JinaReader.
screenshot_config (ScreenshotWebReaderConfig) -- The config for ScreenshotWebReader.

WebReaderConfig is the general configuration for all registered WebReaders. You can load any WebReader by specifying the web_reader_type in the configuration. For example, to load the JinaReader, you can use the following configuration:

from flexrag.retriever.web_retrievers import WebReaderConfig, WEB_READERS

config = WebReaderConfig(
    web_reader_type='jina_reader',
)
reader = WEB_READERS.load(config)

class flexrag.retriever.web_retrievers.JinaReaderConfig(base_url='https://r.jina.ai', api_key=None, proxy=None)[source]#

The configuration for the JinaReader.

Parameters:

base_url (str) -- The base URL of the Jina Reader API. Default is "https://r.jina.ai".
api_key (str) -- The API key for the Jina Reader API. If not provided, it will use the environment variable JINA_API_KEY. Defaults to None.
proxy (Optional[str]) -- The proxy to use. Defaults to None.

dump(path)#: Dump the dataclass to a YAML file.

dumps()#: Dump the dataclass to a YAML string.

classmethod load(path)#: Load the dataclass from a YAML file.

classmethod loads(s)#: Load the dataclass from a YAML string.

class flexrag.retriever.web_retrievers.JinaReader(cfg)[source]#

Bases: WebReaderBase

The JinaReader parses web pages using the Jina Reader API.

property fields#: The JinaReader will return the processed_content field.

read(resources)[source]#

Parse the retrieved resources into an LLM-readable format.

Parameters: resources (list[WebResource]) -- Resources sought from the web.
Returns: Contexts that can be fed into the LLM.
Return type: list[RetrievedContext]

class flexrag.retriever.web_retrievers.JinaReaderLMConfig(do_sample=True, sample_num=1, temperature=1.0, max_new_tokens=512, top_p=0.9, top_k=50, eos_token_id=None, stop_str=<factory>, web_downloader_type=None, simple_config=<factory>, playwright_config=<factory>, generator_type=None, anthropic_config=<factory>, hf_config=<factory>, hf_vlm_config=<factory>, ollama_config=<factory>, openai_config=<factory>, vllm_config=<factory>, use_v2_prompt=False, pre_clean_html=False, clean_svg=False, clean_base64=False)[source]#

The configuration for the JinaReaderLM.

Parameters:

use_v2_prompt (bool) -- Whether to use the jinaai/ReaderLM-v2 prompt. Default is False.
pre_clean_html (bool) -- Whether to pre-clean the HTML content. Default is False.
clean_svg (bool) -- Whether to clean the SVG content. Default is False.
clean_base64 (bool) -- Whether to clean the base64 images. Default is False.

dump(path)#: Dump the dataclass to a YAML file.

dumps()#: Dump the dataclass to a YAML string.

classmethod load(path)#: Load the dataclass from a YAML file.

classmethod loads(s)#: Load the dataclass from a YAML string.
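To give a sense of what HTML pre-cleaning with clean_svg and clean_base64 style switches can involve, here is a regex-based sketch. It is not the cleaning routine used by JinaReaderLM: the function name, both patterns, and the decision to drop whole elements are assumptions, and regexes are only a rough approximation of real HTML parsing.

```python
import re

# Matches an entire inline <svg>...</svg> element, including multi-line bodies.
SVG_PATTERN = re.compile(r"<svg\b[^>]*>.*?</svg>", re.IGNORECASE | re.DOTALL)
# Matches <img> tags whose src is an inline base64 data URI.
BASE64_IMG_PATTERN = re.compile(
    r"<img\b[^>]*\bsrc=[\"']data:image/[^;\"']+;base64,[^\"']*[\"'][^>]*>",
    re.IGNORECASE,
)


def strip_heavy_markup(html: str, clean_svg: bool = False, clean_base64: bool = False) -> str:
    """Remove inline SVG and base64 images that add tokens but little text."""
    if clean_svg:
        html = SVG_PATTERN.sub("", html)
    if clean_base64:
        html = BASE64_IMG_PATTERN.sub("", html)
    return html


page = (
    '<p>Hello</p><svg width="10"><circle r="4"/></svg>'
    '<img src="data:image/png;base64,iVBORw0KGgo="><p>World</p>'
)
cleaned = strip_heavy_markup(page, clean_svg=True, clean_base64=True)
svg_only = strip_heavy_markup(page, clean_svg=True)
```

Removing such payloads before generation shortens the input, which matters when max_new_tokens and the model's context window are limited.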

class flexrag.retriever.web_retrievers.JinaReaderLM(cfg)[source]#

Bases: WebReaderBase

The JinaReaderLM downloads and parses the HTML content using the Jina ReaderLM model.

property fields#: The JinaReaderLM will return the raw_content and processed_content fields.

read(resources)[source]#

Parse the retrieved resources into an LLM-readable format.

Parameters: resources (list[WebResource]) -- Resources sought from the web.
Returns: Contexts that can be fed into the LLM.
Return type: list[RetrievedContext]

class flexrag.retriever.web_retrievers.ScreenshotWebReader(cfg)[source]#

Bases: WebReaderBase

The ScreenshotWebReader reads web pages by taking screenshots.

property fields#: The ScreenshotWebReader will return the screenshot field.

read(resources)[source]#

Parse the retrieved resources into an LLM-readable format.

Parameters: resources (list[WebResource]) -- Resources sought from the web.
Returns: Contexts that can be fed into the LLM.
Return type: list[RetrievedContext]

class flexrag.retriever.web_retrievers.ScreenshotWebReaderConfig(allow_parallel=True, headless=True, browser='chromium', device='Desktop Chrome', page_width=None, page_height=None, proxy=None, return_screenshot=True)[source]#

The configuration for the ScreenshotWebReader.

dump(path)#: Dump the dataclass to a YAML file.

dumps()#: Dump the dataclass to a YAML string.

classmethod load(path)#: Load the dataclass from a YAML file.

classmethod loads(s)#: Load the dataclass from a YAML string.

class flexrag.retriever.web_retrievers.SnippetWebReader[source]#

The SnippetWebReader returns the snippet of the resource directly.

This is useful when the resources are retrieved by a SearchEngine and the snippets are sufficient for the LLM to generate the response.

property fields#: The SnippetWebReader will return the snippet field.

read(resources)[source]#

Parse the retrieved resources into an LLM-readable format.

Parameters: resources (list[WebResource]) -- Resources sought from the web.
Returns: Contexts that can be fed into the LLM.
Return type: list[RetrievedContext]
