Retrievers#
Retrievers are used to retrieve data from the local knowledge base or the web.
The Retriever Interface#
RetrieverBase is the base class for all retrievers,
including the subclasses of EditableRetriever and WebRetrieverBase.
- class flexrag.retriever.RetrieverBaseConfig(log_interval=100, top_k=10, batch_size=32, query_preprocess_pipeline=<factory>)[source]#
Base configuration class for all retrievers.
- Parameters:
log_interval (int) – The interval of logging. Default: 100.
top_k (int) – The number of retrieved documents. Default: 10.
batch_size (int) – The batch size for retrieval. Default: 32.
query_preprocess_pipeline (TextProcessPipelineConfig) – The text process pipeline for query. Default: TextProcessPipelineConfig.
- dump(path)#
Dump the dataclass to a YAML file.
- dumps()#
Dump the dataclass to a YAML string.
- classmethod load(path)#
Load the dataclass from a YAML file.
- classmethod loads(s)#
Load the dataclass from a YAML string.
- class flexrag.retriever.RetrieverBase(cfg)[source]#
The base class for all retrievers. Subclasses should implement the search method and the fields property.
- abstract property fields#
The fields of the retrieved data.
- abstract search(query, **search_kwargs)[source]#
Search a batch of queries.
- Parameters:
query (list[Any] | Any) – Queries to search.
search_kwargs (Any) – Keyword arguments, contains other search arguments.
- Returns:
One list per query, each containing k RetrievedContext objects.
- Return type:
list[list[RetrievedContext]]
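The interface described above (a search method that accepts one query or a batch and returns one result list per query, plus a fields property) can be illustrated with a small stand-alone sketch. It does not import FlexRAG: ToyContext, ToyRetrieverBase, and KeywordRetriever are hypothetical stand-ins invented for this example, and the word-overlap scoring is an assumption chosen only to keep the code short.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any


@dataclass
class ToyContext:
    """Hypothetical stand-in for RetrievedContext (not the FlexRAG class)."""

    query: str
    data: dict[str, Any]
    score: float = 0.0


class ToyRetrieverBase(ABC):
    """Hypothetical base class that mirrors the documented interface."""

    def __init__(self, top_k: int = 10) -> None:
        self.top_k = top_k

    @property
    @abstractmethod
    def fields(self) -> list[str]:
        """The fields of the retrieved data."""

    @abstractmethod
    def search(self, query, **search_kwargs) -> list[list[ToyContext]]:
        """Search a single query or a batch of queries."""


class KeywordRetriever(ToyRetrieverBase):
    """Scores each passage by the number of query words it contains."""

    def __init__(self, passages: list[dict[str, Any]], top_k: int = 10) -> None:
        super().__init__(top_k)
        self.passages = passages

    @property
    def fields(self) -> list[str]:
        return sorted({key for passage in self.passages for key in passage})

    def search(self, query, **search_kwargs) -> list[list[ToyContext]]:
        # Accept either one query string or a batch of query strings.
        queries = [query] if isinstance(query, str) else list(query)
        top_k = search_kwargs.get("top_k", self.top_k)
        results = []
        for q in queries:
            words = set(q.lower().split())
            scored = []
            for passage in self.passages:
                overlap = len(words & set(passage["text"].lower().split()))
                if overlap > 0:
                    scored.append(ToyContext(query=q, data=passage, score=float(overlap)))
            scored.sort(key=lambda context: context.score, reverse=True)
            results.append(scored[:top_k])
        return results
```

The outer list always has one entry per query, which matches the documented return type list[list[RetrievedContext]].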
- class flexrag.retriever.RetrieverConfig(retriever_type=None, elastic_config=<factory>, flex_config=<factory>, hyde_config=<factory>, typesense_config=<factory>, simple_web_config=<factory>, wikipedia_config=<factory>)#
Configuration class for retriever (name: RetrieverConfig, default: None).
- Parameters:
retriever_type (str) – The retriever type to use.
elastic_config (ElasticRetrieverConfig) – The config for ElasticRetriever.
flex_config (FlexRetrieverConfig) – The config for FlexRetriever.
hyde_config (HydeRetrieverConfig) – The config for HydeRetriever.
typesense_config (TypesenseRetrieverConfig) – The config for TypesenseRetriever.
simple_web_config (SimpleWebRetrieverConfig) – The config for SimpleWebRetriever.
wikipedia_config (WikipediaRetrieverConfig) – The config for WikipediaRetriever.
RetrieverConfig is the general configuration for all registered retrievers.
You can load any retriever by specifying the retriever name in the configuration.
For example, to load the pre-built FlexRetriever,
you can use the following configuration:
from flexrag.retriever import RetrieverConfig, RETRIEVERS, FlexRetrieverConfig
config = RetrieverConfig(
    retriever_type='flex',
    flex_config=FlexRetrieverConfig(
        retriever_path='<path_to_retriever>',
    )
)
retriever = RETRIEVERS.load(config)
Editable Retriever#
- class flexrag.retriever.EditableRetrieverConfig(log_interval=100, top_k=10, batch_size=32, query_preprocess_pipeline=<factory>)[source]#
Configuration class for EditableRetriever.
- dump(path)#
Dump the dataclass to a YAML file.
- dumps()#
Dump the dataclass to a YAML string.
- classmethod load(path)#
Load the dataclass from a YAML file.
- classmethod loads(s)#
Load the dataclass from a YAML string.
- class flexrag.retriever.EditableRetriever(cfg)[source]#
Bases: RetrieverBase
The base class for all editable retrievers. In FlexRAG, an EditableRetriever is a retriever that provides the add_passages and clear methods, allowing you to build the retriever from your own knowledge base. FlexRAG provides the following editable retrievers: FlexRetriever, ElasticRetriever, TypesenseRetriever, and HydeRetriever. Subclasses should implement the add_passages, clear, and __len__ methods.
- class flexrag.retriever.ElasticRetrieverConfig(log_interval=100, top_k=10, batch_size=32, query_preprocess_pipeline=<factory>, host='http://localhost:9200', api_key=None, index_name=None, custom_properties=None, verbose=False, retry_times=3, retry_delay=0.5)[source]#
Configuration class for ElasticRetriever.
- Parameters:
host (str) – Host of the ElasticSearch server. Default: “http://localhost:9200”.
api_key (Optional[str]) – API key for the ElasticSearch server. Default: None.
index_name (str) – Name of the index. Required.
custom_properties (Optional[dict]) – Custom properties for building the index. Default: None.
verbose (bool) – Enable verbose logging mode. Default: False.
retry_times (int) – Number of retry times. Default: 3.
retry_delay (float) – Delay time for retry. Default: 0.5.
- dump(path)#
Dump the dataclass to a YAML file.
- dumps()#
Dump the dataclass to a YAML string.
- classmethod load(path)#
Load the dataclass from a YAML file.
- classmethod loads(s)#
Load the dataclass from a YAML string.
- class flexrag.retriever.ElasticRetriever(cfg)[source]#
Bases: EditableRetriever
- add_passages(**kwargs)#
Add passages to the retriever database.
- Parameters:
passages (Iterable[Context]) – The passages to add.
- Returns:
None
- property fields#
The fields of the retrieved data.
- search(**kwargs)#
Search a batch of queries.
- Parameters:
query (list[Any] | Any) – Queries to search.
search_kwargs (Any) – Keyword arguments, contains other search arguments.
- Returns:
One list per query, each containing k RetrievedContext objects.
- Return type:
list[list[RetrievedContext]]
- class flexrag.retriever.TypesenseRetrieverConfig(log_interval=100, top_k=10, batch_size=32, query_preprocess_pipeline=<factory>, host='localhost', port=8108, protocol='http', api_key=None, index_name=None, timeout=200.0)[source]#
Configuration class for TypesenseRetriever.
- Parameters:
host (str) – Host of the Typesense server. Default: “localhost”.
port (int) – Port of the Typesense server. Default: 8108.
protocol (str) – Protocol of the Typesense server. Default: “http”. Available options: “https”, “http”.
api_key (str) – API key for the Typesense server. Required.
index_name (str) – Name of the Typesense collection. Required.
timeout (float) – Timeout for the connection. Default: 200.0.
- dump(path)#
Dump the dataclass to a YAML file.
- dumps()#
Dump the dataclass to a YAML string.
- classmethod load(path)#
Load the dataclass from a YAML file.
- classmethod loads(s)#
Load the dataclass from a YAML string.
- class flexrag.retriever.TypesenseRetriever(cfg)[source]#
Bases: EditableRetriever
- add_passages(**kwargs)#
Add passages to the retriever database.
- Parameters:
passages (Iterable[Context]) – The passages to add.
- Returns:
None
- property fields#
The fields of the retrieved data.
- search(**kwargs)#
Search a batch of queries.
- Parameters:
query (list[Any] | Any) – Queries to search.
search_kwargs (Any) – Keyword arguments, contains other search arguments.
- Returns:
One list per query, each containing k RetrievedContext objects.
- Return type:
list[list[RetrievedContext]]
- class flexrag.retriever.LocalRetrieverConfig(log_interval=100, top_k=10, batch_size=32, query_preprocess_pipeline=<factory>, retriever_path=None)[source]#
The configuration class for LocalRetriever.
- Parameters:
retriever_path (Optional[str]) – The path to the local database. Default: None. If specified, every modification to the retriever is also written to disk immediately. If not specified, the retriever is kept in memory.
- dump(path)#
Dump the dataclass to a YAML file.
- dumps()#
Dump the dataclass to a YAML string.
- classmethod load(path)#
Load the dataclass from a YAML file.
- classmethod loads(s)#
Load the dataclass from a YAML string.
- class flexrag.retriever.LocalRetriever(cfg)[source]#
Bases: EditableRetriever
The base class for all local retrievers.
In FlexRAG, a LocalRetriever is a retriever that can be saved to the local disk. Subclasses provide the save_to_local and load_from_local methods to save the retriever to and load it from the local disk, and the save_to_hub and load_from_hub methods to save it to and load it from the HuggingFace Hub.
FlexRAG provides the following local retrievers: FlexRetriever and HydeRetriever.
For example, to load a retriever hosted on the HuggingFace Hub, you can run the following code:
from flexrag.retriever import LocalRetriever
retriever = LocalRetriever.load_from_hub("flexrag/wiki2021_atlas_bm25s")
To save a retriever to the HuggingFace Hub, you can run the following code:
retriever.save_to_hub("<your-repo-id>", token="<your-token>")
- abstract detach()[source]#
Detach the retriever from the local database. After detaching, the retriever will be kept in memory and all modifications will not be applied to the disk.
- static load_from_hub(repo_id, revision=None, token=None, cache_dir='/home/docs/.cache/flexrag', **kwargs)[source]#
Load a retriever from the HuggingFace Hub.
- Parameters:
repo_id (str) – The repo id of the retriever on the HuggingFace Hub.
revision (str) – The revision of the retriever on the HuggingFace Hub. Default: None.
token (str) – The token to access the HuggingFace Hub. Default: None.
cache_dir (str) – The cache directory to store the retriever. Default: FLEXRAG_CACHE_DIR.
kwargs (Any) – Additional arguments for the retriever.
- Returns:
The loaded retriever.
- Return type:
- static load_from_local(repo_path=None, **kwargs)[source]#
Load a retriever from the local disk.
- Parameters:
repo_path (str) – The path to the local database. Default: None.
- Returns:
The loaded retriever.
- Return type:
- save_to_hub(repo_id, token=None, commit_message='Update FlexRAG retriever', retriever_card=None, private=False, **kwargs)[source]#
Save the retriever to the HuggingFace Hub.
- Parameters:
repo_id (str) – The repo id of the retriever on the HuggingFace Hub.
token (str) – The token to access the HuggingFace Hub. Default: None.
commit_message (str) – The commit message for the retriever. Default: “Update FlexRAG retriever”.
retriever_card (str) – The markdown readme file for the retriever. Default: None.
private (bool) – Whether to create a private repo. Default: False.
kwargs (Any) – Additional arguments for uploading the retriever.
- Returns:
The repo url of the retriever.
- Return type:
str
- class flexrag.retriever.FlexRetrieverConfig(log_interval=100, top_k=10, batch_size=32, query_preprocess_pipeline=<factory>, retriever_path=None, indexes_merge_method='rrf', indexes_merge_weights=None, used_indexes=None, rrf_base=60)[source]#
Configuration class for FlexRetriever.
- Parameters:
indexes_merge_method (str) – Method to merge the scores of multiple indexes. Available choices are “rrf” and “linear”. Default is “rrf”.
* “rrf”: Reciprocal Rank Fusion (RRF) method.
* “linear”: Linear combination of the scores.
indexes_merge_weights (Optional[list[float]]) – List of weights for each index. Default is None. If None, all indexes will be treated equally. This option is used in both “rrf” and “linear” methods.
used_indexes (Optional[list[str]]) – List of indexes to use for retrieval. Default is None. If None, all indexes will be used.
rrf_base (int) – Base for the RRF method. Default is 60. This option is only used when indexes_merge_method is “rrf”.
- dump(path)#
Dump the dataclass to a YAML file.
- dumps()#
Dump the dataclass to a YAML string.
- classmethod load(path)#
Load the dataclass from a YAML file.
- classmethod loads(s)#
Load the dataclass from a YAML string.
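The two indexes_merge_method options of FlexRetrieverConfig can be sketched in a few lines of plain Python. This is a minimal illustration of the textbook techniques, not FlexRAG's implementation: the function names rrf_merge and linear_merge are hypothetical, ranks are assumed to start at 1, and the linear variant assumes the per-index scores are already on comparable scales.

```python
from collections import defaultdict


def rrf_merge(rankings, weights=None, rrf_base=60, top_k=10):
    """Fuse ranked id lists with weighted Reciprocal Rank Fusion.

    Each document receives weight / (rrf_base + rank) from every index
    that returned it, where rank starts at 1 (an assumption of this sketch).
    """
    if weights is None:
        weights = [1.0] * len(rankings)
    fused = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += weight / (rrf_base + rank)
    ordered = sorted(fused.items(), key=lambda item: item[1], reverse=True)
    return ordered[:top_k]


def linear_merge(score_maps, weights=None, top_k=10):
    """Fuse {doc_id: score} maps with a weighted sum of the raw scores."""
    if weights is None:
        weights = [1.0] * len(score_maps)
    fused = defaultdict(float)
    for scores, weight in zip(score_maps, weights):
        for doc_id, score in scores.items():
            fused[doc_id] += weight * score
    ordered = sorted(fused.items(), key=lambda item: item[1], reverse=True)
    return ordered[:top_k]
```

RRF only looks at ranks, so it is insensitive to the score scale of each index; the linear combination uses the scores directly, which is why comparable scales matter there.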
- class flexrag.retriever.FlexRetriever(cfg)[source]#
Bases: LocalRetriever
FlexRetriever is a retriever implemented by the FlexRAG team. It supports multi-index and multi-field retrieval.
- add_passages(**kwargs)#
Add passages to the retriever database.
- Parameters:
passages (Iterable[Context]) – The passages to add.
- Returns:
None
- detach()[source]#
Detach the retriever from the local disk to memory. This function will not delete the database or the indexes.
- property fields#
The fields of the retrieved data.
- remove_index(index_name)[source]#
Remove an index from the retriever.
- Parameters:
index_name (str) – Name of the index.
- Raises:
ValueError – If the index name does not exist.
- Returns:
None
- Return type:
None
- save_to_local(retriever_path=None)[source]#
Save the retriever to the local disk.
- Parameters:
retriever_path (str) – The path to the local database. Default: None.
- Returns:
None
- Return type:
None
- search(**kwargs)#
Search a batch of queries.
- Parameters:
query (list[Any] | Any) – Queries to search.
search_kwargs (Any) – Keyword arguments, contains other search arguments.
- Returns:
One list per query, each containing k RetrievedContext objects.
- Return type:
list[list[RetrievedContext]]
- class flexrag.retriever.HydeRetrieverConfig(generator_type=None, anthropic_config=<factory>, hf_config=<factory>, hf_vlm_config=<factory>, ollama_config=<factory>, openai_config=<factory>, vllm_config=<factory>, log_interval=100, top_k=10, batch_size=32, query_preprocess_pipeline=<factory>, retriever_path=None, indexes_merge_method='rrf', indexes_merge_weights=None, used_indexes=None, rrf_base=60, task='WEB_SEARCH', language='en')#
Configuration class for HydeRetriever.
- Parameters:
task (str) – Task for rewriting the query. Default: “WEB_SEARCH”. Available options: “WEB_SEARCH”, “SCIFACT”, “ARGUANA”, “TREC_COVID”, “FIQA”, “DBPEDIA_ENTITY”, “TREC_NEWS”, “MR_TYDI”.
language (str) – Language for rewriting. Default: “en”.
- dump(path)#
Dump the dataclass to a YAML file.
- dumps()#
Dump the dataclass to a YAML string.
- classmethod load(path)#
Load the dataclass from a YAML file.
- classmethod loads(s)#
Load the dataclass from a YAML string.
- class flexrag.retriever.HydeRetriever(cfg, no_check=False)#
Bases: FlexRetriever
HydeRetriever is a retriever that rewrites the query before searching.
The original paper is available at https://aclanthology.org/2023.acl-long.99/.
- search(query, **search_kwargs)#
Search a batch of queries.
- Parameters:
query (list[Any] | Any) – Queries to search.
search_kwargs (Any) – Keyword arguments, contains other search arguments.
- Returns:
One list per query, each containing k RetrievedContext objects.
- Return type:
list[list[RetrievedContext]]
Retriever Index#
RetrieverIndex is used in FlexRetriever to store and retrieve dense embeddings.
- class flexrag.retriever.index.RetrieverIndexBase[source]#
The base class for all retriever indexes. This class provides the basic interface for building, adding, and searching the index.
Subclasses should implement the following methods:
  - build_index: Build the index from the data.
  - insert: Add a batch of data to the index.
  - search: Search for the top_k most similar data indices to the query.
  - serialize: Serialize the index to the disk.
  - clear: Clear the index and remove the serialized index files.
  - __len__: Return the number of data in the index.
  - is_addable: Return whether the index is addable.
- abstract build_index(data)[source]#
Build the index. The index will be serialized automatically if the index_path is set.
- Parameters:
data (Iterable[Any]) – The data to build the index.
- Returns:
None
- abstract property infimum#
Return the infimum of the similarity scores for the index.
- abstract insert(data, serialize=True)[source]#
Add a batch of data to the index.
- Parameters:
data (list[Any]) – The data to add.
serialize (bool) – Whether to serialize the index after adding data. Defaults to True.
- Returns:
None
- insert_batch(data, batch_size=None, serialize=True)[source]#
Add data to the index in batches. This method will automatically perform the serialize method if the index_path is set.
- Parameters:
data (Iterable[Any]) – The data to add.
batch_size (int) – The batch size to add data to the index. Defaults to self.batch_size.
serialize (bool) – Whether to serialize the index after adding data. Defaults to True.
- Returns:
None
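The batching behaviour described for insert_batch (consume an iterable and hand it to insert in chunks of batch_size) can be sketched with the standard library. The helper name iter_batches is hypothetical and the code is only an illustration of the idea, not the FlexRAG implementation.

```python
from itertools import islice


def iter_batches(data, batch_size):
    """Yield successive lists of at most batch_size items from any iterable."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(data)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch
```

Because the helper pulls items lazily, the input can be a generator that streams passages from disk, and only one batch is held in memory at a time.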
- static load_from_local(index_path)[source]#
Load the index from the local path.
- Parameters:
index_path (str) – The path to load the index.
- abstract save_to_local(index_path=None)[source]#
Serialize the index to self.index_path. If the index_path is given, the index will be serialized to the index_path.
- Parameters:
index_path (str, optional) – The path to serialize the index. Defaults to self.index_path.
- abstract search(query, top_k, **search_kwargs)[source]#
Search for the top_k most similar data indices to the query.
- Parameters:
query (list[Any]) – The query data.
top_k (int, optional) – The number of most similar data indices to return, defaults to 10.
search_kwargs (Any) – Additional search arguments.
- Returns:
The indices and scores of the top_k most similar data indices.
- Return type:
tuple[np.ndarray, np.ndarray]
- abstract property supremum#
Return the supremum of the similarity scores for the index.
- class flexrag.retriever.index.RetrieverIndexBaseConfig(log_interval=10000, batch_size=512, index_path=None)[source]#
The configuration for the RetrieverIndexBase.
- Parameters:
log_interval (int) – The interval to log the progress. Defaults to 10000.
batch_size (int) – The batch size to add data to the index. Defaults to 512.
index_path (Optional[str]) – The path to save the index. If not specified, the index will be kept in memory. Defaults to None.
- dump(path)#
Dump the dataclass to a YAML file.
- dumps()#
Dump the dataclass to a YAML string.
- classmethod load(path)#
Load the dataclass from a YAML file.
- classmethod loads(s)#
Load the dataclass from a YAML string.
- class flexrag.retriever.index.RetrieverIndexConfig(index_type='faiss', bm25_config=<factory>, faiss_config=<factory>, scann_config=<factory>)#
Configuration class for index (name: RetrieverIndexConfig, default: faiss).
- Parameters:
index_type (str) – The index type to use.
bm25_config (BM25IndexConfig) – The config for BM25Index.
faiss_config (FaissIndexConfig) – The config for FaissIndex.
scann_config (ScaNNIndexConfig) – The config for ScaNNIndex.
RetrieverIndexConfig is the general configuration for all registered RetrieverIndexes.
You can load any RetrieverIndex by specifying the index_type in the configuration.
For example, to load the BM25Index, you can use the following configuration:
from flexrag.retriever.index import RetrieverIndexConfig, RETRIEVER_INDEX, BM25IndexConfig
config = RetrieverIndexConfig(
    index_type='bm25',
    bm25_config=BM25IndexConfig(
        index_path='<path_to_index>',
    )
)
index = RETRIEVER_INDEX.load(config)
- class flexrag.retriever.index.FaissIndexConfig(log_interval=10000, batch_size=512, index_path=None, query_encoder_config=<factory>, passage_encoder_config=<factory>, distance_function='IP', index_type='auto', n_subquantizers=8, n_bits=8, n_list=1000, factory_str=None, index_train_num=-1, n_probe=None, device_id=<factory>, k_factor=10, polysemous_ht=0, efSearch=100)[source]#
The configuration for the FaissIndex.
- Parameters:
index_type (str) – Building param: the type of the index. Defaults to “auto”. Available choices are “FLAT”, “IVF”, “PQ”, “IVFPQ”, and “auto”. If set to “auto”, the index will be set to “IVF{n_list},PQ{embedding_size//2}x4fs”.
n_subquantizers (int) – Building param: the number of subquantizers. Defaults to 8. This parameter is only used when the index type is “PQ” or “IVFPQ”.
n_bits (int) – Building param: the number of bits per subquantizer. Defaults to 8. This parameter is only used when the index type is “PQ” or “IVFPQ”.
n_list (int) – Building param: the number of cells. Defaults to 1000. This parameter is only used when the index type is “IVF” or “IVFPQ”.
factory_str (Optional[str]) – Building param: the factory string to build the index. Defaults to None. If set, the index_type will be ignored.
index_train_num (int) – Building param: the number of data used to train the index. Defaults to -1. If set to -1, all data will be used to train the index.
n_probe (Optional[int]) – Inference param: the number of probes. Defaults to None. If not set, the number of probes will be set to n_list // 8. This parameter is only used when the index type is “IVF” or “IVFPQ”.
device_id (list[int]) – Inference param: the device(s) to use. Defaults to []. [] means CPU. If set, the index will be accelerated with GPU.
k_factor (int) – Inference param: the k factor for search. Defaults to 10.
polysemous_ht (int) – Inference param: the Hamming threshold used for polysemous filtering. Defaults to 0.
efSearch (int) – Inference param: the efSearch for HNSW. Defaults to 100.
- dump(path)#
Dump the dataclass to a YAML file.
- dumps()#
Dump the dataclass to a YAML string.
- classmethod load(path)#
Load the dataclass from a YAML file.
- classmethod loads(s)#
Load the dataclass from a YAML string.
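Two of the derived defaults described for FaissIndexConfig are simple arithmetic: the factory string used when index_type is “auto”, and the fallback value of n_probe. The helper names below are hypothetical and the code only restates the documented formulas; it does not call faiss.

```python
from typing import Optional


def auto_factory_string(embedding_size: int, n_list: int = 1000) -> str:
    """Build the factory string documented for index_type="auto"."""
    return f"IVF{n_list},PQ{embedding_size // 2}x4fs"


def default_n_probe(n_list: int = 1000, n_probe: Optional[int] = None) -> int:
    """Return n_probe, falling back to n_list // 8 when it is not set."""
    return n_list // 8 if n_probe is None else n_probe
```

For 768-dimensional embeddings and the default n_list, this gives the string "IVF1000,PQ384x4fs" and 125 probes. A string of this form is the kind of description that faiss.index_factory accepts.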
- class flexrag.retriever.index.FaissIndex(cfg)[source]#
Bases: DenseIndexBase
FaissIndex uses the faiss library to build and search embedding indexes. It supports both CPU and GPU acceleration and several index types, including FLAT, IVF, PQ, IVFPQ, and auto.
- add_embeddings(embeddings)[source]#
A helper function that adds embeddings to the index.
- Parameters:
embeddings (np.ndarray) – The embeddings to add.
- Returns:
None
- build_index(data)[source]#
Build the index. The index will be serialized automatically if the index_path is set.
- Parameters:
data (Iterable[Any]) – The data to build the index.
- Returns:
None
- property embedding_size#
Return the embedding size of the index.
- save_to_local(index_path=None)[source]#
Serialize the index to self.index_path. If the index_path is given, the index will be serialized to the index_path.
- Parameters:
index_path (str, optional) – The path to serialize the index. Defaults to self.index_path.
- search(query, top_k, **search_kwargs)[source]#
Search for the top_k most similar data indices to the query.
- Parameters:
query (list[Any]) – The query data.
top_k (int, optional) – The number of most similar data indices to return, defaults to 10.
search_kwargs (Any) – Additional search arguments.
- Returns:
The indices and scores of the top_k most similar data indices.
- Return type:
tuple[np.ndarray, np.ndarray]
- class flexrag.retriever.index.ScaNNIndexConfig(log_interval=10000, batch_size=512, index_path=None, query_encoder_config=<factory>, passage_encoder_config=<factory>, distance_function='IP', num_leaves=2000, num_leaves_to_search=500, num_neighbors=10, anisotropic_quantization_threshold=0.2, dimensions_per_block=2, threads=0, index_train_num=0)[source]#
The configuration for the ScaNNIndex.
- Parameters:
num_leaves (int) – The number of leaves in the tree. Defaults to 2000.
num_leaves_to_search (int) – The number of leaves to search. Defaults to 500.
num_neighbors (int) – The number of neighbors to search. Defaults to 10.
anisotropic_quantization_threshold (float) – The anisotropic quantization threshold. Defaults to 0.2.
dimensions_per_block (int) – The number of dimensions per block. Defaults to 2.
threads (int) – The number of threads to use. Defaults to 0 (auto).
index_train_num (int) – The number of samples to train the index. Defaults to 0 (all).
- dump(path)#
Dump the dataclass to a YAML file.
- dumps()#
Dump the dataclass to a YAML string.
- classmethod load(path)#
Load the dataclass from a YAML file.
- classmethod loads(s)#
Load the dataclass from a YAML string.
- class flexrag.retriever.index.ScaNNIndex(cfg)[source]#
Bases: DenseIndexBase
ScaNNIndex is a wrapper for the ScaNN library.
ScaNNIndex runs on CPUs with both high speed and accuracy. However, it requires more memory than FaissIndex.
- add_embeddings(embeddings)[source]#
A helper function that adds embeddings to the index.
- Parameters:
embeddings (np.ndarray) – The embeddings to add.
- Returns:
None
- build_index(data)[source]#
Build the index. The index will be serialized automatically if the index_path is set.
- Parameters:
data (Iterable[Any]) – The data to build the index.
- Returns:
None
- property embedding_size#
Return the embedding size of the index.
- save_to_local(index_path=None)[source]#
Serialize the index to self.index_path. If the index_path is given, the index will be serialized to the index_path.
- Parameters:
index_path (str, optional) – The path to serialize the index. Defaults to self.index_path.
- search(query, top_k, **search_kwargs)[source]#
Search for the top_k most similar data indices to the query.
- Parameters:
query (list[Any]) – The query data.
top_k (int, optional) – The number of most similar data indices to return, defaults to 10.
search_kwargs (Any) – Additional search arguments.
- Returns:
The indices and scores of the top_k most similar data indices.
- Return type:
tuple[np.ndarray, np.ndarray]
- class flexrag.retriever.index.BM25IndexConfig(log_interval=10000, batch_size=512, index_path=None, method='lucene', idf_method=None, backend='auto', k1=1.5, b=0.75, delta=0.5, lang='english')[source]#
Configuration class for BM25Index.
- Parameters:
method (str) – BM25S method. Default: “lucene”. Available options: “atire”, “bm25l”, “bm25+”, “lucene”, “robertson”.
idf_method (Optional[str]) – IDF method. Default: None. Available options: “atire”, “bm25l”, “bm25+”, “lucene”, “robertson”.
backend (str) – Backend for BM25S. Default: “auto”. Available options: “numpy”, “numba”, “auto”.
k1 (float) – BM25S parameter k1. Default: 1.5.
b (float) – BM25S parameter b. Default: 0.75.
delta (float) – BM25S parameter delta. Default: 0.5.
lang (str) – Language for Tokenization. Default: “english”.
- dump(path)#
Dump the dataclass to a YAML file.
- dumps()#
Dump the dataclass to a YAML string.
- classmethod load(path)#
Load the dataclass from a YAML file.
- classmethod loads(s)#
Load the dataclass from a YAML string.
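To make the roles of k1 and b concrete, here is a minimal sketch of the Lucene-style BM25 score as it is commonly defined. It is written from the textbook formula and is not taken from bm25s or FlexRAG, so details such as tokenization and the exact IDF may differ from the library; the function name bm25_lucene_score is hypothetical. The delta parameter is left out because it is typically only relevant to the “bm25l” and “bm25+” variants.

```python
import math


def bm25_lucene_score(query_terms, doc_terms, corpus, k1=1.5, b=0.75):
    """Score one tokenized document against a tokenized query.

    corpus is a list of tokenized documents and is used for the document
    frequencies and the average document length.
    """
    n_docs = len(corpus)
    avg_len = sum(len(doc) for doc in corpus) / n_docs
    score = 0.0
    for term in set(query_terms):
        df = sum(1 for doc in corpus if term in doc)
        if df == 0:
            continue  # the term never occurs, so it contributes nothing
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        tf = doc_terms.count(term)
        # b controls how strongly long documents are penalized;
        # k1 controls how quickly repeated terms saturate.
        length_norm = k1 * (1 - b + b * len(doc_terms) / avg_len)
        score += idf * tf / (tf + length_norm)
    return score
```

Setting b to 0 disables length normalization, and a larger k1 lets repeated occurrences of a term keep adding to the score for longer.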
- class flexrag.retriever.index.BM25Index(cfg)[source]#
Bases: RetrieverIndexBase
BM25Index is an index that retrieves passages using the BM25 algorithm. The implementation is based on the bm25s project.
- build_index(data)[source]#
Build the index. The index will be serialized automatically if the index_path is set.
- Parameters:
data (Iterable[Any]) – The data to build the index.
- Returns:
None
- property infimum#
Return the infimum of the similarity scores for the index.
- insert(data)[source]#
Add a batch of data to the index.
- Parameters:
data (list[Any]) – The data to add.
serialize (bool) – Whether to serialize the index after adding data. Defaults to True.
- Returns:
None
- save_to_local(index_path=None)[source]#
Serialize the index to self.index_path. If the index_path is given, the index will be serialized to the index_path.
- Parameters:
index_path (str, optional) – The path to serialize the index. Defaults to self.index_path.
- search(query, top_k, **search_kwargs)[source]#
Search for the top_k most similar data indices to the query.
- Parameters:
query (list[Any]) – The query data.
top_k (int, optional) – The number of most similar data indices to return, defaults to 10.
search_kwargs (Any) – Additional search arguments.
- Returns:
The indices and scores of the top_k most similar data indices.
- Return type:
tuple[np.ndarray, np.ndarray]
- property supremum#
Return the supremum of the similarity scores for the index.
- class flexrag.retriever.index.MultiFieldIndexConfig(indexed_fields, merge_method='max')[source]#
Configuration for MultiFieldIndex.
- Parameters:
indexed_fields (list[str]) – Fields to be indexed. If more than one field is specified, each field will be processed separately and pointed to the same id.
merge_method (str) – The method to merge the scores of the same context id. Available options are “max”, “sum”, “mean”, and “concat”. “max” will take the maximum score of the same context id. “sum” will take the sum of the scores of the same context id. “mean” will take the average of the scores of the same context id. “concat” will concatenate the texts of each field and index them together. Note that “concat” is only available if all indexed fields are of type str. If only one field is specified, this argument will be ignored. Defaults to “max”.
- dump(path)#
Dump the dataclass to a YAML file.
- dumps()#
Dump the dataclass to a YAML string.
- classmethod load(path)#
Load the dataclass from a YAML file.
- classmethod loads(s)#
Load the dataclass from a YAML string.
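The “max”, “sum”, and “mean” options of merge_method in MultiFieldIndexConfig all reduce several per-field scores for the same context id to one score. The sketch below shows that reduction on plain (context_id, score) pairs; the function name merge_field_scores is hypothetical, and “concat” is not covered because it changes what is indexed rather than how scores are merged.

```python
from collections import defaultdict


def merge_field_scores(hits, merge_method="max"):
    """Merge (context_id, score) hits that may repeat a context id.

    Returns (context_id, merged_score) pairs sorted by descending score.
    """
    reducers = {
        "max": max,
        "sum": sum,
        "mean": lambda scores: sum(scores) / len(scores),
    }
    if merge_method not in reducers:
        raise ValueError(f"Unsupported merge_method: {merge_method}")
    grouped = defaultdict(list)
    for context_id, score in hits:
        grouped[context_id].append(score)
    reducer = reducers[merge_method]
    merged = {cid: reducer(scores) for cid, scores in grouped.items()}
    return sorted(merged.items(), key=lambda item: item[1], reverse=True)
```

With “max”, a context ranks as high as its best-matching field; with “sum” or “mean”, a context that matches moderately well in several fields can overtake one that matches strongly in a single field.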
- class flexrag.retriever.index.MultiFieldIndex(cfg, index)[source]#
Bases: object
A wrapper index for multiple field contexts.
- build_index(context_ids, data)[source]#
Build the index. The index will be serialized automatically if the index_path is set.
- Parameters:
context_ids (Iterable[str]) – The context ids of the data.
data (Iterable[dict[str, Any]]) – The data to build the index.
- Returns:
None
- insert(context_ids, data, serialize=True)[source]#
Add a batch of data to the index.
- Parameters:
context_ids (list[str]) – The context ids of the data.
data (list[dict[str, Any]]) – The data to add.
serialize (bool) – Whether to serialize the index after adding data. Defaults to True.
- Returns:
None
- insert_batch(context_ids, data, batch_size=None, serialize=True)[source]#
Add data to the index in batches. This method will automatically perform the serialize method if the index_path is set.
- Parameters:
context_ids (Iterable[str]) – The context ids of the data.
data (Iterable[dict[str, Any]]) – The data to add.
batch_size (int) – The batch size to add data to the index. Defaults to self.batch_size.
serialize (bool) – Whether to serialize the index after adding data. Defaults to True.
- Returns:
None
- property is_addable#
Check if the index is addable.
- save_to_local(index_path=None)[source]#
Serialize the index to the given path.
- Parameters:
index_path (str) – The path to save the index. If None, the index will be saved to self.index.cfg.index_path.
- Returns:
None
- search(query, top_k, **search_kwargs)[source]#
Search for the top_k most similar data indices to the query.
- Parameters:
query (list[Any]) – The query data.
top_k (int, optional) – The number of most similar data indices to return, defaults to 10.
search_kwargs (Any) – Additional search arguments.
- Returns:
The indices and scores of the top_k most similar data indices.
- Return type:
tuple[list[list[str]], np.ndarray]
- search_batch(query, top_k, **search_kwargs)[source]#
Search for the top_k most similar data indices to the query. This method will search the index in batches.
- Parameters:
query (list[Any]) – The query data.
top_k (int, optional) – The number of most similar data indices to return, defaults to 10.
batch_size (Optional[int]) – The batch size to search. Defaults to self.batch_size.
search_kwargs (Any) – Additional search arguments.
- Returns:
The indices and scores of the top_k most similar data indices.
- Return type:
tuple[list[list[str]], np.ndarray]
Web Retriever#
WebRetriever is used to retrieve data from the web. Unlike an EditableRetriever, a web retriever can be used without building a knowledge base, because it retrieves data through web search engines.
- class flexrag.retriever.web_retrievers.WebRetrieverBaseConfig(log_interval=100, top_k=10, batch_size=32, query_preprocess_pipeline=<factory>, retry_times=3, retry_delay=0.5)[source]#
The configuration for the WebRetrieverBase.
- Parameters:
retry_times (int) – The number of times to retry. Default is 3.
retry_delay (float) – The delay between retries. Default is 0.5.
- dump(path)#
Dump the dataclass to a YAML file.
- dumps()#
Dump the dataclass to a YAML string.
- classmethod load(path)#
Load the dataclass from a YAML file.
- classmethod loads(s)#
Load the dataclass from a YAML string.
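The retry_times and retry_delay options of WebRetrieverBaseConfig describe a retry loop around a web request. A minimal sketch of such a loop follows. It assumes that retry_times counts the total number of attempts and that the delay is constant between attempts; the source does not state either point, and the function name call_with_retry is hypothetical.

```python
import time


def call_with_retry(func, retry_times=3, retry_delay=0.5, sleep=time.sleep):
    """Call func() until it succeeds or retry_times attempts have failed.

    The sleep argument is injectable so the loop can be tested without
    actually waiting.
    """
    if retry_times < 1:
        raise ValueError("retry_times must be at least 1")
    last_error = None
    for attempt in range(retry_times):
        try:
            return func()
        except Exception as error:  # a real client would catch narrower errors
            last_error = error
            if attempt < retry_times - 1:
                sleep(retry_delay)
    raise last_error
```

The last failure is re-raised once all attempts are used up, so the caller still sees the original error.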
- class flexrag.retriever.web_retrievers.WebRetrieverBase(cfg)[source]#
The base class for the WebRetriever.
The WebRetriever is used to retrieve relevant information from the web. Subclasses should implement the search_item method.
- async async_search(query, **search_kwargs)#
Search queries asynchronously.
- abstract property fields#
The fields of the retrieved data.
- search(**kwargs)#
Search a batch of queries.
- Parameters:
query (list[Any] | Any) – Queries to search.
search_kwargs (Any) – Keyword arguments, contains other search arguments.
- Returns:
One list per query, each containing k RetrievedContext objects.
- Return type:
list[list[RetrievedContext]]
- abstract search_item(query, top_k, **search_kwargs)[source]#
Search the query from the web.
- Parameters:
query (str) – The query to search.
top_k (int) – The number of documents to return.
- Returns:
The retrieved contexts.
- Return type:
list[RetrievedContext]
- test_speed(sample_num=10000, test_times=10, **search_kwargs)#
Test the speed of the retriever.
- Parameters:
sample_num (int, optional) – The number of samples to test.
test_times (int, optional) – The number of times to test.
- Returns:
The time consumed for retrieval.
- Return type:
float
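The split between search (batched) and search_item (one query) can be illustrated without the library. The sketch below is a stand-in and not the FlexRAG class: SketchWebRetriever and EchoRetriever are hypothetical names, the contexts are plain dicts rather than RetrievedContext objects, and the way top_k is passed through search_kwargs is an assumption.

```python
from abc import ABC, abstractmethod


class SketchWebRetriever(ABC):
    """Stand-in for the WebRetrieverBase contract; not the FlexRAG class."""

    def __init__(self, top_k=10):
        self.top_k = top_k

    @abstractmethod
    def search_item(self, query, top_k, **search_kwargs):
        """Return up to top_k contexts for a single query string."""

    def search(self, query, **search_kwargs):
        # Accept a single query or a batch, as the documented signature allows.
        queries = [query] if isinstance(query, str) else list(query)
        top_k = search_kwargs.pop("top_k", self.top_k)
        return [self.search_item(q, top_k, **search_kwargs) for q in queries]


class EchoRetriever(SketchWebRetriever):
    """Toy subclass that fabricates contexts instead of calling the web."""

    def search_item(self, query, top_k, **search_kwargs):
        return [{"query": query, "rank": i} for i in range(top_k)]


results = EchoRetriever(top_k=2).search(["what is RAG?", "FlexRAG"])
single = EchoRetriever().search("x", top_k=1)
```

A subclass only supplies search_item; batching stays in the base class.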
- class flexrag.retriever.web_retrievers.WebResource(url, query=None, metadata=<factory>, data=None)[source]#
The web resource dataclass.
WebResource is the fundamental component for information transmission in the web_retrievers module of FlexRAG. The WebSeeker retrieves the corresponding WebResource based on the user’s query, while the WebDownloader downloads the resource based on the URL in the WebResource and stores it in the data field of the WebResource. The WebReader then converts the data field of the WebResource into an LLM-friendly format and returns the RetrievedContext.
- Parameters:
url (str) – The URL of the resource.
query (Optional[str]) – The query for the resource. Default is None.
metadata (dict) – The metadata of the resource, often provided by the WebSeeker. Default is {}.
data (Any) – The content of the resource, often filled by the WebDownloader. Default is None.
- dump(path)#
Dump the dataclass to a YAML file.
- dumps()#
Dump the dataclass to a YAML string.
- classmethod load(path)#
Load the dataclass from a YAML file.
- classmethod loads(s)#
Load the dataclass from a YAML string.
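The seeker, downloader, and reader stages described for WebResource can be sketched as three plain functions. This is an illustration only: SketchWebResource mirrors the documented fields but is not the FlexRAG class, the stage functions are hypothetical, the download is faked so that no network access is needed, and the contexts are plain dicts.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class SketchWebResource:
    """Mirrors the documented WebResource fields; not the FlexRAG class."""

    url: str
    query: Optional[str] = None
    metadata: dict = field(default_factory=dict)
    data: Any = None


def seek(query):
    # Seeker stage: produce resources that carry only a URL and metadata.
    return [
        SketchWebResource(
            url="https://example.com/a",
            query=query,
            metadata={"title": "Example"},
        )
    ]


def download(resources):
    # Downloader stage: fill the data field (faked here, no network access).
    for res in resources:
        res.data = f"<html><body>content of {res.url}</body></html>"
    return resources


def read(resources):
    # Reader stage: turn the data field into a context for the LLM.
    return [{"source": res.url, "text": res.data} for res in resources]


contexts = read(download(seek("example query")))
```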
FlexRAG provides two simple web retrievers, SimpleWebRetriever and WikipediaRetriever.
- class flexrag.retriever.SimpleWebRetrieverConfig(search_engine_type=None, bing_config=<factory>, ddg_config=<factory>, google_config=<factory>, serpapi_config=<factory>, web_reader_type=None, jina_readerlm_config=<factory>, jina_reader_config=<factory>, screenshot_config=<factory>, log_interval=100, top_k=10, batch_size=32, query_preprocess_pipeline=<factory>, retry_times=3, retry_delay=0.5)[source]#
The configuration for the SimpleWebRetriever.
- dump(path)#
Dump the dataclass to a YAML file.
- dumps()#
Dump the dataclass to a YAML string.
- classmethod load(path)#
Load the dataclass from a YAML file.
- classmethod loads(s)#
Load the dataclass from a YAML string.
- class flexrag.retriever.SimpleWebRetriever(cfg)[source]#
Bases: WebRetrieverBase
SimpleWebRetriever seeks the most relevant web pages using an existing search engine and reads the content using the WebReader.
- property fields#
The fields of the retrieved data.
- search_item(query, top_k=10, **search_kwargs)[source]#
Search the query from the web.
- Parameters:
query (str) – The query to search.
top_k (int) – The number of documents to return.
- Returns:
The retrieved contexts.
- Return type:
list[RetrievedContext]
- class flexrag.retriever.WikipediaRetrieverConfig(log_interval=100, top_k=10, batch_size=32, query_preprocess_pipeline=<factory>, search_url='https://en.wikipedia.org/w/index.php?search=', proxy=None)[source]#
The configuration for the WikipediaRetriever.
- Parameters:
search_url (str) – The search URL for Wikipedia. Default is “https://en.wikipedia.org/w/index.php?search=”.
proxy (Optional[str]) – The proxy to use. Default is None.
- dump(path)#
Dump the dataclass to a YAML file.
- dumps()#
Dump the dataclass to a YAML string.
- classmethod load(path)#
Load the dataclass from a YAML file.
- classmethod loads(s)#
Load the dataclass from a YAML string.
- class flexrag.retriever.WikipediaRetriever(cfg)[source]#
Bases: RetrieverBase
WikipediaRetriever retrieves information from Wikipedia directly. Adapted from ysymyth/ReAct.
- property fields#
The fields of the retrieved data.
- search(query, delay=0.1, **search_kwargs)[source]#
Search a batch of queries.
- Parameters:
query (list[Any] | Any) – Queries to search.
search_kwargs (Any) – Keyword arguments, contains other search arguments.
- Returns:
A batch of lists, each containing k RetrievedContext items.
- Return type:
list[list[RetrievedContext]]
Web Seeker#
WebSeeker searches the web for resources that match a given query.
The web resources could be sought by walking through a set of given web pages, by using a search engine, etc.
FlexRAG provides several web seekers using existing search engines.
- class flexrag.retriever.web_retrievers.WebSeekerBase[source]#
The base class for the WebSeeker. The WebSeeker is used to seek the web resources for a given query. The web resources could be sought by walking through a set of given web pages, by using a search engine, etc.
The subclasses should implement the seek method.
- abstract seek(query, top_k=10, **kwargs)[source]#
Seek the web resources.
- Parameters:
query (str) – The query to seek.
top_k (int) – The number of resources to seek. Default is 10.
kwargs – The additional keyword arguments.
- Returns:
The web resources.
- Return type:
list[WebResource]
- class flexrag.retriever.web_retrievers.WebSeekerConfig(web_seeker_type=None, bing_config=<factory>, ddg_config=<factory>, google_config=<factory>, serpapi_config=<factory>)#
Configuration class for web_seeker (name: WebSeekerConfig, default: None).
- Parameters:
web_seeker_type (str) – The web_seeker type to use.
bing_config (BingEngineConfig) – The config for BingEngine.
ddg_config (DuckDuckGoEngineConfig) – The config for DuckDuckGoEngine.
google_config (GoogleEngineConfig) – The config for GoogleEngine.
serpapi_config (SerpApiConfig) – The config for SerpApi.
WebSeekerConfig is the general configuration for all registered WebSeekers.
You can load any registered WebSeeker by specifying the web_seeker_type in the configuration.
For example, to load the DuckDuckGoEngine, you can use the following configuration:
from flexrag.retriever.web_retrievers import WebSeekerConfig, WEB_SEEKERS

config = WebSeekerConfig(
    web_seeker_type='ddg',
)
seeker = WEB_SEEKERS.load(config)
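Loading by type name means the registry picks an implementation and hands it the matching sub-config. One way such a registry could work is sketched below. SketchRegistry, its register decorator, and the rule that the sub-config lives in an attribute named after the type are assumptions for illustration, not FlexRAG's implementation.

```python
from types import SimpleNamespace


class SketchRegistry:
    """Hypothetical name-to-class registry, loosely modeled on WEB_SEEKERS."""

    def __init__(self, type_field):
        self.type_field = type_field  # e.g. "web_seeker_type"
        self._items = {}

    def register(self, name):
        def decorator(cls):
            self._items[name] = cls
            return cls

        return decorator

    def load(self, config):
        name = getattr(config, self.type_field)
        if name is None:
            return None  # nothing selected
        # Assumed convention: the sub-config attribute is "<name>_config".
        sub_config = getattr(config, f"{name}_config")
        return self._items[name](sub_config)


SEEKERS = SketchRegistry("web_seeker_type")


@SEEKERS.register("ddg")
class FakeDuckDuckGo:
    """Placeholder seeker that only stores its proxy setting."""

    def __init__(self, cfg):
        self.proxy = cfg.proxy


config = SimpleNamespace(
    web_seeker_type="ddg",
    ddg_config=SimpleNamespace(proxy=None),
)
seeker = SEEKERS.load(config)
```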
- class flexrag.retriever.web_retrievers.SearchEngineConfig(search_engine_type=None, bing_config=<factory>, ddg_config=<factory>, google_config=<factory>, serpapi_config=<factory>)#
Configuration class for search_engine (name: SearchEngineConfig, default: None).
- Parameters:
search_engine_type (str) – The search_engine type to use.
bing_config (BingEngineConfig) – The config for BingEngine.
ddg_config (DuckDuckGoEngineConfig) – The config for DuckDuckGoEngine.
google_config (GoogleEngineConfig) – The config for GoogleEngine.
serpapi_config (SerpApiConfig) – The config for SerpApi.
SearchEngine is a type of WebSeeker that searches for web resources by leveraging existing search engines.
SearchEngineConfig is the general configuration for all registered SearchEngines.
You can load any registered SearchEngine by specifying the search_engine_type in the configuration.
For example, to load the DuckDuckGoEngine, you can use the following configuration:
from flexrag.retriever.web_retrievers import SearchEngineConfig, SEARCH_ENGINES

config = SearchEngineConfig(
    search_engine_type='ddg',
)
seeker = SEARCH_ENGINES.load(config)
- class flexrag.retriever.web_retrievers.BingEngineConfig(subscription_key='EMPTY', base_url='https://api.bing.microsoft.com/v7.0/search', timeout=3.0, market='en-US', lang='en', freshness=None)[source]#
The configuration for the BingEngine.
- Parameters:
subscription_key (str) – The subscription key for the Bing Search API. Default is os.environ.get(“BING_SEARCH_KEY”, “EMPTY”).
base_url (str) – The base_url for the Bing Search API. Default is “https://api.bing.microsoft.com/v7.0/search”.
timeout (float) – The timeout for the requests. Default is 3.0.
market (str) – The market to use (see https://learn.microsoft.com/en-us/bing/search-apis/bing-web-search/reference/market-codes for more information). Default is en-US.
lang (str) – The language to use. Default is “en”.
freshness (Optional[str]) – To get articles discovered by Bing during a specific timeframe, specify a date range in the form, YYYY-MM-DD..YYYY-MM-DD. For example, &freshness=2019-02-01..2019-05-30. Default is None.
- dump(path)#
Dump the dataclass to a YAML file.
- dumps()#
Dump the dataclass to a YAML string.
- classmethod load(path)#
Load the dataclass from a YAML file.
- classmethod loads(s)#
Load the dataclass from a YAML string.
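The freshness option expects a date range written as YYYY-MM-DD..YYYY-MM-DD. A small helper such as the one below can build that string from date objects. The name freshness_range is hypothetical, and rejecting a reversed range is a choice made for this sketch.

```python
from datetime import date


def freshness_range(start, end):
    """Format two dates as the YYYY-MM-DD..YYYY-MM-DD range described above."""
    if start > end:
        raise ValueError("start must not be after end")
    return f"{start.isoformat()}..{end.isoformat()}"


freshness = freshness_range(date(2019, 2, 1), date(2019, 5, 30))
```

The resulting string can then be passed as the freshness argument of BingEngineConfig.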
- class flexrag.retriever.web_retrievers.BingEngine(cfg)[source]#
Bases: WebSeekerBase
The BingEngine retrieves web pages using the Bing Search API.
- seek(query, top_k=10, **search_kwargs)[source]#
Seek the web resources.
- Parameters:
query (str) – The query to seek.
top_k (int) – The number of resources to seek. Default is 10.
kwargs – The additional keyword arguments.
- Returns:
The web resources.
- Return type:
list[WebResource]
- class flexrag.retriever.web_retrievers.DuckDuckGoEngineConfig(proxy=None)[source]#
The configuration for the DuckDuckGoEngine.
- Parameters:
proxy (Optional[str]) – The proxy to use. Default is None.
- dump(path)#
Dump the dataclass to a YAML file.
- dumps()#
Dump the dataclass to a YAML string.
- classmethod load(path)#
Load the dataclass from a YAML file.
- classmethod loads(s)#
Load the dataclass from a YAML string.
- class flexrag.retriever.web_retrievers.DuckDuckGoEngine(cfg)[source]#
Bases: WebSeekerBase
The DuckDuckGoEngine retrieves the web pages using the DuckDuckGo Search API.
- seek(query, top_k=10, **search_kwargs)[source]#
Seek the web resources.
- Parameters:
query (str) – The query to seek.
top_k (int) – The number of resources to seek. Default is 10.
kwargs – The additional keyword arguments.
- Returns:
The web resources.
- Return type:
list[WebResource]
- class flexrag.retriever.web_retrievers.GoogleEngineConfig(subscription_key=None, search_engine_id=None, endpoint='https://customsearch.googleapis.com/customsearch/v1', proxy=None, timeout=3.0)[source]#
The configuration for the GoogleEngine.
- Parameters:
subscription_key (str) – The subscription key for the Google Search API. If not provided, it will use the environment variable GOOGLE_SEARCH_KEY. Defaults to None.
search_engine_id (str) – The search engine id for the Google Search API. If not provided, it will use the environment variable GOOGLE_SEARCH_ENGINE_ID. Defaults to None.
endpoint (str) – The endpoint for the Google Search API. Default is “https://customsearch.googleapis.com/customsearch/v1”.
proxy (Optional[str]) – The proxy to use. Default is None.
timeout (float) – The timeout for the requests. Default is 3.0.
- dump(path)#
Dump the dataclass to a YAML file.
- dumps()#
Dump the dataclass to a YAML string.
- classmethod load(path)#
Load the dataclass from a YAML file.
- classmethod loads(s)#
Load the dataclass from a YAML string.
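Several configurations here fall back to an environment variable when a key is not passed explicitly (GOOGLE_SEARCH_KEY, GOOGLE_SEARCH_ENGINE_ID, SERPAPI_API_KEY, JINA_API_KEY). The sketch below shows that fallback pattern in isolation. The helper resolve_key is hypothetical, and raising ValueError when neither source is set is an assumption, since the behavior in that case is not documented here.

```python
import os


def resolve_key(explicit, env_name, environ=os.environ):
    """Return the explicit key if given, else the environment variable."""
    if explicit is not None:
        return explicit
    value = environ.get(env_name)
    if value is None:
        # Assumed behavior: fail early rather than send an empty key.
        raise ValueError(f"set {env_name} or pass the key explicitly")
    return value


# A plain dict stands in for os.environ so the example has no side effects.
fake_env = {"GOOGLE_SEARCH_KEY": "from-env"}
key_from_env = resolve_key(None, "GOOGLE_SEARCH_KEY", environ=fake_env)
key_explicit = resolve_key("from-arg", "GOOGLE_SEARCH_KEY", environ=fake_env)
```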
- class flexrag.retriever.web_retrievers.GoogleEngine(cfg)[source]#
Bases: WebSeekerBase
The GoogleEngine retrieves the web pages using the Google Custom Search API.
- seek(query, top_k=10, **search_kwargs)[source]#
Seek the web resources.
- Parameters:
query (str) – The query to seek.
top_k (int) – The number of resources to seek. Default is 10.
kwargs – The additional keyword arguments.
- Returns:
The web resources.
- Return type:
list[WebResource]
- class flexrag.retriever.web_retrievers.SerpApiConfig(api_key=None, engine='google', country='us', language='en')[source]#
The configuration for the SerpApi.
- Parameters:
api_key (str) – The API key for the SerpApi. If not provided, it will use the environment variable SERPAPI_API_KEY. Defaults to None.
engine (str) – The search engine to use. Default is “google”. Available choices are “google”, “bing”, “baidu”, “yandex”, “yahoo”, “google_scholar”, “duckduckgo”.
country (str) – The country to search. Default is “us”.
language (str) – The language to search. Default is “en”.
- dump(path)#
Dump the dataclass to a YAML file.
- dumps()#
Dump the dataclass to a YAML string.
- classmethod load(path)#
Load the dataclass from a YAML file.
- classmethod loads(s)#
Load the dataclass from a YAML string.
- class flexrag.retriever.web_retrievers.SerpApi(cfg)[source]#
Bases: WebSeekerBase
The SerpApi retrieves the web pages using SerpApi (https://serpapi.com/).
- seek(query, top_k=10, **search_kwargs)[source]#
Seek the web resources.
- Parameters:
query (str) – The query to seek.
top_k (int) – The number of resources to seek. Default is 10.
kwargs – The additional keyword arguments.
- Returns:
The web resources.
- Return type:
list[WebResource]
Web Downloader#
Web downloader is used to download data from the web.
- class flexrag.retriever.web_retrievers.WebDownloaderBaseConfig(allow_parallel=True)[source]#
The configuration for the WebDownloaderBase.
- Parameters:
allow_parallel (bool) – Whether to allow parallel downloading. Default is True.
- dump(path)#
Dump the dataclass to a YAML file.
- dumps()#
Dump the dataclass to a YAML string.
- classmethod load(path)#
Load the dataclass from a YAML file.
- classmethod loads(s)#
Load the dataclass from a YAML string.
- class flexrag.retriever.web_retrievers.WebDownloaderBase(cfg)[source]#
The base class for the WebDownloader.
- download(resources)[source]#
Download the web resources.
- Parameters:
resources (WebResource | list[WebResource]) – The resources to download.
- Returns:
The downloaded web resources.
- Return type:
list[WebResource]
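The download method accepts either one resource or a list and always returns a list, and allow_parallel controls whether downloads may run concurrently. A minimal sketch of that shape follows. Res, fake_fetch, and the thread pool with four workers are assumptions for illustration, and the actual downloaders may use a different concurrency mechanism.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Any


@dataclass
class Res:
    """Minimal stand-in for WebResource (url and data only)."""

    url: str
    data: Any = None


def fake_fetch(res):
    # Placeholder for a real HTTP request; fills the data field in place.
    res.data = f"body of {res.url}"
    return res


def download(resources, allow_parallel=True, fetch=fake_fetch):
    # Accept one resource or a list, and always return a list.
    if not isinstance(resources, list):
        resources = [resources]
    if allow_parallel and len(resources) > 1:
        with ThreadPoolExecutor(max_workers=4) as pool:
            return list(pool.map(fetch, resources))  # map preserves input order
    return [fetch(res) for res in resources]


single = download(Res("https://example.com/1"))
many = download([Res(f"https://example.com/{i}") for i in range(3)])
```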
- class flexrag.retriever.web_retrievers.SimpleWebDownloaderConfig(allow_parallel=True, proxy=None, timeout=3.0, headers=None)[source]#
The configuration for the SimpleWebDownloader.
- Parameters:
proxy (Optional[str]) – The proxy to use. Default is None.
timeout (float) – The timeout for the requests. Default is 3.0.
headers (Optional[dict]) – The headers to use. Default is None.
- dump(path)#
Dump the dataclass to a YAML file.
- dumps()#
Dump the dataclass to a YAML string.
- classmethod load(path)#
Load the dataclass from a YAML file.
- classmethod loads(s)#
Load the dataclass from a YAML string.
- class flexrag.retriever.web_retrievers.SimpleWebDownloader(cfg)[source]#
Bases: WebDownloaderBase
Downloads the HTML content using httpx.
- class flexrag.retriever.web_retrievers.PlaywrightWebDownloaderConfig(allow_parallel=True, headless=True, browser='chromium', device='Desktop Chrome', page_width=None, page_height=None, proxy=None, return_screenshot=False)[source]#
The configuration for the PlaywrightWebDownloader.
- Parameters:
headless (bool) – Whether to run the browser in headless mode. Default is True.
browser (str) – The browser to use. Default is chromium. Available choices are chromium, firefox, webkit, and msedge.
device (str) – The device to emulate. Default is Desktop Chrome.
page_width (Optional[int]) – The width of the emulated device. Default is None.
page_height (Optional[int]) – The height of the emulated device. Default is None.
proxy (Optional[str]) – The proxy to use. Default is None.
return_screenshot (bool) – Whether to return the screenshot. Default is False.
- dump(path)#
Dump the dataclass to a YAML file.
- dumps()#
Dump the dataclass to a YAML string.
- classmethod load(path)#
Load the dataclass from a YAML file.
- classmethod loads(s)#
Load the dataclass from a YAML string.
- class flexrag.retriever.web_retrievers.PlaywrightWebDownloader(cfg)[source]#
Bases: WebDownloaderBase
Downloads the web resources using Playwright.
- download(resources)[source]#
Download the web resources.
- Parameters:
resources (WebResource | list[WebResource]) – The resources to download.
- Returns:
The downloaded web resources.
- Return type:
list[WebResource]
Web Reader#
Web reader converts web data into an LLM-friendly format.
- class flexrag.retriever.web_retrievers.WebReaderBase[source]#
The base class for the WebReader. The WebReader is used to parse the web resources into a format that can be fed into the LLM.
- abstract property fields#
The fields that the reader will return.
- abstract read(resources)[source]#
Parse the retrieved contexts into LLM readable format.
- Parameters:
resources (list[WebResource]) – Resources sought from the web.
- Returns:
Contexts that can be fed into the LLM.
- Return type:
list[RetrievedContext]
- class flexrag.retriever.web_retrievers.WebReaderConfig(web_reader_type=None, jina_readerlm_config=<factory>, jina_reader_config=<factory>, screenshot_config=<factory>)#
Configuration class for web_reader (name: WebReaderConfig, default: None).
- Parameters:
web_reader_type (str) – The web_reader type to use.
jina_readerlm_config (JinaReaderLMConfig) – The config for JinaReaderLM.
jina_reader_config (JinaReaderConfig) – The config for JinaReader.
screenshot_config (ScreenshotWebReaderConfig) – The config for ScreenshotWebReader.
WebReaderConfig is the general configuration for all registered WebReaders.
You can load any WebReader by specifying the web_reader_type in the configuration.
For example, to load the JinaReader, you can use the following configuration:
from flexrag.retriever.web_retrievers import WebReaderConfig, WEB_READERS

config = WebReaderConfig(
    web_reader_type='jina_reader',
)
reader = WEB_READERS.load(config)
- class flexrag.retriever.web_retrievers.JinaReaderConfig(base_url='https://r.jina.ai', api_key=None, proxy=None)[source]#
The configuration for the JinaReader.
- Parameters:
base_url (str) – The base URL of the Jina Reader API. Default is “https://r.jina.ai”.
api_key (str) – The API key for the Jina Reader API. If not provided, it will use the environment variable JINA_API_KEY. Defaults to None.
proxy (Optional[str]) – The proxy to use. Defaults to None.
- dump(path)#
Dump the dataclass to a YAML file.
- dumps()#
Dump the dataclass to a YAML string.
- classmethod load(path)#
Load the dataclass from a YAML file.
- classmethod loads(s)#
Load the dataclass from a YAML string.
- class flexrag.retriever.web_retrievers.JinaReader(cfg)[source]#
Bases: WebReaderBase
The JinaReader parses the web pages using the Jina Reader API.
- property fields#
The JinaReader will return the processed_content field.
- read(resources)[source]#
Parse the retrieved contexts into LLM readable format.
- Parameters:
resources (list[WebResource]) – Resources sought from the web.
- Returns:
Contexts that can be fed into the LLM.
- Return type:
list[RetrievedContext]
- class flexrag.retriever.web_retrievers.JinaReaderLMConfig(do_sample=True, sample_num=1, temperature=1.0, max_new_tokens=512, top_p=0.9, top_k=50, eos_token_id=None, stop_str=<factory>, web_downloader_type=None, simple_config=<factory>, playwright_config=<factory>, generator_type=None, anthropic_config=<factory>, hf_config=<factory>, hf_vlm_config=<factory>, ollama_config=<factory>, openai_config=<factory>, vllm_config=<factory>, use_v2_prompt=False, pre_clean_html=False, clean_svg=False, clean_base64=False)[source]#
The configuration for the JinaReaderLM.
- Parameters:
use_v2_prompt (bool) – Whether to use the jinaai/ReaderLM-v2 prompt. Default is False.
pre_clean_html (bool) – Whether to pre-clean the HTML content. Default is False.
clean_svg (bool) – Whether to clean the SVG content. Default is False.
clean_base64 (bool) – Whether to clean the base64 images. Default is False.
- dump(path)#
Dump the dataclass to a YAML file.
- dumps()#
Dump the dataclass to a YAML string.
- classmethod load(path)#
Load the dataclass from a YAML file.
- classmethod loads(s)#
Load the dataclass from a YAML string.
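The clean_svg and clean_base64 flags suggest stripping bulky markup before the HTML reaches the model. The sketch below shows one plausible form of that cleaning with regular expressions. The patterns, the replacement strings, and the function name clean_html are assumptions, not the expressions FlexRAG or Jina actually use.

```python
import re

# Assumed patterns: collapse inline SVG bodies and base64 image payloads.
SVG_PATTERN = re.compile(r"<svg\b[^>]*>.*?</svg>", re.DOTALL | re.IGNORECASE)
BASE64_PATTERN = re.compile(r'src="data:image/[^;"]+;base64,[^"]+"')


def clean_html(html, clean_svg=True, clean_base64=True):
    """Shrink HTML by replacing SVG content and base64 images with stubs."""
    if clean_svg:
        html = SVG_PATTERN.sub("<svg></svg>", html)
    if clean_base64:
        html = BASE64_PATTERN.sub('src="#"', html)
    return html


raw = (
    '<p>hi</p><svg width="1"><path d="M0"/></svg>'
    '<img src="data:image/png;base64,AAAA">'
)
cleaned = clean_html(raw)
untouched = clean_html(raw, clean_svg=False, clean_base64=False)
```

Regular expressions are fragile on malformed HTML; an HTML parser would be a more robust choice for production use.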
- class flexrag.retriever.web_retrievers.JinaReaderLM(cfg)[source]#
Bases: WebReaderBase
The JinaReaderLM downloads and parses the HTML content using the Jina ReaderLM model.
- property fields#
The JinaReaderLM will return the raw_content and processed_content fields.
- read(resources)[source]#
Parse the retrieved contexts into LLM readable format.
- Parameters:
resources (list[WebResource]) – Resources sought from the web.
- Returns:
Contexts that can be fed into the LLM.
- Return type:
list[RetrievedContext]
- class flexrag.retriever.web_retrievers.ScreenshotWebReader(cfg)[source]#
Bases: WebReaderBase
The ScreenshotWebReader reads the web pages by taking screenshots.
- property fields#
The ScreenshotWebReader will return the screenshot field.
- read(resources)[source]#
Parse the retrieved contexts into LLM readable format.
- Parameters:
resources (list[WebResource]) – Resources sought from the web.
- Returns:
Contexts that can be fed into the LLM.
- Return type:
list[RetrievedContext]
- class flexrag.retriever.web_retrievers.ScreenshotWebReaderConfig(allow_parallel=True, headless=True, browser='chromium', device='Desktop Chrome', page_width=None, page_height=None, proxy=None, return_screenshot=True)[source]#
The configuration for the ScreenshotWebReader.
- dump(path)#
Dump the dataclass to a YAML file.
- dumps()#
Dump the dataclass to a YAML string.
- classmethod load(path)#
Load the dataclass from a YAML file.
- classmethod loads(s)#
Load the dataclass from a YAML string.
- class flexrag.retriever.web_retrievers.SnippetWebReader[source]#
The SnippetWebReader will return the snippet of the resource directly.
This is useful if the resources are retrieved by the SearchEngine, and the snippets are sufficient for the LLM to generate the response.
- property fields#
The SnippetWebReader will return the snippet field.
- read(resources)[source]#
Parse the retrieved contexts into LLM readable format.
- Parameters:
resources (list[WebResource]) – Resources sought from the web.
- Returns:
Contexts that can be fed into the LLM.
- Return type:
list[RetrievedContext]
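The reader contract (a fields property plus a read method) is small enough to illustrate with a snippet-only reader. The sketch below is a stand-in and not the FlexRAG class. It assumes the search engine stored the snippet under a metadata key named "snippet", and it returns plain dicts rather than RetrievedContext objects.

```python
from types import SimpleNamespace


class SketchSnippetReader:
    """Stand-in reader; assumes the seeker stored a 'snippet' in metadata."""

    @property
    def fields(self):
        return ["snippet"]

    def read(self, resources):
        # No download step: the snippet from the search results is the content.
        return [
            {"source": res.url, "snippet": res.metadata.get("snippet", "")}
            for res in resources
        ]


resources = [
    SimpleNamespace(
        url="https://example.com",
        metadata={"snippet": "A short summary."},
    ),
    SimpleNamespace(url="https://example.org", metadata={}),
]
reader = SketchSnippetReader()
contexts = reader.read(resources)
```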