Chunking
This module provides classes for splitting a long text into smaller chunks.
The Chunker Interface
ChunkerBase is the base class for all chunkers. It defines a simple interface for splitting a text into smaller chunks. The chunking behavior is controlled by a configuration object passed to the chunker’s constructor.
Chunkers
- class flexrag.chunking.CharChunkerConfig(max_chars=2048, overlap=0)
Configuration for CharChunker.
- Parameters:
max_chars (int) – The number of characters in each chunk. Default is 2048.
overlap (int) – The number of characters to overlap between chunks. Default is 0.
For example, to split a text into chunks of 1024 characters with an overlap of 128 characters:
    from flexrag.chunking import CharChunkerConfig, CharChunker

    cfg = CharChunkerConfig(max_chars=1024, overlap=128)
    chunker = CharChunker(cfg)
- dump(path)
Dump the dataclass to a YAML file.
- dumps()
Dump the dataclass to a YAML string.
- classmethod load(path)
Load the dataclass from a YAML file.
- classmethod loads(s)
Load the dataclass from a YAML string.
- class flexrag.chunking.CharChunker(cfg)
Bases: ChunkerBase
CharChunker splits text into chunks with a fixed number of characters.
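The effect of max_chars and overlap can be illustrated with a minimal sketch in plain Python. The helper char_chunks below is hypothetical and is not FlexRAG's implementation; it assumes each window starts max_chars - overlap characters after the previous one.

```python
def char_chunks(text: str, max_chars: int = 2048, overlap: int = 0) -> list[str]:
    """Split text into windows of at most max_chars characters.

    Consecutive windows share `overlap` characters. Hypothetical helper,
    not FlexRAG's implementation.
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must satisfy 0 <= overlap < max_chars")
    step = max_chars - overlap  # how far the window start advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # this window already reached the end of the text
    return chunks


chunks = char_chunks("abcdefghij", max_chars=4, overlap=1)
print(chunks)
```

With max_chars=4 and overlap=1, each chunk repeats the last character of the previous chunk.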
- class flexrag.chunking.TokenChunkerConfig(tokenizer_type='tiktoken', hf_config=<factory>, tiktoken_config=<factory>, moses_config=<factory>, nltk_tokenizer_config=<factory>, jieba_config=<factory>, max_tokens=512, overlap=0)
Bases: tokenizer_config
Configuration for TokenChunker.
- Parameters:
max_tokens (int) – The number of tokens in each chunk. Default is 512.
overlap (int) – The number of tokens to overlap between chunks. Default is 0.
For example, to split a text into chunks of 256 tokens with an overlap of 128 tokens:
    from flexrag.chunking import TokenChunkerConfig, TokenChunker
    from flexrag.models.tokenizer import TikTokenTokenizerConfig

    cfg = TokenChunkerConfig(
        max_tokens=256,
        overlap=128,
        tokenizer_type="tiktoken",
        tiktoken_config=TikTokenTokenizerConfig(model_name="gpt-4o"),
    )
    chunker = TokenChunker(cfg)
Note that the TokenChunker relies on the tokenize and detokenize methods of the tokenizer to split the text. Whitespace between tokens may therefore be lost if the tokenizer is not reversible.
- dump(path)
Dump the dataclass to a YAML file.
- dumps()
Dump the dataclass to a YAML string.
- classmethod load(path)
Load the dataclass from a YAML file.
- classmethod loads(s)
Load the dataclass from a YAML string.
- class flexrag.chunking.TokenChunker(cfg)
Bases: ChunkerBase
TokenChunker splits text into chunks with a fixed number of tokens.
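To see why a non-reversible tokenizer can lose whitespace, here is a small sketch that uses str.split and " ".join as stand-ins for a tokenizer's tokenize and detokenize methods. The function token_chunks is hypothetical and only illustrates the idea; it is not how FlexRAG implements TokenChunker.

```python
def token_chunks(text: str, max_tokens: int = 512, overlap: int = 0) -> list[str]:
    """Hypothetical sliding-window chunker over whitespace tokens."""
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must satisfy 0 <= overlap < max_tokens")
    tokens = text.split()  # stand-in for tokenizer.tokenize
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + max_tokens]
        chunks.append(" ".join(window))  # stand-in for tokenizer.detokenize
        if start + max_tokens >= len(tokens):
            break
    return chunks


text = "one  two\nthree four five"
chunks = token_chunks(text, max_tokens=3, overlap=1)
print(chunks)
```

The double space and the newline in the input do not survive the round trip, because this stand-in tokenizer discards the original whitespace.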
- class flexrag.chunking.RecursiveChunkerConfig(tokenizer_type='tiktoken', hf_config=<factory>, tiktoken_config=<factory>, moses_config=<factory>, nltk_tokenizer_config=<factory>, jieba_config=<factory>, max_tokens=512, split_pattern=<factory>)
Bases: tokenizer_config
Configuration for RecursiveChunker.
- Parameters:
max_tokens (int) – The maximum number of tokens in each chunk. Default is 512.
split_pattern (dict[str, str]) – The separators used to split text recursively. The order of the separators matters. Default is PREDEFINED_SPLIT_PATTERNS["en"].
For example, to split a text recursively with at most 256 tokens in each chunk:
    from flexrag.chunking import RecursiveChunkerConfig, RecursiveChunker

    cfg = RecursiveChunkerConfig(max_tokens=256)
    chunker = RecursiveChunker(cfg)
You can also specify your own separators:
    from flexrag.chunking import RecursiveChunkerConfig, RecursiveChunker

    cfg = RecursiveChunkerConfig(
        max_tokens=256,
        split_pattern={"level1": "pattern1", "level2": "pattern2"},
    )
    chunker = RecursiveChunker(cfg)
Note that the RecursiveChunker relies on regex patterns to split the text, so you need to make sure your patterns do not consume the separator. A good practice is to use lookbehind and lookahead assertions, which match a position without consuming any characters.
- dump(path)
Dump the dataclass to a YAML file.
- dumps()
Dump the dataclass to a YAML string.
- classmethod load(path)
Load the dataclass from a YAML file.
- classmethod loads(s)
Load the dataclass from a YAML string.
- class flexrag.chunking.RecursiveChunker(cfg)
Bases: ChunkerBase
RecursiveChunker splits text into chunks recursively using the specified separators.
The order of the separators matters. The text is split recursively with the separators in the order given. The default separators are defined in PREDEFINED_SPLIT_PATTERNS. If the text is still too long after splitting with the last-level separators, it is split into tokens.
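The recursive strategy described above can be sketched roughly as follows. This is an illustration under stated assumptions, not FlexRAG's code: LEVELS holds simplified patterns that work with the standard re module, count_tokens counts whitespace-separated words instead of using a real tokenizer, and recursive_split is a hypothetical helper. It requires Python 3.7 or later, where re.split splits on empty matches.

```python
import re

# Illustrative levels, coarse to fine. Simplified stand-ins for the
# predefined patterns, chosen so that the standard `re` module accepts them.
LEVELS = [r"(?<=\n)", r"(?<=[.?!])", r"(?<=\s)"]


def count_tokens(text: str) -> int:
    """Stand-in for a real tokenizer: count whitespace-separated words."""
    return len(text.split())


def recursive_split(text: str, max_tokens: int, levels=LEVELS) -> list[str]:
    if count_tokens(text) <= max_tokens:
        return [text]
    if not levels:
        # No separators left: fall back to cutting on tokens (words here).
        words = text.split()
        return [" ".join(words[i:i + max_tokens])
                for i in range(0, len(words), max_tokens)]
    pieces = [p for p in re.split(levels[0], text) if p]
    chunks, current = [], ""
    for piece in pieces:
        if count_tokens(current + piece) <= max_tokens:
            current += piece  # greedily grow the current chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if count_tokens(piece) <= max_tokens:
            current = piece
        else:
            # The piece alone is too long: retry with the finer separators.
            chunks.extend(recursive_split(piece, max_tokens, levels[1:]))
    if current:
        chunks.append(current)
    return chunks


text = "First sentence here. Second one is here too.\nNew paragraph starts. It ends."
chunks = recursive_split(text, max_tokens=4)
print(chunks)
```

Because every pattern is a lookbehind, no separator is consumed, so joining the chunks reproduces the input as long as the token-level fallback is not reached.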
- class flexrag.chunking.SentenceChunkerConfig(sentence_splitter_type='regex', nltk_splitter_config=<factory>, regex_config=<factory>, spacy_config=<factory>, tokenizer_type='tiktoken', hf_config=<factory>, tiktoken_config=<factory>, moses_config=<factory>, nltk_tokenizer_config=<factory>, jieba_config=<factory>, max_sents=None, max_tokens=None, max_chars=None, overlap=0)
Bases: tokenizer_config, SentenceSplitterConfig
Configuration for SentenceChunker.
- Parameters:
max_sents (Optional[int]) – The maximum number of sentences in each chunk. Default is None.
max_tokens (Optional[int]) – The maximum number of tokens in each chunk. Default is None.
max_chars (Optional[int]) – The maximum number of characters in each chunk. Default is None.
overlap (int) – The number of sentences to overlap between chunks. Default is 0.
For example, to split a text into chunks of at most 10 sentences each:
    from flexrag.chunking import SentenceChunkerConfig, SentenceChunker

    cfg = SentenceChunkerConfig(max_sents=10)
    chunker = SentenceChunker(cfg)
Note that the SentenceChunker relies on the sentence splitter to split the text, so whitespace between sentences may be lost if the sentence splitter is not reversible.
- dump(path)
Dump the dataclass to a YAML file.
- dumps()
Dump the dataclass to a YAML string.
- classmethod load(path)
Load the dataclass from a YAML file.
- classmethod loads(s)
Load the dataclass from a YAML string.
- class flexrag.chunking.SentenceChunker(cfg)
Bases: ChunkerBase
SentenceChunker first splits text into sentences using the specified sentence splitter, then merges the sentences into chunks based on the specified constraints.
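The merge step can be pictured as a greedy loop that closes the current chunk whenever adding another sentence would violate a constraint. The sketch below is a hypothetical helper, not FlexRAG's implementation: it handles only max_sents and max_chars, ignores overlap and max_tokens, and assumes a sentence that exceeds a limit on its own is kept as a single chunk.

```python
from typing import Optional


def merge_sentences(
    sentences: list[str],
    max_sents: Optional[int] = None,
    max_chars: Optional[int] = None,
) -> list[str]:
    """Greedily merge sentences; a constraint set to None is not enforced."""
    chunks: list[str] = []
    current: list[str] = []
    for sent in sentences:
        candidate = current + [sent]
        too_many = max_sents is not None and len(candidate) > max_sents
        too_long = max_chars is not None and sum(map(len, candidate)) > max_chars
        if current and (too_many or too_long):
            chunks.append("".join(current))  # close the current chunk
            current = [sent]
        else:
            current = candidate
    if current:
        chunks.append("".join(current))
    return chunks


sentences = ["A b. ", "C d. ", "E f. ", "G."]
by_count = merge_sentences(sentences, max_sents=2)
by_chars = merge_sentences(sentences, max_sents=3, max_chars=12)
print(by_count, by_chars)
```

In the second call the character limit is reached before the sentence limit, so the stricter of the two constraints decides where the chunk ends.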
- class flexrag.chunking.SemanticChunkerConfig(tokenizer_type='tiktoken', hf_config=<factory>, tiktoken_config=<factory>, moses_config=<factory>, nltk_tokenizer_config=<factory>, jieba_config=<factory>, encoder_type=None, cohere_config=<factory>, hf_clip_config=<factory>, jina_config=<factory>, ollama_config=<factory>, openai_config=<factory>, sentence_transformer_config=<factory>, sentence_splitter_type='regex', nltk_splitter_config=<factory>, regex_config=<factory>, spacy_config=<factory>, max_tokens=None, max_tokens_per_sentence=None, threshold=None, threshold_percentile=None, similarity_window=1, similarity_function='COS')
Bases: SentenceSplitterConfig, EncoderConfig, tokenizer_config
Configuration for SemanticChunker.
- Parameters:
max_tokens (Optional[int]) – The maximum number of tokens in each chunk. Default is None.
threshold (Optional[float]) – The threshold for semantic similarity. Default is None. If provided, the threshold_percentile and max_tokens will be ignored.
threshold_percentile (Optional[float]) – The percentile of the similarity distribution to use as the threshold. Default is None. Should be a value between 0 and 100. Higher values will result in more chunks. 5 is a good starting point. If provided, the max_tokens will be ignored.
similarity_window (Optional[int]) – The window size for calculating semantic similarity. Default is 1.
similarity_function (str) – The similarity function to use. Default is “COS”. Available choices are “L2” for the reciprocal of euclidean distance, “IP” for inner product, and “COS” for cosine similarity.
Adjacent sentences whose similarity is above the threshold are considered coherent, and the text is split at the points where the similarity falls below the threshold. At least one of max_tokens, threshold, or threshold_percentile should therefore be provided. If threshold is provided, the text is split directly based on that threshold. If threshold_percentile is provided, the threshold is calculated automatically from the similarity distribution. If max_tokens is provided, the threshold is chosen so that the chunks stay within the token limit.
For example, to split the text into chunks with a maximum of 512 tokens, you can use the following configuration:
>>> from flexrag.chunking import SemanticChunker, SemanticChunkerConfig
>>> from flexrag.models import HFEncoderConfig
>>> config = SemanticChunkerConfig(
...     max_tokens=512,
...     encoder_type="hf",
...     hf_config=HFEncoderConfig(model_path="BAAI/bge-small-en-v1.5"),
... )
>>> chunker = SemanticChunker(config)
To split the text into chunks with a threshold_percentile of 5, you can use the following configuration:
>>> config = SemanticChunkerConfig(
...     threshold_percentile=5,
...     encoder_type="hf",
...     hf_config=HFEncoderConfig(model_path="BAAI/bge-small-en-v1.5"),
... )
>>> chunker = SemanticChunker(config)
To split the text into chunks with a given threshold, you can use the following configuration:
>>> config = SemanticChunkerConfig(
...     threshold=0.8,
...     encoder_type="hf",
...     hf_config=HFEncoderConfig(model_path="BAAI/bge-small-en-v1.5"),
... )
>>> chunker = SemanticChunker(config)
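The three choices for similarity_function can be written out in plain Python for reference. These helpers are illustrative only and are not FlexRAG's functions; in particular, returning infinity for identical vectors in the "L2" case is an assumption made here to avoid dividing by zero.

```python
import math


def inner_product(a: list[float], b: list[float]) -> float:
    """ "IP": the plain dot product of the two vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a: list[float], b: list[float]) -> float:
    """ "COS": dot product divided by the product of the vector norms."""
    norm_a = math.sqrt(inner_product(a, a))
    norm_b = math.sqrt(inner_product(b, b))
    return inner_product(a, b) / (norm_a * norm_b)


def l2_reciprocal(a: list[float], b: list[float]) -> float:
    """ "L2": reciprocal of the Euclidean distance (larger means closer)."""
    dist = math.dist(a, b)
    # Assumption for this sketch: identical vectors get infinite similarity.
    return 1.0 / dist if dist else math.inf


print(inner_product([1, 2], [3, 4]))
print(cosine([1, 0], [0, 1]))
print(l2_reciprocal([0, 0], [3, 4]))
```

All three grow as the vectors become more alike, which is why a single threshold comparison can be applied whichever function is chosen, although suitable threshold values differ between them.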
- dump(path)
Dump the dataclass to a YAML file.
- dumps()
Dump the dataclass to a YAML string.
- classmethod load(path)
Load the dataclass from a YAML file.
- classmethod loads(s)
Load the dataclass from a YAML string.
- class flexrag.chunking.SemanticChunker(cfg)
Bases: ChunkerBase
SemanticChunker splits text into sentences and then groups them into chunks based on semantic similarity. This chunker is inspired by Greg Kamradt’s wonderful notebook: FullStackRetrieval-com/RetrievalTutorials
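As a rough picture of threshold-based grouping, the sketch below compares each sentence embedding with the next one and starts a new chunk wherever the similarity falls below the threshold. It is not FlexRAG's implementation: the embeddings are hand-written toy vectors rather than encoder output, semantic_chunks and percentile are hypothetical helpers, only cosine similarity with a window of one sentence is covered, and max_tokens is ignored.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def percentile(values, q):
    """Percentile with linear interpolation, q in [0, 100]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def semantic_chunks(sentences, embeddings, threshold=None, threshold_percentile=None):
    """Group adjacent sentences; split where similarity drops below the threshold."""
    if len(sentences) < 2:
        return [list(sentences)] if sentences else []
    sims = [cosine(embeddings[i], embeddings[i + 1]) for i in range(len(sentences) - 1)]
    if threshold is None:
        if threshold_percentile is None:
            raise ValueError("provide threshold or threshold_percentile")
        # Derive the threshold from the distribution of similarities.
        threshold = percentile(sims, threshold_percentile)
    chunks, current = [], [sentences[0]]
    for sent, sim in zip(sentences[1:], sims):
        if sim < threshold:
            chunks.append(current)  # similarity dropped: close the chunk
            current = [sent]
        else:
            current.append(sent)
    chunks.append(current)
    return chunks


sentences = ["Cats purr.", "Cats nap.", "Stocks fell.", "Bonds rose."]
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]  # toy vectors
fixed = semantic_chunks(sentences, embeddings, threshold=0.5)
by_percentile = semantic_chunks(sentences, embeddings, threshold_percentile=50)
print(fixed)
```

Raising threshold_percentile raises the derived threshold, so more boundaries fall below it, which matches the note that higher values produce more chunks.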
Sentence Splitters
This submodule provides a set of useful tools for splitting a text into sentences.
- class flexrag.chunking.sentence_splitter.SentenceSplitterBase
Base class for sentence splitters. This abstract class defines the interface for all sentence splitters. Subclasses should implement the split method to split the text. The reversible property should return True if the split sentences can be concatenated back into the original text.
- abstract property reversible
Returns True if the split sentences can be concatenated back into the original text.
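The shape of this interface can be mirrored with a small stand-in built on Python's abc module. SplitterSketch and LineSplitter are hypothetical names used only for illustration, and the exact signature of split is an assumption; a real custom splitter would subclass SentenceSplitterBase instead.

```python
from abc import ABC, abstractmethod


class SplitterSketch(ABC):
    """Hypothetical stand-in mirroring the interface described above."""

    @abstractmethod
    def split(self, text: str) -> list[str]:
        """Return the pieces of `text`."""

    @property
    @abstractmethod
    def reversible(self) -> bool:
        """True if joining the pieces reproduces the original text."""


class LineSplitter(SplitterSketch):
    """Splits on line breaks while keeping them, so nothing is lost."""

    def split(self, text: str) -> list[str]:
        return text.splitlines(keepends=True)

    @property
    def reversible(self) -> bool:
        return True


splitter = LineSplitter()
text = "first line\nsecond line\r\nthird"
parts = splitter.split(text)
print(parts, splitter.reversible)
```

Keeping the line endings attached to each piece is what makes this splitter reversible: concatenating the pieces returns the input unchanged.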
- class flexrag.chunking.sentence_splitter.NLTKSentenceSplitterConfig(language='english')
Configuration for NLTKSentenceSplitter.
- Parameters:
language (str) – The language to use for the sentence splitter. Default is “english”.
- dump(path)
Dump the dataclass to a YAML file.
- dumps()
Dump the dataclass to a YAML string.
- classmethod load(path)
Load the dataclass from a YAML file.
- classmethod loads(s)
Load the dataclass from a YAML string.
- class flexrag.chunking.sentence_splitter.NLTKSentenceSplitter(cfg)
Bases: SentenceSplitterBase
NLTKSentenceSplitter splits text into sentences using NLTK’s PunktSentenceTokenizer. For more information, see https://www.nltk.org/api/nltk.tokenize.punkt.html#module-nltk.tokenize.punkt.
- property reversible
NLTKSentenceSplitter is not reversible as it may lose spaces between sentences.
- class flexrag.chunking.sentence_splitter.RegexSplitterConfig(pattern='(?<=[.?!])')
Configuration for RegexSplitter.
- Parameters:
pattern (str) – The regular expression pattern used to split the text. Default is PREDEFINED_SPLIT_PATTERNS["en"]["sentence"].
Note that some patterns may lose the separators between sentences. A good practice is to use lookbehind and lookahead assertions to avoid consuming the separator.
- dump(path)
Dump the dataclass to a YAML file.
- dumps()
Dump the dataclass to a YAML string.
- classmethod load(path)
Load the dataclass from a YAML file.
- classmethod loads(s)
Load the dataclass from a YAML string.
- class flexrag.chunking.sentence_splitter.RegexSplitter(cfg)
Bases: SentenceSplitterBase
RegexSplitter splits text into sentences using a regular expression pattern.
Note that this splitter uses the regex module, which may behave slightly differently from the built-in re module.
- property reversible
The default RegexSplitter is reversible. However, the reversibility depends on the pattern used.
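The dependence of reversibility on the pattern can be checked with a short experiment. The sketch uses the standard re module rather than regex, on the assumption that the default pattern behaves the same way in both for this input, and it needs Python 3.7 or later, where re.split splits on empty matches.

```python
import re

text = "Hi there! How are you? Fine."

# A lookbehind matches the empty position after the punctuation,
# so the punctuation stays attached to the sentence before it.
kept = [s for s in re.split(r"(?<=[.?!])", text) if s]

# Matching the punctuation itself consumes it during the split.
lost = [s for s in re.split(r"[.?!]", text) if s]

print(kept)
print(lost)
print("".join(kept) == text, "".join(lost) == text)
```

Only the lookbehind variant can be joined back into the original text; the second pattern drops every sentence-ending mark.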
- class flexrag.chunking.sentence_splitter.SpacySentenceSplitterConfig(model='en_core_web_sm')
Configuration for SpacySentenceSplitter.
- Parameters:
model (str) – The spaCy model to use for sentence splitting. Default is “en_core_web_sm”.
- dump(path)
Dump the dataclass to a YAML file.
- dumps()
Dump the dataclass to a YAML string.
- classmethod load(path)
Load the dataclass from a YAML file.
- classmethod loads(s)
Load the dataclass from a YAML string.
- class flexrag.chunking.sentence_splitter.SpacySentenceSplitter(cfg)
Bases: SentenceSplitterBase
SpacySentenceSplitter splits text into sentences using spaCy’s sentence splitter.
- property reversible
SpacySentenceSplitter is not reversible as it may lose spaces between sentences.
- sentence_splitter.PREDEFINED_SPLIT_PATTERNS = {'en': {'big_paragraph': '(?<=\\R{2,})', 'paragraph': '(?<=\\R)', 'sentence': '(?<=[.?!])', 'subsentence': '(?<=[,;\\"\'{}<>\\[\\]`~])', 'word': '(?<=\\s)'}, 'zh': {'big_paragraph': '(?<=\\R{2,})', 'paragraph': '(?<=\\R)', 'setence': '(?<=[。!?])', 'subsentence': '(?<=[,;:“”‘’《》【】、])'}}
A dictionary of predefined splitting patterns. The top-level keys are language codes; each maps the names of the patterns to the corresponding regular expressions. Currently, FlexRAG provides 2 sets of predefined patterns: “en” for English and “zh” for Chinese. Please refer to the source code for more details.
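Some of these patterns rely on syntax that the built-in re module does not accept, which is one practical reason the splitters use the regex module. The sketch below copies two of the English patterns by hand (it does not import FlexRAG) and checks them against re; the behavior shown for \R reflects current CPython releases as far as known and is an assumption about future ones.

```python
import re

# Copied by hand from the "en" entry above.
sentence_pattern = r"(?<=[.?!])"
paragraph_pattern = r"(?<=\R)"

# The sentence pattern is plain enough for the standard library.
parts = [s for s in re.split(sentence_pattern, "One. Two? Three!") if s]

# \R (any line break) is not an escape that `re` recognizes,
# so compiling the paragraph pattern is rejected.
try:
    re.compile(paragraph_pattern)
    stdlib_supports_R = True
except re.error:
    stdlib_supports_R = False

print(parts, stdlib_supports_R)
```

When experimenting with the paragraph-level patterns outside FlexRAG, the third-party regex package is therefore the safer choice.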
General Configuration
The general configuration classes provide a common interface for selecting and configuring a chunker or a sentence splitter.
- class flexrag.chunking.ChunkerConfig(chunker_type='sentence_chunker', char_chunker_config=<factory>, token_chunker_config=<factory>, recursive_chunker_config=<factory>, sentence_chunker_config=<factory>, semantic_chunker_config=<factory>)
Configuration class for chunker (name: ChunkerConfig, default: sentence_chunker).
- Parameters:
chunker_type (str) – The chunker type to use.
char_chunker_config (CharChunkerConfig) – The config for CharChunker.
token_chunker_config (TokenChunkerConfig) – The config for TokenChunker.
recursive_chunker_config (RecursiveChunkerConfig) – The config for RecursiveChunker.
sentence_chunker_config (SentenceChunkerConfig) – The config for SentenceChunker.
semantic_chunker_config (SemanticChunkerConfig) – The config for SemanticChunker.
- class flexrag.chunking.sentence_splitter.SentenceSplitterConfig(sentence_splitter_type='regex', nltk_splitter_config=<factory>, regex_config=<factory>, spacy_config=<factory>)
Configuration class for sentence_splitter (name: SentenceSplitterConfig, default: regex).
- Parameters:
sentence_splitter_type (str) – The sentence_splitter type to use.
nltk_splitter_config (NLTKSentenceSplitterConfig) – The config for NLTKSentenceSplitter.
regex_config (RegexSplitterConfig) – The config for RegexSplitter.
spacy_config (SpacySentenceSplitterConfig) – The config for SpacySentenceSplitter.