Chunking#

This module provides a set of classes for chunking a long text into smaller chunks.

The Chunker Interface#

ChunkerBase is the base class for all chunkers. It provides a simple interface for chunking a text into smaller chunks. The chunking process is controlled by a configuration object that is passed to the chunker’s constructor.

class flexrag.chunking.ChunkerBase[source]#

Base class for chunkers that split text into smaller chunks. This is an abstract class that defines the interface for all chunkers. Subclasses should implement the chunk method to split the text.

abstract chunk(text, return_str=False)[source]#

Chunk the given text into smaller chunks.

Parameters:
  • text (str) – The text to chunk.

  • return_str (bool) – If True, return the chunks as strings instead of Chunk objects. Default is False.

Returns:

The chunks of the text.

Return type:

list[Chunk]
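The interface above can be illustrated with a minimal, self-contained sketch. It does not import FlexRAG: `Chunk`, `ChunkerSketch`, and `LineChunker` are hypothetical stand-ins that only mirror the documented `chunk(text, return_str=False)` signature, and the real `Chunk` class may carry more fields.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class Chunk:
    """Hypothetical stand-in for FlexRAG's Chunk; only the text is modeled."""

    text: str


class ChunkerSketch(ABC):
    """Stand-in base class mirroring the documented chunk() signature."""

    @abstractmethod
    def chunk(self, text: str, return_str: bool = False) -> list:
        """Split text into chunks; return plain strings if return_str is True."""


class LineChunker(ChunkerSketch):
    """Toy subclass: one chunk per non-empty line."""

    def chunk(self, text: str, return_str: bool = False) -> list:
        pieces = [line for line in text.splitlines() if line.strip()]
        return pieces if return_str else [Chunk(p) for p in pieces]
```

A subclass only has to decide where to cut; the `return_str` flag then selects between plain strings and wrapped objects.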

Chunkers#

class flexrag.chunking.CharChunkerConfig(max_chars=2048, overlap=0)[source]#

Configuration for CharChunker.

Parameters:
  • max_chars (int) – The number of characters in each chunk. Default is 2048.

  • overlap (int) – The number of characters to overlap between chunks. Default is 0.

For example, to split a text into chunks of 1024 characters with a 128-character overlap:

from flexrag.chunking import CharChunkerConfig, CharChunker

cfg = CharChunkerConfig(max_chars=1024, overlap=128)
chunker = CharChunker(cfg)

dump(path)#

Dump the dataclass to a YAML file.

dumps()#

Dump the dataclass to a YAML string.

classmethod load(path)#

Load the dataclass from a YAML file.

classmethod loads(s)#

Load the dataclass from a YAML string.

class flexrag.chunking.CharChunker(cfg)[source]#

Bases: ChunkerBase

CharChunker splits text into chunks with fixed length of characters.

chunk(text, return_str=False)[source]#

Chunk the given text into smaller chunks.

Parameters:
  • text (str) – The text to chunk.

  • return_str (bool) – If True, return the chunks as strings instead of Chunk objects. Default is False.

Returns:

The chunks of the text.

Return type:

list[Chunk]
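To make the roles of max_chars and overlap concrete, here is a minimal sketch of fixed-size character chunking. It is an illustration under the assumption that consecutive chunks start `max_chars - overlap` characters apart; `char_chunks` is a hypothetical helper, not the CharChunker implementation.

```python
from typing import List


def char_chunks(text: str, max_chars: int, overlap: int = 0) -> List[str]:
    """Split text into windows of max_chars characters sharing `overlap` characters."""
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must satisfy 0 <= overlap < max_chars")
    stride = max_chars - overlap  # distance between consecutive chunk starts
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # this window already reaches the end of the text
        start += stride
    return chunks


print(char_chunks("abcdefghij", max_chars=4, overlap=2))
```

With an overlap of 2, each chunk repeats the last two characters of the previous one.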

class flexrag.chunking.TokenChunkerConfig(tokenizer_type='tiktoken', hf_config=<factory>, tiktoken_config=<factory>, moses_config=<factory>, nltk_tokenizer_config=<factory>, jieba_config=<factory>, max_tokens=512, overlap=0)[source]#

Bases: tokenizer_config

Configuration for TokenChunker.

Parameters:
  • max_tokens (int) – The number of tokens in each chunk. Default is 512.

  • overlap (int) – The number of tokens to overlap between chunks. Default is 0.

For example, to split a text into chunks of 256 tokens with a 128-token overlap:

from flexrag.chunking import TokenChunkerConfig, TokenChunker
from flexrag.models.tokenizer import TikTokenTokenizerConfig

cfg = TokenChunkerConfig(
    max_tokens=256,
    overlap=128,
    tokenizer_type="tiktoken",
    tiktoken_config=TikTokenTokenizerConfig(model_name="gpt-4o"),
)
chunker = TokenChunker(cfg)

Note that the TokenChunker relies on the tokenize and detokenize methods of the tokenizer to split the text. As a result, whitespace between tokens may be lost if the tokenizer is not reversible.
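The effect of a non-reversible tokenizer can be seen in a small sketch. The whitespace tokenizer below is a deliberately lossy stand-in chosen for illustration; `tokenize`, `detokenize`, and `token_chunks` are hypothetical helpers and do not reflect how FlexRAG's tokenizers behave.

```python
from typing import List


def tokenize(text: str) -> List[str]:
    """Lossy stand-in tokenizer: splits on any run of whitespace."""
    return text.split()


def detokenize(tokens: List[str]) -> str:
    """Joins tokens with single spaces, so the original spacing is not restored."""
    return " ".join(tokens)


def token_chunks(text: str, max_tokens: int) -> List[str]:
    """Group tokens into windows of max_tokens and detokenize each window."""
    tokens = tokenize(text)
    return [
        detokenize(tokens[i:i + max_tokens])
        for i in range(0, len(tokens), max_tokens)
    ]


text = "alpha  beta\ngamma delta"
chunks = token_chunks(text, max_tokens=2)
print(chunks)                    # the double space and the newline are gone
print("".join(chunks) == text)   # False: the round trip is not exact
```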

dump(path)#

Dump the dataclass to a YAML file.

dumps()#

Dump the dataclass to a YAML string.

classmethod load(path)#

Load the dataclass from a YAML file.

classmethod loads(s)#

Load the dataclass from a YAML string.

class flexrag.chunking.TokenChunker(cfg)[source]#

Bases: ChunkerBase

TokenChunker splits text into chunks with fixed number of tokens.

chunk(text, return_str=False)[source]#

Chunk the given text into smaller chunks.

Parameters:
  • text (str) – The text to chunk.

  • return_str (bool) – If True, return the chunks as strings instead of Chunk objects. Default is False.

Returns:

The chunks of the text.

Return type:

list[Chunk]

class flexrag.chunking.RecursiveChunkerConfig(tokenizer_type='tiktoken', hf_config=<factory>, tiktoken_config=<factory>, moses_config=<factory>, nltk_tokenizer_config=<factory>, jieba_config=<factory>, max_tokens=512, split_pattern=<factory>)[source]#

Bases: tokenizer_config

Configuration for RecursiveChunker.

Parameters:
  • max_tokens (int) – The maximum number of tokens in each chunk. Default is 512.

  • split_pattern (dict[str, str]) – The separator patterns used to split text recursively. The order of the separators matters. Default is PREDEFINED_SPLIT_PATTERNS["en"].

For example, to split a text recursively with 256 tokens in each chunk:

from flexrag.chunking import RecursiveChunkerConfig, RecursiveChunker

cfg = RecursiveChunkerConfig(max_tokens=256)
chunker = RecursiveChunker(cfg)

You can also specify your own separator patterns:

from flexrag.chunking import RecursiveChunkerConfig, RecursiveChunker

cfg = RecursiveChunkerConfig(
    max_tokens=256,
    split_pattern={"level1": "pattern1", "level2": "pattern2"},
)
chunker = RecursiveChunker(cfg)

Note that the RecursiveChunker relies on regex patterns to split the text, so you need to make sure your patterns do not consume the separator. A good practice is to use lookbehind and lookahead assertions, which match a position without consuming the separator.
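The difference between a consuming pattern and a lookbehind assertion can be checked directly. This sketch uses Python's built-in `re` module (3.7 or later, where splitting on zero-width matches is supported) rather than the library itself, purely to illustrate the point.

```python
import re

text = "First point. Second point? Third!"

# A consuming pattern removes the punctuation it matches.
consuming = [p for p in re.split(r"[.?!]", text) if p]

# A lookbehind assertion matches the position after the punctuation,
# so nothing is removed from the text.
preserving = [p for p in re.split(r"(?<=[.?!])", text) if p]

print(consuming)
print(preserving)
print("".join(preserving) == text)  # True: the pieces rebuild the original
```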

dump(path)#

Dump the dataclass to a YAML file.

dumps()#

Dump the dataclass to a YAML string.

classmethod load(path)#

Load the dataclass from a YAML file.

classmethod loads(s)#

Load the dataclass from a YAML string.

class flexrag.chunking.RecursiveChunker(cfg)[source]#

Bases: ChunkerBase

RecursiveChunker splits text into chunks recursively using the specified separators.

The order of the separators matters. The text is split recursively, applying the separators in the order in which they are given. The default separators are defined in PREDEFINED_SPLIT_PATTERNS.

If the text is still too long after splitting with the last-level separator, it is split into tokens.
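The recursive strategy described above can be sketched as follows. This is an illustration under stated assumptions, not the RecursiveChunker implementation: tokens are approximated by whitespace-separated words, the patterns are written for the built-in `re` module, and `count_tokens`, `split_into_words`, and `recursive_split` are hypothetical helpers.

```python
import re
from typing import List


def count_tokens(text: str) -> int:
    """Assumption: a token is a whitespace-separated word."""
    return len(text.split())


def split_into_words(text: str, max_tokens: int) -> List[str]:
    """Last resort: cut the text into windows of at most max_tokens words."""
    pieces = re.findall(r"\S+\s*|\s+", text)
    return ["".join(pieces[i:i + max_tokens]) for i in range(0, len(pieces), max_tokens)]


def recursive_split(text: str, patterns: List[str], max_tokens: int) -> List[str]:
    """Split with the first pattern, merge small pieces, recurse on oversized ones."""
    if count_tokens(text) <= max_tokens:
        return [text]
    if not patterns:
        return split_into_words(text, max_tokens)
    pattern, rest = patterns[0], patterns[1:]
    chunks, current = [], ""
    for piece in (p for p in re.split(pattern, text) if p):
        if count_tokens(current + piece) <= max_tokens:
            current += piece  # the piece still fits into the open chunk
            continue
        if current:
            chunks.append(current)
        if count_tokens(piece) <= max_tokens:
            current = piece
        else:
            current = ""
            chunks.extend(recursive_split(piece, rest, max_tokens))
    if current:
        chunks.append(current)
    return chunks


# Coarse-to-fine separators: lines, then sentences, then words.
patterns = [r"(?<=\n)", r"(?<=[.?!])", r"(?<=\s)"]
text = "One two three. Four five six seven eight. Nine ten.\nEleven twelve."
for chunk in recursive_split(text, patterns, max_tokens=4):
    print(repr(chunk))
```

Because every pattern is a lookbehind assertion, concatenating the resulting chunks reproduces the input text.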

chunk(text, return_str=False)[source]#

Chunk the given text into smaller chunks.

Parameters:
  • text (str) – The text to chunk.

  • return_str (bool) – If True, return the chunks as strings instead of Chunk objects. Default is False.

Returns:

The chunks of the text.

Return type:

list[Chunk]

class flexrag.chunking.SentenceChunkerConfig(sentence_splitter_type='regex', nltk_splitter_config=<factory>, regex_config=<factory>, spacy_config=<factory>, tokenizer_type='tiktoken', hf_config=<factory>, tiktoken_config=<factory>, moses_config=<factory>, nltk_tokenizer_config=<factory>, jieba_config=<factory>, max_sents=None, max_tokens=None, max_chars=None, overlap=0)[source]#

Bases: tokenizer_config, SentenceSplitterConfig

Configuration for SentenceChunker.

Parameters:
  • max_sents (Optional[int]) – The maximum number of sentences in each chunk. Default is None.

  • max_tokens (Optional[int]) – The maximum number of tokens in each chunk. Default is None.

  • max_chars (Optional[int]) – The maximum number of characters in each chunk. Default is None.

  • overlap (int) – The number of sentences to overlap between chunks. Default is 0.

For example, to chunk a text into chunks with 10 sentences in each chunk:

from flexrag.chunking import SentenceChunkerConfig, SentenceChunker

cfg = SentenceChunkerConfig(max_sents=10)
chunker = SentenceChunker(cfg)

Note that the SentenceChunker relies on the sentence splitter to split the text, so whitespace between sentences may be lost if the sentence splitter is not reversible.

dump(path)#

Dump the dataclass to a YAML file.

dumps()#

Dump the dataclass to a YAML string.

classmethod load(path)#

Load the dataclass from a YAML file.

classmethod loads(s)#

Load the dataclass from a YAML string.

class flexrag.chunking.SentenceChunker(cfg)[source]#

Bases: ChunkerBase

SentenceChunker first splits text into sentences using the specified sentence splitter, then merges the sentences into chunks based on the specified constraints.

chunk(text, return_str=False)[source]#

Chunk the given text into smaller chunks.

Parameters:
  • text (str) – The text to chunk.

  • return_str (bool) – If True, return the chunks as strings instead of Chunk objects. Default is False.

Returns:

The chunks of the text.

Return type:

list[Chunk]
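The merge step of a sentence-based chunker can be sketched as a greedy loop over already split sentences. This is an assumption-laden illustration rather than the SentenceChunker code: `merge_sentences` is a hypothetical helper, it handles only max_sents and max_chars, it ignores overlap, and it keeps a single oversized sentence as its own chunk.

```python
from typing import List, Optional


def merge_sentences(
    sentences: List[str],
    max_sents: Optional[int] = None,
    max_chars: Optional[int] = None,
) -> List[str]:
    """Greedily pack sentences into chunks without exceeding the given limits."""
    chunks: List[str] = []
    current: List[str] = []
    for sent in sentences:
        candidate = current + [sent]
        too_many = max_sents is not None and len(candidate) > max_sents
        too_long = max_chars is not None and len("".join(candidate)) > max_chars
        if current and (too_many or too_long):
            chunks.append("".join(current))  # close the open chunk
            current = [sent]
        else:
            current = candidate
    if current:
        chunks.append("".join(current))
    return chunks


sentences = ["A.", " B.", " C.", " D.", " E."]
print(merge_sentences(sentences, max_sents=2))
print(merge_sentences(sentences, max_chars=5))
```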

class flexrag.chunking.SemanticChunkerConfig(tokenizer_type='tiktoken', hf_config=<factory>, tiktoken_config=<factory>, moses_config=<factory>, nltk_tokenizer_config=<factory>, jieba_config=<factory>, encoder_type=None, cohere_config=<factory>, hf_clip_config=<factory>, jina_config=<factory>, ollama_config=<factory>, openai_config=<factory>, sentence_transformer_config=<factory>, sentence_splitter_type='regex', nltk_splitter_config=<factory>, regex_config=<factory>, spacy_config=<factory>, max_tokens=None, max_tokens_per_sentence=None, threshold=None, threshold_percentile=None, similarity_window=1, similarity_function='COS')[source]#

Bases: SentenceSplitterConfig, EncoderConfig, tokenizer_config

Configuration for SemanticChunker.

Parameters:
  • max_tokens (Optional[int]) – The maximum number of tokens in each chunk. Default is None.

  • threshold (Optional[float]) – The threshold for semantic similarity. Default is None. If provided, the threshold_percentile and max_tokens will be ignored.

  • threshold_percentile (Optional[float]) – The percentile of the similarity distribution used as the threshold. Default is None. Should be a value between 0 and 100. Higher values will result in more chunks. 5 is a good starting point. If provided, max_tokens will be ignored.

  • similarity_window (Optional[int]) – The window size for calculating semantic similarity. Default is 1.

  • similarity_function (str) – The similarity function to use. Default is “COS”. Available choices are “L2” for the reciprocal of euclidean distance, “IP” for inner product, and “COS” for cosine similarity.
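The three similarity choices listed above can be written out for plain Python lists as follows. This is a sketch for illustration only: the function names are hypothetical, and returning infinity for identical vectors in the L2 case is an assumption, since the documentation does not say how a zero distance is handled.

```python
import math
from typing import Sequence


def inner_product(a: Sequence[float], b: Sequence[float]) -> float:
    """'IP': sum of element-wise products."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """'COS': inner product of the two vectors divided by the product of their norms."""
    return inner_product(a, b) / (
        math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b))
    )


def l2_reciprocal(a: Sequence[float], b: Sequence[float]) -> float:
    """'L2': reciprocal of the Euclidean distance (assumed infinite at distance 0)."""
    dist = math.dist(a, b)  # available in Python 3.8+
    return math.inf if dist == 0 else 1.0 / dist


print(inner_product([3.0, 4.0], [3.0, 4.0]))
print(cosine([1.0, 0.0], [0.0, 1.0]))
print(l2_reciprocal([0.0, 0.0], [3.0, 4.0]))
```

In all three cases a larger value means the two embeddings are closer, which is what the threshold comparison relies on.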

Adjacent sentences whose similarity is above the threshold are considered coherent, and the text is split at the points where the similarity falls below the threshold. Thus, at least one of max_tokens, threshold, or threshold_percentile should be provided. If threshold is provided, the chunks are split directly based on that value. If threshold_percentile is provided, the threshold is calculated automatically from the similarity distribution. If max_tokens is provided, the threshold is calculated so that the chunks stay within the token limit.
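The threshold logic can be sketched with precomputed similarities between neighbouring sentences. The nearest-rank percentile used here is an assumption, since the exact percentile rule is not stated in this documentation, and `percentile_threshold` and `split_by_similarity` are hypothetical helpers rather than FlexRAG functions.

```python
import math
from typing import List


def percentile_threshold(similarities: List[float], percentile: float) -> float:
    """Nearest-rank percentile of the similarity values (assumed rule)."""
    ordered = sorted(similarities)
    rank = max(1, math.ceil(percentile / 100 * len(ordered)))
    return ordered[rank - 1]


def split_by_similarity(
    sentences: List[str], similarities: List[float], threshold: float
) -> List[List[str]]:
    """similarities[i] compares sentence i with sentence i + 1."""
    if not sentences:
        return []
    chunks, current = [], [sentences[0]]
    for sent, sim in zip(sentences[1:], similarities):
        if sim < threshold:
            chunks.append(current)  # similarity dropped: start a new chunk
            current = [sent]
        else:
            current.append(sent)
    chunks.append(current)
    return chunks


sentences = ["s0", "s1", "s2", "s3", "s4"]
similarities = [0.9, 0.2, 0.8, 0.4]

# Explicit threshold: split wherever the similarity is below 0.5.
print(split_by_similarity(sentences, similarities, threshold=0.5))

# Threshold derived from the similarity distribution.
derived = percentile_threshold(similarities, percentile=50)
print(derived, split_by_similarity(sentences, similarities, derived))
```

Raising the percentile raises the derived threshold, so more boundaries fall below it and more chunks are produced.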

For example, to split the text into chunks with a maximum of 512 tokens, you can use the following configuration:

>>> from flexrag.chunking import SemanticChunker, SemanticChunkerConfig
>>> from flexrag.models import HFEncoderConfig
>>> config = SemanticChunkerConfig(
...     max_tokens=512,
...     encoder_type="hf",
...     hf_config=HFEncoderConfig(model_path="BAAI/bge-small-en-v1.5"),
... )
>>> chunker = SemanticChunker(config)

To split the text into chunks with a threshold_percentile of 5%, you can use the following configuration:

>>> config = SemanticChunkerConfig(
...     threshold_percentile=5,
...     encoder_type="hf",
...     hf_config=HFEncoderConfig(model_path="BAAI/bge-small-en-v1.5"),
... )
>>> chunker = SemanticChunker(config)

To split the text into chunks with a given threshold, you can use the following configuration:

>>> config = SemanticChunkerConfig(
...     threshold=0.8,
...     encoder_type="hf",
...     hf_config=HFEncoderConfig(model_path="BAAI/bge-small-en-v1.5"),
... )
>>> chunker = SemanticChunker(config)

dump(path)#

Dump the dataclass to a YAML file.

dumps()#

Dump the dataclass to a YAML string.

classmethod load(path)#

Load the dataclass from a YAML file.

classmethod loads(s)#

Load the dataclass from a YAML string.

class flexrag.chunking.SemanticChunker(cfg)[source]#

Bases: ChunkerBase

SemanticChunker splits text into sentences and then groups them into chunks based on semantic similarity. This chunker is inspired by Greg Kamradt’s notebook in the FullStackRetrieval-com/RetrievalTutorials repository.

chunk(text, return_str=False)[source]#

Chunk the given text into smaller chunks.

Parameters:
  • text (str) – The text to chunk.

  • return_str (bool) – If True, return the chunks as strings instead of Chunk objects. Default is False.

Returns:

The chunks of the text.

Return type:

list[Chunk]

Sentence Splitters#

This submodule provides a set of useful tools for splitting a text into sentences.

class flexrag.chunking.sentence_splitter.SentenceSplitterBase[source]#

Sentence splitter that splits text into sentences. This is an abstract class that defines the interface for all sentence splitters. Subclasses should implement the split method to split the text. The reversible property should return True if the split sentences can be concatenated back into the original text.

abstract property reversible#

Return True if the split sentences can be concatenated back into the original text.

abstract split(text)[source]#

Split the given text into sentences.

Parameters:

text (str) – The text to split.

Returns:

The sentences of the text.

Return type:

list[str]
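The meaning of the reversible property can be illustrated with two toy splitters. The classes below are hypothetical stand-ins built on the built-in `re` module; they mirror the documented `split` and `reversible` members but are not FlexRAG classes.

```python
import re
from abc import ABC, abstractmethod
from typing import List


class SplitterSketch(ABC):
    """Stand-in mirroring the documented split() method and reversible property."""

    @property
    @abstractmethod
    def reversible(self) -> bool:
        """True if joining the sentences reproduces the input exactly."""

    @abstractmethod
    def split(self, text: str) -> List[str]:
        """Split text into sentences."""


class LookbehindSplitter(SplitterSketch):
    """Keeps every character, so the split can be undone."""

    @property
    def reversible(self) -> bool:
        return True

    def split(self, text: str) -> List[str]:
        return [s for s in re.split(r"(?<=[.?!])", text) if s]


class StrippingSplitter(SplitterSketch):
    """Strips surrounding whitespace, so the original spacing is lost."""

    @property
    def reversible(self) -> bool:
        return False

    def split(self, text: str) -> List[str]:
        return [s.strip() for s in re.split(r"(?<=[.?!])", text) if s.strip()]


def roundtrips(splitter: SplitterSketch, text: str) -> bool:
    """Check whether concatenating the sentences yields the original text."""
    return "".join(splitter.split(text)) == text


text = "Hi there. Bye!"
print(roundtrips(LookbehindSplitter(), text))
print(roundtrips(StrippingSplitter(), text))
```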

class flexrag.chunking.sentence_splitter.NLTKSentenceSplitterConfig(language='english')[source]#

Configuration for NLTKSentenceSplitter.

Parameters:

language (str) – The language to use for the sentence splitter. Default is “english”.

dump(path)#

Dump the dataclass to a YAML file.

dumps()#

Dump the dataclass to a YAML string.

classmethod load(path)#

Load the dataclass from a YAML file.

classmethod loads(s)#

Load the dataclass from a YAML string.

class flexrag.chunking.sentence_splitter.NLTKSentenceSplitter(cfg)[source]#

Bases: SentenceSplitterBase

NLTKSentenceSplitter splits text into sentences using NLTK’s PunktSentenceTokenizer. For more information, see https://www.nltk.org/api/nltk.tokenize.punkt.html#module-nltk.tokenize.punkt.

property reversible#

NLTKSentenceSplitter is not reversible as it may lose spaces between sentences.

split(text)[source]#

Split the given text into sentences.

Parameters:

text (str) – The text to split.

Returns:

The sentences of the text.

Return type:

list[str]

class flexrag.chunking.sentence_splitter.RegexSplitterConfig(pattern='(?<=[.?!])')[source]#

Configuration for RegexSplitter.

Parameters:

pattern (str) – The regular expression pattern to split the text. Default is PREDEFINED_SPLIT_PATTERNS["en"]["sentence"].

Note that some patterns may lose the separators between sentences. A good practice is to use lookbehind and lookahead assertions, which match a position without consuming the separator.

dump(path)#

Dump the dataclass to a YAML file.

dumps()#

Dump the dataclass to a YAML string.

classmethod load(path)#

Load the dataclass from a YAML file.

classmethod loads(s)#

Load the dataclass from a YAML string.

class flexrag.chunking.sentence_splitter.RegexSplitter(cfg)[source]#

Bases: SentenceSplitterBase

RegexSplitter splits text into sentences using a regular expression pattern.

Note that this splitter uses the regex module, which may behave slightly differently from the built-in re module.

property reversible#

The default RegexSplitter is reversible. However, the reversibility depends on the pattern used.

split(text)[source]#

Split the given text into sentences.

Parameters:

text (str) – The text to split.

Returns:

The sentences of the text.

Return type:

list[str]

class flexrag.chunking.sentence_splitter.SpacySentenceSplitterConfig(model='en_core_web_sm')[source]#

Configuration for SpacySentenceSplitter.

Parameters:

model (str) – The spaCy model to use for sentence splitting. Default is “en_core_web_sm”.

dump(path)#

Dump the dataclass to a YAML file.

dumps()#

Dump the dataclass to a YAML string.

classmethod load(path)#

Load the dataclass from a YAML file.

classmethod loads(s)#

Load the dataclass from a YAML string.

class flexrag.chunking.sentence_splitter.SpacySentenceSplitter(cfg)[source]#

Bases: SentenceSplitterBase

SpacySentenceSplitter splits text into sentences using spaCy’s sentence splitter.

property reversible#

SpacySentenceSplitter is not reversible as it may lose spaces between sentences.

split(text)[source]#

Split the given text into sentences.

Parameters:

text (str) – The text to split.

Returns:

The sentences of the text.

Return type:

list[str]

sentence_splitter.PREDEFINED_SPLIT_PATTERNS = {'en': {'big_paragraph': '(?<=\\R{2,})', 'paragraph': '(?<=\\R)', 'sentence': '(?<=[.?!])', 'subsentence': '(?<=[,;\\"\'{}<>\\[\\]`~])', 'word': '(?<=\\s)'}, 'zh': {'big_paragraph': '(?<=\\R{2,})', 'paragraph': '(?<=\\R)', 'setence': '(?<=[。!?])', 'subsentence': '(?<=[,;:“”‘’《》【】、])'}}#

A dictionary of predefined splitting patterns. The top-level keys are language codes, and each maps pattern names to the corresponding regular expressions. Currently, FlexRAG provides two sets of predefined patterns: “en” for English and “zh” for Chinese. Please refer to the source code for more details.

General Configuration#

The general configuration classes provide a common interface for loading and configuring a chunker or a sentence splitter.

class flexrag.chunking.ChunkerConfig(chunker_type='sentence_chunker', char_chunker_config=<factory>, token_chunker_config=<factory>, recursive_chunker_config=<factory>, sentence_chunker_config=<factory>, semantic_chunker_config=<factory>)#

Configuration class for chunker (name: ChunkerConfig, default: sentence_chunker).

class flexrag.chunking.sentence_splitter.SentenceSplitterConfig(sentence_splitter_type='regex', nltk_splitter_config=<factory>, regex_config=<factory>, spacy_config=<factory>)#

Configuration class for sentence_splitter (name: SentenceSplitterConfig, default: regex).

Parameters: