Tokenizer

This module wraps several third-party tokenizers behind one consistent interface for converting text into tokens. Depending on the tokenizer, tokens are either strings or integers.

The Tokenizer Interface

TokenizerBase is the base class for all tokenizers.

class flexrag.models.tokenizer.TokenizerBase

TokenizerBase is an abstract class that defines the interface for all tokenizers. These tokenizers are useful in the text_processing module and the chunking module.

Subclasses should implement the tokenize and detokenize methods to convert text to tokens and back. The reversible property should return True if detokenizing the output of tokenize recovers the original text.

abstract detokenize(tokens)

Detokenize the tokens back to text.

Parameters: tokens (list[TokenType]) -- The tokens to detokenize.
Returns: The detokenized text.
Return type: str

abstract property reversible: Return True if the tokenizer can detokenize the tokens back to the original text.

abstract tokenize(texts)

Tokenize the given text into tokens.

Parameters: texts (str) -- The text to tokenize.
Returns: The tokens of the text.
Return type: list[TokenType]
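
As a rough illustration of the interface above, the sketch below defines a local stand-in for TokenizerBase (so the snippet runs without FlexRAG installed) and a hypothetical CharTokenizer subclass. The stand-in only mirrors the three documented members; the real base class may carry additional behavior, so treat this as a sketch under those assumptions rather than the library's implementation.

```python
from abc import ABC, abstractmethod
from typing import Generic, TypeVar

TokenType = TypeVar("TokenType")


class TokenizerBase(ABC, Generic[TokenType]):
    """Local stand-in mirroring the documented interface (not the FlexRAG class)."""

    @abstractmethod
    def tokenize(self, texts: str) -> list[TokenType]:
        """Convert text into a list of tokens."""

    @abstractmethod
    def detokenize(self, tokens: list[TokenType]) -> str:
        """Convert a list of tokens back into text."""

    @property
    @abstractmethod
    def reversible(self) -> bool:
        """Whether detokenize(tokenize(text)) recovers the original text."""


class CharTokenizer(TokenizerBase[str]):
    """Hypothetical tokenizer that treats every character as one token."""

    def tokenize(self, texts: str) -> list[str]:
        return list(texts)

    def detokenize(self, tokens: list[str]) -> str:
        return "".join(tokens)

    @property
    def reversible(self) -> bool:
        # Joining the characters always reproduces the input exactly.
        return True


tokenizer = CharTokenizer()
tokens = tokenizer.tokenize("a b")
restored = tokenizer.detokenize(tokens)
```

Because the whitespace character is kept as its own token, restored equals the input, which is what reversible is meant to signal.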

Tokenizers

The wrapped tokenizers.

class flexrag.models.tokenizer.HuggingFaceTokenizerConfig(tokenizer_path=None)

Configuration for HuggingFaceTokenizer.

Parameters: tokenizer_path (str) -- The path to the HuggingFace tokenizer.

dump(path): Dump the dataclass to a YAML file.

dumps(): Dump the dataclass to a YAML string.

classmethod load(path): Load the dataclass from a YAML file.

classmethod loads(s): Load the dataclass from a YAML string.

class flexrag.models.tokenizer.HuggingFaceTokenizer(cfg)

Bases: TokenizerBase[int]

A wrapper for HuggingFace tokenizers.

detokenize(tokens)

Detokenize the tokens back to text.

Parameters: tokens (list[TokenType]) -- The tokens to detokenize.
Returns: The detokenized text.
Return type: str

property reversible: Most HuggingFace tokenizers that employ a BPE/SPM model are reversible.

tokenize(texts)

Tokenize the given text into tokens.

Parameters: texts (str) -- The text to tokenize.
Returns: The tokens of the text.
Return type: list[TokenType]
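
The interface section notes that tokenizers are used by the chunking module. One plausible way a reversible integer tokenizer could support that is token-budget chunking: tokenize once, slice the token list, and detokenize each slice. The sketch below assumes only the documented tokenize, detokenize, and reversible members; chunk_by_tokens and CodePointTokenizer are hypothetical names introduced here, and this is not FlexRAG's chunking code.

```python
def chunk_by_tokens(tokenizer, text: str, max_tokens: int) -> list[str]:
    """Split text into pieces of at most max_tokens tokens each."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not tokenizer.reversible:
        # Slicing tokens only makes sense if detokenize restores the text.
        raise ValueError("token-level slicing needs a reversible tokenizer")
    tokens = tokenizer.tokenize(text)
    return [
        tokenizer.detokenize(tokens[start:start + max_tokens])
        for start in range(0, len(tokens), max_tokens)
    ]


class CodePointTokenizer:
    """Hypothetical stand-in: one integer token per Unicode code point."""

    reversible = True

    def tokenize(self, texts: str) -> list[int]:
        return [ord(ch) for ch in texts]

    def detokenize(self, tokens: list[int]) -> str:
        return "".join(chr(t) for t in tokens)


text = "retrieval augmented"
chunks = chunk_by_tokens(CodePointTokenizer(), text, max_tokens=8)
```

Any object exposing the same three members should work in place of the stand-in. With byte-level BPE vocabularies a slice boundary can fall inside a multi-byte character, so real code may need to adjust chunk boundaries.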

class flexrag.models.tokenizer.TikTokenTokenizerConfig(tokenizer_name=None, model_name='gpt-4o')

Configuration for TikTokenTokenizer.

Parameters:

tokenizer_name (Optional[str]) -- Load the tokenizer by its name. Default is None.
model_name (Optional[str]) -- Load the tokenizer that corresponds to the given OpenAI model. Default is "gpt-4o".

At least one of tokenizer_name or model_name must be provided.

dump(path): Dump the dataclass to a YAML file.

dumps(): Dump the dataclass to a YAML string.

classmethod load(path): Load the dataclass from a YAML file.

classmethod loads(s): Load the dataclass from a YAML string.

class flexrag.models.tokenizer.TikTokenTokenizer(cfg)

Bases: TokenizerBase[int]

A wrapper for TikToken tokenizers.

detokenize(tokens)

Detokenize the tokens back to text.

Parameters: tokens (list[TokenType]) -- The tokens to detokenize.
Returns: The detokenized text.
Return type: str

property reversible: TikTokenTokenizer is reversible.

tokenize(texts)

Tokenize the given text into tokens.

Parameters: texts (str) -- The text to tokenize.
Returns: The tokens of the text.
Return type: list[TokenType]

class flexrag.models.tokenizer.MosesTokenizerConfig(lang='en')

Configuration for MosesTokenizer.

Parameters: lang (str) -- The language code for the tokenizer. Default is "en".

dump(path): Dump the dataclass to a YAML file.

dumps(): Dump the dataclass to a YAML string.

classmethod load(path): Load the dataclass from a YAML file.

classmethod loads(s): Load the dataclass from a YAML string.

class flexrag.models.tokenizer.MosesTokenizer(cfg)

Bases: TokenizerBase[str]

A wrapper for SacreMoses tokenizers.

detokenize(tokens)

Detokenize the tokens back to text.

Parameters: tokens (list[TokenType]) -- The tokens to detokenize.
Returns: The detokenized text.
Return type: str

property reversible: MosesTokenizer is not reversible, as it may lose spaces and punctuation.

tokenize(texts)

Tokenize the given text into tokens.

Parameters: texts (str) -- The text to tokenize.
Returns: The tokens of the text.
Return type: list[TokenType]
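
To make the non-reversibility concrete, here is a minimal sketch using a regex-based word tokenizer as a stand-in. It is not SacreMoses and does not reproduce its rules; simple_word_tokenize and simple_detokenize are hypothetical helpers that only show how whitespace information disappears once text is split into word tokens.

```python
import re


def simple_word_tokenize(text: str) -> list[str]:
    # Words and individual punctuation marks become tokens; whitespace is dropped.
    return re.findall(r"\w+|[^\w\s]", text)


def simple_detokenize(tokens: list[str]) -> str:
    # Without the original spacing, the best this stand-in can do is
    # join the tokens with single spaces.
    return " ".join(tokens)


original = "Hello,  world!"
tokens = simple_word_tokenize(original)
restored = simple_detokenize(tokens)
round_trips = restored == original
```

Here round_trips is False: the double space and the attachment of punctuation to words are not recorded in the tokens. A rule-based detokenizer can reattach punctuation more sensibly than this stand-in, but it still has no way to know the original spacing.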

class flexrag.models.tokenizer.NLTKTokenizerConfig(lang='english')

Configuration for NLTKTokenizer.

Parameters: lang (str) -- The language to use for the tokenizer. Default is "english".

dump(path): Dump the dataclass to a YAML file.

dumps(): Dump the dataclass to a YAML string.

classmethod load(path): Load the dataclass from a YAML file.

classmethod loads(s): Load the dataclass from a YAML string.

class flexrag.models.tokenizer.NLTKTokenizer(cfg)

Bases: TokenizerBase[str]

A wrapper for NLTK tokenizers.

detokenize(tokens)

Detokenize the tokens back to text.

Parameters: tokens (list[TokenType]) -- The tokens to detokenize.
Returns: The detokenized text.
Return type: str

property reversible: NLTKTokenizer is not reversible, as it may lose spaces.

tokenize(texts)

Tokenize the given text into tokens.

Parameters: texts (str) -- The text to tokenize.
Returns: The tokens of the text.
Return type: list[TokenType]

class flexrag.models.tokenizer.JiebaTokenizerConfig(enable_hmm=True, cut_all=False)

Configuration for JiebaTokenizer.

Parameters:

enable_hmm (bool) -- Whether to use the Hidden Markov Model. Default is True.
cut_all (bool) -- Whether to use the full mode. Default is False.

dump(path): Dump the dataclass to a YAML file.

dumps(): Dump the dataclass to a YAML string.

classmethod load(path): Load the dataclass from a YAML file.

classmethod loads(s): Load the dataclass from a YAML string.

class flexrag.models.tokenizer.JiebaTokenizer(cfg)

Bases: TokenizerBase[str]

A wrapper for Jieba tokenizers.

detokenize(tokens)

Detokenize the tokens back to text.

Parameters: tokens (list[TokenType]) -- The tokens to detokenize.
Returns: The detokenized text.
Return type: str

property reversible: JiebaTokenizer is reversible.

tokenize(texts)

Tokenize the given text into tokens.

Parameters: texts (str) -- The text to tokenize.
Returns: The tokens of the text.
Return type: list[TokenType]
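
A segmenter whose tokens are consecutive, non-overlapping substrings of the input can be reversed by plain concatenation, which is one way to read the reversible claim above. The sketch below illustrates that idea with forward maximum matching over a tiny hand-written vocabulary. This is a different, much simpler technique than Jieba's own algorithm, and forward_max_match and the vocabulary are assumptions made for illustration only.

```python
def forward_max_match(text: str, vocab: set[str], max_len: int = 4) -> list[str]:
    """Greedy segmentation: at each position take the longest vocabulary match."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


vocab = {"我们", "喜欢", "自然", "语言", "自然语言", "处理"}
text = "我们喜欢自然语言处理"
tokens = forward_max_match(text, vocab)
restored = "".join(tokens)
```

Since every character of the input lands in exactly one token, restored equals text. A mode that emits overlapping candidate words would not have this property under simple concatenation.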
