Document Parser
This module provides classes and functions for parsing a formatted document (PDF, Word, etc.) into a structured format.
- class flexrag.document_parser.Document(source_file_path, title=None, text=None, screenshots=<factory>, images=<factory>)
  A document parsed by a DocumentParser.
  - dump(path)
    Dump the dataclass to a YAML file.
  - dumps()
    Dump the dataclass to a YAML string.
  - classmethod load(path)
    Load the dataclass from a YAML file.
  - classmethod loads(s)
    Load the dataclass from a YAML string.
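The `<factory>` defaults in the `Document` signature indicate that `screenshots` and `images` are built by a default factory, so each instance gets its own container. The sketch below illustrates this with a stand-in dataclass that only mirrors the documented field names. `DocumentSketch` is hypothetical and is not the real `flexrag` class, and the `list` field type is an assumption, since the signature does not show it.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DocumentSketch:
    """Hypothetical stand-in mirroring the documented Document signature."""

    source_file_path: str
    title: Optional[str] = None
    text: Optional[str] = None
    # Assumption: the real fields hold lists; the docs only show <factory>.
    screenshots: list = field(default_factory=list)
    images: list = field(default_factory=list)


first = DocumentSketch(source_file_path="report.pdf")
second = DocumentSketch(source_file_path="notes.docx", title="Notes")

# Each instance owns a separate list, so mutating one leaves the other empty.
first.images.append("figure-1.png")
print(first.images)   # ['figure-1.png']
print(second.images)  # []
```

Only `source_file_path` is required here, matching the signature above, where every other parameter has a default.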
- class flexrag.document_parser.DoclingConfig(do_ocr=False, do_table_structure=True, generate_page_images=False, generate_picture_images=False)
  - dump(path)
    Dump the dataclass to a YAML file.
  - dumps()
    Dump the dataclass to a YAML string.
  - classmethod load(path)
    Load the dataclass from a YAML file.
  - classmethod loads(s)
    Load the dataclass from a YAML string.
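As a rough illustration of the `DoclingConfig` defaults, the following sketch uses a stand-in dataclass with the same four boolean options and overrides two of them. `DoclingConfigSketch` is hypothetical and is not the real `flexrag` class; only the field names and default values are taken from the signature above.

```python
from dataclasses import asdict, dataclass, replace


@dataclass
class DoclingConfigSketch:
    """Hypothetical stand-in mirroring the documented DoclingConfig defaults."""

    do_ocr: bool = False
    do_table_structure: bool = True
    generate_page_images: bool = False
    generate_picture_images: bool = False


defaults = DoclingConfigSketch()

# Build a modified copy without mutating the defaults.
custom = replace(defaults, do_ocr=True, generate_page_images=True)

# Report only the options that differ from the documented defaults.
base = asdict(defaults)
changed = {name: value for name, value in asdict(custom).items() if base[name] != value}
print(changed)  # {'do_ocr': True, 'generate_page_images': True}
```

With the real class, a config built this way would presumably be passed to `DoclingParser(config)`, as its constructor signature suggests.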
- class flexrag.document_parser.DoclingParser(config)
  Bases: DocumentParserBase
- class flexrag.document_parser.MarkItDownParser
  Bases: DocumentParserBase