Document Parser

This module provides a set of classes and functions for parsing a formatted document (such as a PDF or Word file) into a structured format.

class flexrag.document_parser.Document(source_file_path, title=None, text=None, screenshots=<factory>, images=<factory>)

A document parsed by a DocumentParser.

dump(path)

Dump the dataclass to a YAML file.

dumps()

Dump the dataclass to a YAML string.

classmethod load(path)

Load the dataclass from a YAML file.

classmethod loads(s)

Load the dataclass from a YAML string.

class flexrag.document_parser.DocumentParserBase

abstract parse(document_path)

Parse the document at the given path.

Parameters:

document_path (str) – The path to the document to parse.

Returns:

The parsed document.

Return type:

Document
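
The abstract `parse` contract above can be illustrated with a minimal, self-contained sketch. The `Document` and `DocumentParserBase` classes below are stand-ins that only mirror the documented signatures; they are not FlexRAG's own implementations. `PlainTextParser` is a hypothetical subclass invented for illustration, and the list defaults for `screenshots` and `images` are an assumption about what `<factory>` produces.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from pathlib import Path
from typing import Optional


@dataclass
class Document:
    """Stand-in mirroring the documented fields; not FlexRAG's class."""

    source_file_path: str
    title: Optional[str] = None
    text: Optional[str] = None
    # Assumption: the documented <factory> defaults are empty lists.
    screenshots: list = field(default_factory=list)
    images: list = field(default_factory=list)


class DocumentParserBase(ABC):
    """Stand-in for the documented abstract base class."""

    @abstractmethod
    def parse(self, document_path: str) -> Document:
        """Parse the document at the given path."""


class PlainTextParser(DocumentParserBase):
    """Hypothetical parser that reads a UTF-8 text file unchanged."""

    def parse(self, document_path: str) -> Document:
        path = Path(document_path)
        return Document(
            source_file_path=str(path),
            title=path.stem,  # file name without its extension
            text=path.read_text(encoding="utf-8"),
        )
```

Because `parse` is abstract, instantiating the base class directly raises `TypeError`; only subclasses that implement `parse` can be created. With the real library, a custom parser would presumably subclass `flexrag.document_parser.DocumentParserBase` in the same way.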

class flexrag.document_parser.DoclingConfig(do_ocr=False, do_table_structure=True, generate_page_images=False, generate_picture_images=False)

dump(path)

Dump the dataclass to a YAML file.

dumps()

Dump the dataclass to a YAML string.

classmethod load(path)

Load the dataclass from a YAML file.

classmethod loads(s)

Load the dataclass from a YAML string.
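
As a rough sketch of how the documented defaults behave, the stand-in dataclass below copies the field names and default values from the `DoclingConfig` signature above. It is not FlexRAG's class and leaves out the YAML helpers. What each flag does is inferred from its name only.

```python
from dataclasses import asdict, dataclass, replace


@dataclass
class DoclingConfig:
    """Stand-in with the documented fields and defaults; not FlexRAG's class."""

    do_ocr: bool = False
    do_table_structure: bool = True
    generate_page_images: bool = False
    generate_picture_images: bool = False


# All documented defaults.
default_config = DoclingConfig()

# Override two flags and keep the remaining defaults.
# Assumption: these flags turn on OCR and page-image generation.
scanned_config = replace(default_config, do_ocr=True, generate_page_images=True)

print(asdict(scanned_config))
```

With the real library, the resulting configuration object would be passed to `DoclingParser(config)`, as shown in the class signature that follows.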

class flexrag.document_parser.DoclingParser(config)

Bases: DocumentParserBase

class flexrag.document_parser.MarkItDownParser

Bases: DocumentParserBase