Preparing the Knowledge Base

In practice, knowledge is typically stored in documents such as PDFs, Word files, and PowerPoint slides. This semi-structured data cannot be fed directly to large language models (LLMs) and is not suitable for building a knowledge base as is, so it must first be converted into structured text. This tutorial uses a simple example to show how to convert a batch of PDF files into structured data.

Tip

If you already have structured data, you can skip this tutorial.

Parse Files using FlexRAG’s Command-Line Tool

FlexRAG provides a command-line tool, prepare_corpus, that parses various file types into structured data. In this tutorial, we use a paper from arXiv as an example to demonstrate how to parse a PDF file with it.

Run the following command to download a paper from arXiv:

wget https://arxiv.org/pdf/2502.18139.pdf

You can then run the following command to parse this paper into structured knowledge base data:

python -m flexrag.entrypoints.prepare_corpus \
    document_paths=[2502.18139.pdf] \
    output_path=knowledge.jsonl \
    document_parser_type=markitdown \
    chunker_type=sentence_chunker \
    sentence_chunker_config.max_tokens=512 \
    sentence_chunker_config.tokenizer_type=tiktoken \
    sentence_chunker_config.tiktoken_config.model_name='gpt-4o'

In this command, we specify the following parameters:

document_paths: a list of file paths to parse. Here we parse only one paper;
output_path: the path where the parsed results are written. It should end with .jsonl, .csv, or .tsv;
document_parser_type: the type of document parser. Here we use markitdown;
chunker_type: the type of text chunker. Here we use sentence_chunker;
sentence_chunker_config.max_tokens: the maximum number of tokens per chunk. Here we set it to 512;
sentence_chunker_config.tokenizer_type: the type of tokenizer the chunker uses to count tokens. Here we use tiktoken, which is provided by OpenAI;
sentence_chunker_config.tiktoken_config.model_name: the model whose tokenizer is used. Here we use gpt-4o.
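To make the role of max_tokens more concrete, the sketch below shows one common way a sentence-based chunker can work: whole sentences are packed greedily into a chunk until the token budget would be exceeded. This is an illustration only and not FlexRAG's implementation. The function names (count_tokens, split_sentences, chunk_by_sentence) are hypothetical, the sentence splitter is deliberately naive, and tokens are approximated by whitespace-separated words instead of tiktoken, so the counts will differ from what the real command produces.

```python
import re
from typing import Callable, List


def count_tokens(text: str) -> int:
    """Stand-in token counter: whitespace-separated words (NOT tiktoken)."""
    return len(text.split())


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def chunk_by_sentence(
    text: str,
    max_tokens: int,
    counter: Callable[[str], int] = count_tokens,
) -> List[str]:
    """Greedily pack whole sentences into chunks of at most max_tokens tokens.

    A single sentence longer than max_tokens is kept as its own oversized
    chunk rather than being cut in the middle.
    """
    chunks: List[str] = []
    current: List[str] = []
    current_len = 0
    for sentence in split_sentences(text):
        n = counter(sentence)
        # Close the current chunk if adding this sentence would exceed the budget.
        if current and current_len + n > max_tokens:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks


if __name__ == "__main__":
    sample = "RAG needs a corpus. PDFs must be parsed first. Then the text is chunked."
    for i, chunk in enumerate(chunk_by_sentence(sample, max_tokens=8)):
        print(i, chunk)
```

The counter argument is kept pluggable so that a real tokenizer could be passed in place of the word count. A smaller max_tokens yields more, shorter chunks, while a larger value keeps more context together in each chunk.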

After executing the above command, you will see that the PDF file has been parsed into a JSONL file. As shown in the figure below, FlexRAG executed three steps in this process:

Parsing: parsing the file into structured data;
Chunking: chunking long text paragraphs in the structured data into short text paragraphs suitable for processing;
Preprocessing: preprocessing and filtering the chunked text paragraphs.
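To check the result of these three steps, you can inspect the output file yourself. The sketch below reads a JSONL file and reports how many records it contains and which top-level fields appear. It assumes only that each non-empty line is a JSON object; it does not assume any particular field names, since the exact schema written by prepare_corpus is not described here. The helper names load_jsonl and summarize are hypothetical.

```python
import json
from collections import Counter
from pathlib import Path
from typing import Any, Dict, List


def load_jsonl(path: str) -> List[Dict[str, Any]]:
    """Read a JSONL file: one JSON object per non-empty line."""
    records: List[Dict[str, Any]] = []
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def summarize(records: List[Dict[str, Any]]) -> Dict[str, int]:
    """Count how many records contain each top-level key."""
    keys: Counter = Counter()
    for record in records:
        keys.update(record.keys())
    return dict(keys)


if __name__ == "__main__":
    # Assumption: the file name matches output_path from the command above.
    path = Path("knowledge.jsonl")
    if path.exists():
        records = load_jsonl(str(path))
        print(f"{len(records)} records")
        print("fields:", summarize(records))
    else:
        print(f"{path} not found; run prepare_corpus first")
```

If a field appears in fewer records than the total count, some chunks are missing it, which is worth checking before building a retriever on top of the file.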

Tip

You can check the FlexRAG Entrypoints documentation for more information about the prepare_corpus command.

Tip

You can check the Preparing the Retriever documentation for how to build a retriever for your knowledge base.
