Preparing the Knowledge Base
In practice, knowledge is typically stored in documents such as PDFs, Word files, and PowerPoint presentations. However, this semi-structured data cannot be consumed directly by large language models (LLMs) and is not suitable for building a knowledge base as is. It therefore needs to be converted into structured text data beforehand. This tutorial uses a simple example to demonstrate how to convert PDF files into structured data.
Tip
If you already have structured data, you can skip this tutorial.
Parse Files using FlexRAG’s Command-Line Tool
FlexRAG provides a command-line tool, prepare_corpus, that parses various file types into structured data. This tutorial uses a paper from arXiv as an example to demonstrate how to parse a PDF file with this built-in tool.
Run the following command to download a paper from arXiv:
wget https://arxiv.org/pdf/2502.18139.pdf
You can then run the following command to parse this paper into structured knowledge base data:
python -m flexrag.entrypoints.prepare_corpus \
document_paths=[2502.18139.pdf] \
output_path=knowledge.jsonl \
document_parser_type=markitdown \
chunker_type=sentence_chunker \
sentence_chunker_config.max_tokens=512 \
sentence_chunker_config.tokenizer_type=tiktoken \
sentence_chunker_config.tiktoken_config.model_name='gpt-4o'
In this command, we specify the following parameters:
document_paths: a list of file paths to be parsed. Here we parse only one paper;
output_path: the output path of the parsed results. The path should end with .jsonl, .csv, or .tsv;
document_parser_type: the type of document parser. Here we use markitdown;
chunker_type: the type of text chunker. Here we use sentence_chunker;
sentence_chunker_config.max_tokens: the maximum number of tokens in each chunk produced by the text chunker. Here we set it to 512;
sentence_chunker_config.tokenizer_type: the type of tokenizer used by the text chunker. Here we use tiktoken, which is provided by OpenAI;
sentence_chunker_config.tiktoken_config.model_name: the model name used by the tokenizer. Here we use gpt-4o.
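To make the chunking parameters more concrete, the following is a minimal sketch of how a sentence-based chunker with a max_tokens limit might work. It is not FlexRAG's implementation: the function names are hypothetical, sentences are split with a naive regular expression, and tokens are counted as whitespace-separated words rather than with tiktoken.

```python
import re


def count_tokens(text: str) -> int:
    """Stand-in token counter (assumption): whitespace-separated words.

    The command in this tutorial counts tokens with tiktoken instead.
    """
    return len(text.split())


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def chunk_sentences(text: str, max_tokens: int) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_tokens tokens.

    A single sentence longer than max_tokens is kept whole as its own chunk.
    """
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for sentence in split_sentences(text):
        n = count_tokens(sentence)
        # Close the current chunk if adding this sentence would exceed the limit.
        if current and current_len + n > max_tokens:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks


if __name__ == "__main__":
    sample = "One two three. Four five. Six seven eight nine. Ten."
    for chunk in chunk_sentences(sample, max_tokens=5):
        print(chunk)
```

Because sentences are never split, a chunk boundary always falls between two sentences. A smaller max_tokens value yields more, shorter chunks.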
After executing the above command, you will see that the PDF file has been parsed into a JSONL file. As shown in the figure below, FlexRAG executed three steps in this process:
Parsing: parsing the file into structured data;
Chunking: chunking long text paragraphs in the structured data into short text paragraphs suitable for processing;
Preprocessing: preprocessing and filtering the chunked text paragraphs.
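To check the result of these three steps, the output file can be inspected with a few lines of Python. The sketch below assumes only that each non-empty line of knowledge.jsonl is a JSON object. It makes no assumption about the field names FlexRAG writes and simply reports which fields appear; the helper names read_jsonl and summarize are hypothetical.

```python
import json
from pathlib import Path


def read_jsonl(path, limit=None):
    """Return up to `limit` records from a JSON Lines file, skipping blank lines."""
    records = []
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            records.append(json.loads(line))
            if limit is not None and len(records) >= limit:
                break
    return records


def summarize(records):
    """Map each field name to the number of records that contain it."""
    counts = {}
    for record in records:
        for key in record:
            counts[key] = counts.get(key, 0) + 1
    return counts


if __name__ == "__main__":
    path = Path("knowledge.jsonl")  # the output_path used in the command above
    if path.exists():
        sample = read_jsonl(path, limit=5)
        print(f"Read {len(sample)} records")
        print(summarize(sample))
```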
Tip
You can check the FlexRAG Entrypoints documentation for more information about the prepare_corpus command.
Tip
You can check the Preparing the Retriever documentation for how to build a retriever for your knowledge base.