For RAG Enhancement
Warning
After version 0.3.1 the output has been updated to logging
, which by default only outputs Warning and above. If you want to see the processing, set the logging
level to INFO:
import logging
httpx_logger = logging.getLogger("httpx")
httpx_logger.setLevel(logging.WARNING)
logging.basicConfig(level=logging.INFO)
For demonstration purposes, the following code examples all set the logging
level to INFO.
Client.pdfdeal
Process PDF files and convert them into files more suitable for the RAG system.
Caution
If you want to convert PDF files to other formats, please use the Client.pdf2file function
Parameters
Parameter | Type | Required | Default | Description |
---|---|---|---|---|
pdf_file | str or list | Yes | - | Input file path, or a list of input file paths |
output_format | str | No | "pdf" | Output format, accepts 'pdf' , 'md' or 'texts' . Default is "pdf" |
output_names | list | No | None | Custom output file names, must match the length of pdf_file . If the file name contains a folder path, the system will automatically create the corresponding folder structure. Default is None |
output_path | str | No | "./Output" | Output path. Default is "./Output" |
convert | bool | No | True | Whether to convert [ to $ , and [[ to $$ . Default is True |
Return Values
Returns a tuple (list1, list2, bool)
containing three elements in the same order as the input files:
list1
(list
): List of successfully processed file paths- Elements are the paths of processed files (strings)
- Empty string if processing failed
list2
(list
): List of failed files- Elements are dictionaries containing two keys:
'error'
: Error message (string)'path'
: Path of the failed file (string)
- Both keys have empty string values if processing succeeded
- Elements are dictionaries containing two keys:
bool
: Processing statusTrue
: At least one file processing failedFalse
: All files processed successfully
Notes
- The lengths of
list1
andlist2
are the same - When the output format is
"texts"
, text is returned directly without saving to a file
Example
Warning
Please ensure you have configured your key in environment variables as per the initialization section.
Warning
When the output format is PDF, the conversion process does not retain the original document's layout. The converted PDF only contains recognized text content and generates a new PDF according to the original page numbers. This approach may cause text to exceed page boundaries, affecting human reading. However, it does not affect RAG system content reading.
The advantage is that it retains the PDF page number where the text is located, making it easier to trace back in the RAG system.
Recognize all PDFs in a folder and output as recognized PDFs
To maintain the original file structure, use the built-in directory generation tool to generate paths for images that need processing:
from pdfdeal import Doc2X
from pdfdeal import get_files
client = Doc2X()
file_list, rename = get_files(path="tests/pdf", mode="pdf", out="pdf")
success, failed, flag = client.pdfdeal(
pdf_file=file_list,
output_path="./Output/test/multiple/pdfdeal",
output_names=rename,
)
print(success)
print(failed)
print(flag)
The file structure of ./tests/pdf
is:
pdf
├── sample_bad.pdf
├── sample.pdf
└── test
└── sampleB.pdf
Note that
sample_bad.pdf
is a corrupted file used for testing exception handling; it is normal for processing to fail.
Expected output:
Get exception Upload file error! 400:{"code":"bad_request","msg":"Invalid parameters or bad request"}. Retrying in 1 second.
Waiting for processing: 0% -- uuid: 49993be3-d3b6-4990-b8bf-9989a2942bfb
Processing file: 6% -- uuid: 0199cdd8-48b0-4987-a795-2dd11e73918e
Get exception Upload file error! 400:{"code":"bad_request","msg":"Invalid parameters or bad request"}. Retrying in 2 seconds.
Processing file: 6% -- uuid: 49993be3-d3b6-4990-b8bf-9989a2942bfb
Get exception Upload file error! 400:{"code":"bad_request","msg":"Invalid parameters or bad request"}. Retrying in 4 seconds.
PDFDEAL Progress: 2/3 files successfully processed.
-----
Failed deal with tests/pdf/sample_bad.pdf with error:
Upload file error! 400:{"code":"bad_request","msg":"Invalid parameters or bad request"}
-----
['./Output/test/multiple/pdfdeal/sample.pdf', '', './Output/test/multiple/pdfdeal/test/sampleB.pdf']
[{'error': '', 'path': ''}, {'error': Exception('Upload file error! 400:{"code":"bad_request","msg":"Invalid parameters or bad request"}'), 'path': 'tests/pdf/sample_bad.pdf'}, {'error': '', 'path': ''}]
True
Processed file structure:
pdfdeal
├── sample.pdf
└── test
└── sampleB.pdf