Processing PDF
Warning
After version 0.3.1, output goes through the logging module, which by default only shows messages at Warning level and above. To see processing progress, set the logging level to INFO:
import logging
httpx_logger = logging.getLogger("httpx")
httpx_logger.setLevel(logging.WARNING)
logging.basicConfig(level=logging.INFO)
For demonstration purposes, the following code examples all set the logging level to INFO.
Client.pdf2file
Convert one or more PDF files to a specified format.
Parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| pdf_file | str or list | Yes | - | Path to the PDF file or list of PDF file paths |
| output_path | str | No | "./Output" | Output folder path |
| output_names | list | No | None | List of output filenames, must be the same length as pdf_file. If filenames include folder paths, the system will automatically create the corresponding folder structure |
| output_format | str | No | "md_dollar" | Output format, optional values: "texts", "md", "md_dollar", "latex", "docx" |
| ocr | bool | No | True | Whether to use OCR |
| convert | bool | No | False | Whether to convert [ to $, and [[ to $$ (only effective when output_format is "texts") |
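Because output_names must be the same length as pdf_file, it can be useful to check this before calling the client. The following is a minimal sketch under the assumptions stated in the table only: pdf_file is a string or a list, and output_names is either None or a list. The helper name check_output_names is hypothetical and is not part of pdfdeal.

```python
from typing import List, Optional, Union


def check_output_names(
    pdf_file: Union[str, List[str]],
    output_names: Optional[List[str]] = None,
) -> List[str]:
    """Hypothetical pre-flight check; not part of pdfdeal.

    Returns pdf_file as a list and raises ValueError when output_names
    is given but its length differs from the number of PDF files.
    """
    # A single path is treated as a one-element list.
    files = [pdf_file] if isinstance(pdf_file, str) else list(pdf_file)
    if output_names is not None and len(output_names) != len(files):
        raise ValueError(
            f"output_names has {len(output_names)} entries, "
            f"but pdf_file has {len(files)}"
        )
    return files
```

Calling this helper before pdf2file surfaces a length mismatch locally, without depending on how the library itself reports it.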
Return Value
Returns a tuple containing three elements (list1, list2, status); both lists are in the same order as the input files:
- list1 (list): List of successfully processed files
  - Elements are paths to processed files (strings)
  - Empty string if processing failed
- list2 (list): List of failed-to-process files
  - Elements are dictionaries containing two keys:
    - 'error': Error message (string)
    - 'path': Path to the file that failed to process (string)
  - Both keys have empty string values if processing succeeded
- status (bool): Processing status
  - True: At least one file failed to process
  - False: All files processed successfully
Notes
- The lengths of list1 and list2 are the same
- When the output format is "texts", text is returned directly and not saved to a file
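Since list1 and list2 have the same length and follow the input order, the two lists can be read side by side. The sketch below pairs each input file with either its output or its error message. The helper pair_results is hypothetical (not part of pdfdeal) and assumes only the return structure described above.

```python
from typing import Dict, List, Tuple


def pair_results(
    pdf_files: List[str],
    list1: List[str],
    list2: List[Dict[str, str]],
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Hypothetical helper; not part of pdfdeal.

    Maps each input file to its output (converted) or to its error
    message (errors), relying on the shared input order.
    """
    converted: Dict[str, str] = {}
    errors: Dict[str, str] = {}
    for source, output, failure in zip(pdf_files, list1, list2):
        if output:  # non-empty string: processing succeeded
            converted[source] = output
        else:  # empty string: the matching list2 entry holds the error
            errors[source] = failure["error"]
    return converted, errors
```

With output_format set to "texts", the values in converted would be the returned text rather than file paths.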
Example
Tips
In the following example, sample_bad.pdf is a corrupted file, so it is normal for processing to fail.
Warning
Please make sure you have configured the key in environment variables as described in the initialization section.
Convert a single PDF to a LaTeX file and specify the output filename
from pdfdeal import Doc2X
client = Doc2X()
filepath, _, _ = client.pdf2file(
"tests/pdf/sample.pdf", output_names=["Folder/Test.zip"], output_format="latex"
)
print(filepath)
Example output when successful:
['./Output/Folder/Test.zip']
Example output when processing fails:
['']
Convert PDFs in a folder to DOCX files while maintaining original structure
To maintain the original file structure, use the built-in Directory Generation Tool to generate the paths of the files to be processed:
Warning
Note that the out parameter of get_files must match the output_format in the conversion function on this page!
from pdfdeal import Doc2X
from pdfdeal import get_files
Client = Doc2X()
file_list, rename_list = get_files(
path="./tests/pdf", mode="pdf", out="docx"
)
success, failed, flag = Client.pdf2file(
pdf_file=file_list,
output_path="./Output/newfolder",
output_names=rename_list,
output_format="docx",
)
print(success)
print(failed)
print(flag)
The file structure of ./tests/pdf is as follows:
pdf
├── sample_bad.pdf
├── sample.pdf
└── test
    └── sampleB.pdf
Note that sample_bad.pdf is a corrupted file used for testing error handling; it is normal for processing to fail.
Expected output:
PDF Progress: 2/3 files successfully processed.
-----
Failed deal with ./tests/pdf/sample_bad.pdf with error:
Error Upload file error! 400:{"code":"invalid request","msg":"bad params"}
-----
['./Output/newfolder/sample.docx', '', './Output/newfolder/test/sampleB.docx']
[{'error': '', 'path': ''}, {'error': 'Error Upload file error! 400:{"code":"invalid request","msg":"bad params"}', 'path': './tests/pdf/sample_bad.pdf'}, {'error': '', 'path': ''}]
True
And the resulting file structure:
Output
└── newfolder
    ├── sample.docx
    └── test
        └── sampleB.docx
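If some failures are expected to be transient, the failed list can be used to pick out the files worth submitting again, together with their output names. This is only a sketch: retry_inputs is a hypothetical helper (not part of pdfdeal), and it assumes the failed list is aligned with the input lists as described under Return Value. A genuinely corrupted file such as sample_bad.pdf would be expected to fail again.

```python
from typing import Dict, List, Tuple


def retry_inputs(
    pdf_files: List[str],
    output_names: List[str],
    failed: List[Dict[str, str]],
) -> Tuple[List[str], List[str]]:
    """Hypothetical helper; not part of pdfdeal.

    Keeps only the files (and matching output names) whose entry in
    `failed` carries a non-empty error message.
    """
    retry_files: List[str] = []
    retry_names: List[str] = []
    for source, name, failure in zip(pdf_files, output_names, failed):
        if failure["error"]:
            retry_files.append(source)
            retry_names.append(name)
    return retry_files, retry_names
```

The two returned lists have equal length, so they can be passed as pdf_file and output_names in a second pdf2file call.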