Processing PDF
Warning
After version 0.3.1 the output has been updated to logging
, which by default only outputs Warning and above. If you want to see the processing, set the logging
level to INFO:
import logging
httpx_logger = logging.getLogger("httpx")
httpx_logger.setLevel(logging.WARNING)
logging.basicConfig(level=logging.INFO)
For demonstration purposes, the following code examples all set the logging
level to INFO.
Client.pdf2file
Convert one or more PDF files to a specified format.
Parameters
Parameter | Type | Required | Default | Description |
---|---|---|---|---|
pdf_file | str or list | Yes | - | Path to the PDF file or list of PDF file paths |
output_path | str | No | "./Output" | Output folder path |
output_names | list | No | None | List of output filenames, must be the same length as pdf_file . If filenames include folder paths, the system will automatically create the corresponding folder structure |
output_format | str | No | "md_dollar" | Output format, optional values: "texts" , "md" , "md_dollar" , "latex" , "docx" |
ocr | bool | No | True | Whether to use OCR |
convert | bool | No | False | Whether to convert [ to $ , and [[ to $$ (only effective when output_format is "texts" ) |
Return Value
Returns a tuple containing three elements (list1, list2, status)
, in the same order as the input files:
list1
(list
): List of successfully processed files- Elements are paths to processed files (strings)
- Empty string if processing failed
list2
(list
): List of failed-to-process files- Elements are dictionaries containing two keys:
'error'
: Error message (string)'path'
: Path to the file that failed to process (string)
- Both keys have empty string values if processing succeeded
- Elements are dictionaries containing two keys:
status
(bool
): Processing statusTrue
: At least one file failed to processFalse
: All files processed successfully
Notes
- The lengths of
list1
andlist2
are the same - When the output format is
"texts"
, text is returned directly and not saved to a file
Example
Tips
In the following example, sample_bad.pdf
is a corrupted file, so it is normal for processing to fail.
Warning
Please make sure you have configured the key in environment variables as described in the initialization section.
Convert a single PDF to a LaTeX file and specify the output filename
from pdfdeal import Doc2X
client = Doc2X()
filepath, _, _ = client.pdf2file(
"tests/pdf/sample.pdf", output_names=["Folder/Test.zip"], output_format="latex"
)
print(filepath)
Example output when successful:
['./Output/Folder/Test.zip']
Example output when processing fails:
['']
Convert PDFs in a folder to DOCX files while maintaining original structure
In order to maintain the original file structure, use the built-in Directory Generation Tool to generate the paths of the images to be processed:
Warning
Note that the out
parameter of get_files
must match the output_format
in the conversion function on this page!
from pdfdeal import Doc2X
from pdfdeal import get_files
Client = Doc2X()
file_list, rename_list = get_files(
path="./tests/pdf", mode="pdf", out="docx"
)
success, failed, flag = Client.pdf2file(
pdf_file=file_list,
output_path="./Output/newfolder",
output_names=rename_list,
output_format="docx",
)
print(success)
print(failed)
print(flag)
The file structure of ./tests/pdf
is as follows:
pdf
├── sample_bad.pdf
├── sample.pdf
└── test
└── sampleB.pdf
Note that
sample_bad.pdf
is a corrupted file used for testing error handling; it is normal for processing to fail.
Expected output:
PDF Progress: 2/3 files successfully processed.
-----
Failed deal with ./tests/pdf/sample_bad.pdf with error:
Error Upload file error! 400:{"code":"invalid request","msg":"bad params"}
-----
['./Output/newfolder/sample.docx', '', './Output/newfolder/test/sampleB.docx']
[{'error': '', 'path': ''}, {'error': 'Error Upload file error! 400:{"code":"invalid request","msg":"bad params"}', 'path': './tests/pdf/sample_bad.pdf'}, {'error': '', 'path': ''}]
True
And the resulting file structure:
Output
└── newfolder
├── sample.docx
└── test
└── sampleB.docx