Processing PDF

Menghuan1918July 12, 2024About 2 min

Warning

After version 0.3.1 the output has been updated to logging, which by default only outputs Warning and above. If you want to see the processing, set the logging level to INFO:

import logging
httpx_logger = logging.getLogger("httpx")
httpx_logger.setLevel(logging.WARNING)
logging.basicConfig(level=logging.INFO)

For demonstration purposes, the following code examples all set the logging level to INFO.

`Client.pdf2file`

Convert one or more PDF files to a specified format.

Parameters

Parameter	Type	Required	Default	Description
`pdf_file`	`str` or `list`	Yes	-	Path to the PDF file or list of PDF file paths
`output_path`	`str`	No	`"./Output"`	Output folder path
`output_names`	`list`	No	`None`	List of output filenames, must be the same length as `pdf_file`. If filenames include folder paths, the system will automatically create the corresponding folder structure
`output_format`	`str`	No	`"md_dollar"`	Output format, optional values: `"texts"`, `"md"`, `"md_dollar"`, `"latex"`, `"docx"`
`ocr`	`bool`	No	`True`	Whether to use OCR
`convert`	`bool`	No	`False`	Whether to convert `[` to `$`, and `[[` to `$$` (only effective when `output_format` is `"texts"`)

Return Value

Returns a tuple containing three elements (list1, list2, status), in the same order as the input files:

list1 (list): List of successfully processed files
- Elements are paths to processed files (strings)
- Empty string if processing failed
list2 (list): List of failed-to-process files
- Elements are dictionaries containing two keys:
  - 'error': Error message (string)
  - 'path': Path to the file that failed to process (string)
- Both keys have empty string values if processing succeeded
status (bool): Processing status
- True: At least one file failed to process
- False: All files processed successfully

Notes

The lengths of list1 and list2 are the same
When the output format is "texts", text is returned directly and not saved to a file

Example

Tips

In the following example, sample_bad.pdf is a corrupted file, so it is normal for processing to fail.

Warning

Please make sure you have configured the key in environment variables as described in the initialization section.

Convert a single PDF to a LaTeX file and specify the output filename

from pdfdeal import Doc2X

client = Doc2X()
filepath, _, _ = client.pdf2file(
    "tests/pdf/sample.pdf", output_names=["Folder/Test.zip"], output_format="latex"
)
print(filepath)

Example output when successful:

['./Output/Folder/Test.zip']

Example output when processing fails:

['']

Convert PDFs in a folder to DOCX files while maintaining original structure

In order to maintain the original file structure, use the built-in Directory Generation Tool to generate the paths of the images to be processed:

Warning

Note that the out parameter of get_files must match the output_format in the conversion function on this page!

from pdfdeal import Doc2X
from pdfdeal import get_files
Client = Doc2X()
file_list, rename_list = get_files(
    path="./tests/pdf", mode="pdf", out="docx"
)
success, failed, flag = Client.pdf2file(
    pdf_file=file_list,
    output_path="./Output/newfolder",
    output_names=rename_list,
    output_format="docx",
)
print(success)
print(failed)
print(flag)

The file structure of ./tests/pdf is as follows:

pdf
├── sample_bad.pdf
├── sample.pdf
└── test
    └── sampleB.pdf

Note that sample_bad.pdf is a corrupted file used for testing error handling; it is normal for processing to fail.

Expected output:

PDF Progress: 2/3 files successfully processed.
-----
Failed deal with ./tests/pdf/sample_bad.pdf with error:
Error Upload file error! 400:{"code":"invalid request","msg":"bad params"}
-----
['./Output/newfolder/sample.docx', '', './Output/newfolder/test/sampleB.docx']
[{'error': '', 'path': ''}, {'error': 'Error Upload file error! 400:{"code":"invalid request","msg":"bad params"}', 'path': './tests/pdf/sample_bad.pdf'}, {'error': '', 'path': ''}]
True

And the resulting file structure:

Output
└── newfolder
    ├── sample.docx
    └── test
        └── sampleB.docx