deal_pdf
Use local OCR to recognize image text and clean up the format. Currently, built-in support includes: easyocr and pytesseract, of course, you can also customize the OCR function--this is also very simple.
deal_pdf
Process PDF files and use OCR to improve their readability, suitable for RAG (Retrieval-Augmented Generation).
Parameters
| Parameter | Type | Required | Default Value | Description |
|---|---|---|---|---|
pdf_file | str or list | Yes | - | Input PDF file path, supports string or string list |
output_format | str | No | "pdf" | Output format, optional values: "texts", "md", "pdf" |
output_names | list | No | None | Custom output file name list, length must be the same as pdf_file |
ocr | function or str | No | None | Custom OCR/tool function, uses easyocr if not defined. Optional values: "pytesseract" to use pytesseract, "pass" to skip OCR |
language | list | No | ["ch_sim", "en"] | Languages used by OCR, default value is ["ch_sim", "en"] (for easyocr), ["eng"] (for pytesseract) |
GPU | bool | No | False | Whether to use GPU in OCR, default value is False, not applicable for pytesseract |
output_path | str | No | "./Output" | Output folder path, used only when output format is "md" or "pdf" |
option | dict | No | {} | Options for OCR/tool |
Return Values
Returns a tuple containing three elements (list1, list2, status):
list1(list): List of successfully processed file paths- Elements are paths of processed files (strings)
- Empty string if processing failed
list2(list): List of failed files- Elements are dictionaries containing two keys:
'error': Error message (string)'file': Path of the failed file (string)
- Both keys are empty strings if processing succeeded
- Elements are dictionaries containing two keys:
status(bool): Processing statusTrue: At least one file processing failedFalse: All files processed successfully
Notes
- Lengths of both lists,
list1, andlist2, are the same - When the output format is
"texts", text is returned directly without saving to a file - The parameter ocr can be a custom OCR function or the name of a built-in OCR tool (such as
"easyocr"or"pytesseract") - If output_names is not None, successfully processed files will be renamed as specified
Using pytesseract
When using “pytesseract”, make sure tesseract is installed first tesseract:
pip install 'pdfdeal[pytesseract]'Example:
from pdfdeal import deal_pdf, get_files
files, rename = get_files("tests/pdf", "pdf", "md")
output_path, failed, flag = deal_pdf(
pdf_file=files,
output_format="md",
ocr="pytesseract",
language=["eng"],
output_path="Output",
output_names=rename,
)
for f in output_path:
print(f"Save processed file to {f}")Using easyocr:
pip install 'pdfdeal[easyocr]'Example: Since I am running on a device without CUDA acceleration, set GPU to False.
from pdfdeal import deal_pdf, get_files
files, rename = get_files("tests/pdf", "pdf", "md")
output_path, failed, flag = deal_pdf(
pdf_file=files,
output_format="md",
ocr="easyocr",
language=["en"],
GPU=False,
output_path="Output",
output_names=rename,
)
for f in output_path:
print(f"Save processed file to {f}")Custom OCR Function!
It’s very simple; you only need to customize a function:
def ocr(path, language:list, options: dict) -> Tuple[str, bool]:
# Your OCR implementation
return texts, All_DoneThe options will at least pass in {"GPU": GPU} information; here the GPU value is determined by the input parameters of deal_pdf. You need to implement OCR for this path file or folder and concatenate the results returned by OCR. For example, here is an example of a custom function that skips OCR:
from pdfdeal import deal_pdf, get_files
def ocr(path, language=["auto"], options: dict = None):
return "", True
files, rename = get_files("tests/pdf", "pdf", "md")
output_path, failed, flag = deal_pdf(
pdf_file=files,
output_format="md",
ocr=ocr,
output_path="Output",
output_names=rename,
)
for f in output_path:
print(f"Save processed file to {f}")Doc2X?
Please use Client.pdfdeal function; however it will be merged into this function in future versions.