PDF Processor

Provides PDF utilities such as searchability checks, OCR, table extraction, and splitting PDFs into images.

Quick Start

To get started, pass the PDF location in msg.payload.pdf_path and choose an operation. Configuration varies by operation type.

Input: msg.payload.pdf_path (string), the relative path of the PDF file on shared storage.

Example: "documents/report.pdf"
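
As a minimal sketch of the input described above, the snippet below builds a message that carries pdf_path and rejects absolute paths, since the block expects a path relative to shared storage. The helper name build_pdf_message, the dict-based message shape, and the validation rules are illustrative assumptions, not part of the block's API.

```python
from pathlib import PurePosixPath


def build_pdf_message(pdf_path: str) -> dict:
    """Return a message dict whose payload carries a relative PDF path.

    Hypothetical helper: the dict mirrors msg.payload.pdf_path from the
    text; the checks below are assumptions made for illustration.
    """
    path = PurePosixPath(pdf_path)
    # The documentation asks for a path relative to shared storage.
    if path.is_absolute():
        raise ValueError(f"pdf_path must be relative to shared storage: {pdf_path!r}")
    # Assumed sanity check: the file should look like a PDF.
    if path.suffix.lower() != ".pdf":
        raise ValueError(f"pdf_path should point to a .pdf file: {pdf_path!r}")
    return {"payload": {"pdf_path": str(path)}}


msg = build_pdf_message("documents/report.pdf")
print(msg)
```

Validating the path before the message reaches the block keeps path mistakes visible at the point where the message is created, rather than inside a later operation.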

Output (msg.payload) varies by operation: a boolean searchability flag, generated image filenames, OCR results, or table extraction results.

The PDF Processor block supports the following operations:

Searchable or Scanned (is_searchable)
- Input: msg.payload.pdf_path (string)
- Config: line_count (number)
- Output (msg.payload): true | false

Split PDF to Images (split_pdf_to_images)
- Input: msg.payload.pdf_path (string)
- Config: dpi (number)
- Output (msg.payload): array of generated image filenames (strings)

PDF OCR (pdf_ocr)
- Input: msg.payload.pdf_path (string)
- Optional input: msg.payload.pages (array of page numbers) for page-scoped OCR modes
- Config: ocr_type (string)
- Config: content_type (string, only for “extract content” modes)
- Config: width / height (numbers, only for “pages and words” modes)
- Output (msg.payload): varies by selected ocr_type

Table Extraction (table_extraction)
- Input: msg.payload.pdf_path (string)
- Config: Type of Extractor (one of: Algorithm - 1, Algorithm - 2, Default Extraction)
- Output (msg.payload): extracted table artifact(s), depending on the selected extractor
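
The per-operation settings listed above can be summarized in a small lookup table. The sketch below is only an illustration: the operation identifiers and config names are copied from the list, while extractor_type is a hypothetical key standing in for the "Type of Extractor" setting, and undocumented_config_keys is a made-up helper rather than anything the block provides.

```python
from typing import List

# Config names per operation, copied from the list above.
# "extractor_type" is a hypothetical key for the "Type of Extractor" setting.
OPERATION_CONFIG = {
    "is_searchable": {"line_count"},
    "split_pdf_to_images": {"dpi"},
    "pdf_ocr": {"ocr_type", "content_type", "width", "height"},
    "table_extraction": {"extractor_type"},
}


def undocumented_config_keys(operation: str, config: dict) -> List[str]:
    """Return config keys that the list above does not mention for an operation."""
    if operation not in OPERATION_CONFIG:
        raise ValueError(f"unknown operation: {operation!r}")
    # Set difference: keys supplied by the caller minus the documented ones.
    return sorted(set(config) - OPERATION_CONFIG[operation])


# The dpi value here is an arbitrary illustrative number.
print(undocumented_config_keys("split_pdf_to_images", {"dpi": 200, "line_count": 5}))
```

A check like this catches a setting that was meant for a different operation, such as line_count supplied to split_pdf_to_images.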