Process and manipulate PDF documents with various operations including text extraction, page manipulation, and metadata handling.

PDF Processor Block

The PDF Processor block performs operations on PDF documents, including text extraction, page manipulation, metadata handling, and document transformation.

Overview

Select a processing mode, supply one or more PDFs as input, and the block returns either extracted content or a processed PDF. It covers simple tasks, such as reading metadata, as well as multi-step ones, such as splitting a document and extracting text from each part.

Configuration Options

Processing Mode

Choose the type of PDF operation to perform:

Text Extraction: Extract text content from PDF pages
Page Operations: Split, merge, rotate, or extract specific pages
Metadata Operations: Read or modify document metadata
Document Conversion: Convert PDF to other formats
Security Operations: Add or remove password protection
Form Processing: Extract or fill PDF form data

Text Extraction Options

When extracting text:

Page Range: Specify which pages to process (e.g., "1-5", "all", "last")
Text Format: Choose output format (plain text, structured text, JSON)
Include Coordinates: Extract text with position information
Language Detection: Automatically detect document language
OCR Fallback: Use OCR for scanned PDFs when text extraction fails
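
The Page Range values above ("1-5", "all", "last") could be resolved into concrete page numbers along the following lines. This is a minimal sketch, not the block's actual parser: the function name `parsePageRange`, the support for comma-separated lists, and the clamping of ranges to the document length are all assumptions made for illustration.

```javascript
// Hypothetical helper: the block's real range parser is not documented.
// Converts a page-range string into a sorted list of 1-based page numbers.
function parsePageRange(spec, pageCount) {
  const value = String(spec).trim().toLowerCase();
  if (value === "all") {
    return Array.from({ length: pageCount }, (_, i) => i + 1);
  }
  if (value === "last") {
    return pageCount > 0 ? [pageCount] : [];
  }
  const pages = new Set();
  // Assumption: several ranges may be combined with commas, e.g. "2,4-5".
  for (const part of value.split(",")) {
    const match = /^\s*(\d+)\s*(?:-\s*(\d+)\s*)?$/.exec(part);
    if (!match) {
      throw new Error(`Invalid page range: "${part.trim()}"`);
    }
    const start = Number(match[1]);
    const end = match[2] === undefined ? start : Number(match[2]);
    if (start < 1 || end < start) {
      throw new Error(`Invalid page range: "${part.trim()}"`);
    }
    // Clamp to the document length so "1-5" on a 3-page file yields 1..3.
    for (let page = start; page <= Math.min(end, pageCount); page++) {
      pages.add(page);
    }
  }
  return [...pages].sort((a, b) => a - b);
}
```

For example, `parsePageRange("1-5", 3)` resolves to pages 1 through 3, and `parsePageRange("last", 7)` resolves to page 7 only.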

Page Operations

For page manipulation:

Split Pages: Divide PDF into separate documents
Merge Documents: Combine multiple PDFs into one
Rotate Pages: Rotate pages by 90, 180, or 270 degrees
Extract Pages: Extract specific pages to new documents
Insert Pages: Add pages from other documents
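
Because Rotate Pages accepts only 90, 180, or 270 degrees, a workflow that computes the angle dynamically may want to validate it first and work out the resulting orientation. The sketch below shows one way to do that; `applyRotation` and `ALLOWED_ROTATIONS` are hypothetical names, and the block may handle page orientation differently internally.

```javascript
// Hypothetical helper, not part of the block's documented API.
const ALLOWED_ROTATIONS = [90, 180, 270];

// Returns the page's new orientation (0, 90, 180, or 270) after rotating
// clockwise by rotateBy degrees from its current orientation.
function applyRotation(currentAngle, rotateBy) {
  if (!ALLOWED_ROTATIONS.includes(rotateBy)) {
    throw new RangeError(`Rotation must be one of ${ALLOWED_ROTATIONS.join(", ")}`);
  }
  return (currentAngle + rotateBy) % 360;
}
```

Rotating a page that is already at 270 degrees by another 180 leaves it at 90, since orientations wrap around at 360.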

Metadata Configuration

Read Metadata: Extract document properties (title, author, creation date)
Update Metadata: Modify document information
Custom Properties: Add or modify custom metadata fields
Preserve Original: Keep original metadata when possible
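
The interaction between Update Metadata, Custom Properties, and Preserve Original can be illustrated with a small merge function. This is a sketch under stated assumptions: the name `mergeMetadata`, the `custom` key for custom properties, and the reading of Preserve Original as "keep every original field that the update does not mention" are illustrative choices, not documented behavior.

```javascript
// Hypothetical helper showing one possible reading of "Preserve Original".
function mergeMetadata(original, updates, preserveOriginal = true) {
  // When preserving, start from the original fields; otherwise start empty.
  const base = preserveOriginal ? { ...original } : {};
  // Custom properties are merged key by key rather than replaced wholesale.
  const custom = { ...(base.custom || {}), ...(updates.custom || {}) };
  const merged = { ...base, ...updates };
  if (Object.keys(custom).length > 0) {
    merged.custom = custom;
  }
  return merged;
}
```

With preservation enabled, updating only the title keeps the original author and any existing custom properties; with it disabled, the result contains only the supplied fields.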

How It Works

Input Processing

Document Validation: Verifies PDF format and accessibility
Security Check: Handles password-protected documents
Page Analysis: Analyzes document structure and content
Operation Selection: Applies the configured processing mode
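
A common first check in document validation is the file signature: PDF files begin with the bytes `%PDF-`. The sketch below shows such a pre-check that a workflow could run before handing data to the block. The function name `looksLikePdf` is hypothetical, the check is deliberately strict (some PDF readers tolerate a few leading bytes before the header), and it says nothing about whether the rest of the file is intact.

```javascript
// Hypothetical pre-check: true when the data starts with the "%PDF-" signature.
// Accepts a Uint8Array (Node.js Buffers are Uint8Array instances).
function looksLikePdf(bytes) {
  const signature = [0x25, 0x50, 0x44, 0x46, 0x2d]; // "%PDF-"
  if (!(bytes instanceof Uint8Array) || bytes.length < signature.length) {
    return false;
  }
  return signature.every((byte, i) => bytes[i] === byte);
}
```

Running this before the block lets a flow reject obviously wrong inputs, such as an HTML error page saved with a .pdf extension, without spending processing time on them.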

Text Extraction Process

Page Parsing: Extracts text from PDF pages
Layout Analysis: Preserves document structure and formatting
Text Cleaning: Removes artifacts and normalizes text
Coordinate Mapping: Maps text to page positions (if enabled)
Output Formatting: Structures text according to selected format
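
The final Output Formatting step might look roughly like the following. Only the value `"plain"` appears in this page's configuration examples; the identifiers `"structured"` and `"json"`, the per-page `{ page, text }` shape, and the function name `formatExtractedText` are assumptions made for this sketch.

```javascript
// Hypothetical formatter. `pages` is assumed to be an array of
// { page: <1-based number>, text: <string> } objects.
function formatExtractedText(pages, textFormat = "plain") {
  switch (textFormat) {
    case "plain":
      // Concatenate page texts, separated by a blank line.
      return pages.map((p) => p.text).join("\n\n");
    case "structured":
      // Keep page boundaries visible with a simple marker line.
      return pages.map((p) => `[Page ${p.page}]\n${p.text}`).join("\n\n");
    case "json":
      return JSON.stringify({ pages }, null, 2);
    default:
      throw new Error(`Unsupported text format: ${textFormat}`);
  }
}
```

Plain output suits downstream steps that only need the words, while the structured and JSON forms keep page boundaries available for later processing.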

Page Manipulation Process

Document Loading: Loads source PDF documents
Page Selection: Identifies pages to process
Operation Execution: Performs the specified page operation
Document Assembly: Creates new document structure
Output Generation: Produces processed PDF or extracted content

Use Cases

Document Text Extraction

Extract text from PDF documents for further processing:

PDF Processor (Text Extraction) → Text Processor → LLM Query

Document Splitting

Split large PDFs into smaller documents:

PDF Processor (Split Pages) → Multiple PDF outputs → Storage

Document Merging

Combine multiple PDFs into a single document:

Multiple PDF inputs → PDF Processor (Merge) → Single PDF output

Metadata Processing

Extract and process document metadata:

PDF Processor (Read Metadata) → Change (Process metadata) → Storage

Document Conversion

Convert PDFs to other formats:

PDF Processor (Convert) → Text/Image output → Further processing

Configuration Examples

Basic Text Extraction

{
  "mode": "text_extraction",
  "page_range": "all",
  "text_format": "plain",
  "include_coordinates": false,
  "ocr_fallback": true
}

Page Splitting

{
  "mode": "page_operations",
  "operation": "split",
  "split_method": "by_page_count",
  "pages_per_document": 10
}
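
To see what a `by_page_count` split with `pages_per_document` implies, the sketch below computes the page range that each output document would cover. The function `planSplit` is a hypothetical illustration of the arithmetic, not the block's implementation.

```javascript
// Hypothetical helper: lists the 1-based page range of each output document.
function planSplit(totalPages, pagesPerDocument) {
  if (!Number.isInteger(pagesPerDocument) || pagesPerDocument < 1) {
    throw new RangeError("pagesPerDocument must be a positive integer");
  }
  const ranges = [];
  for (let start = 1; start <= totalPages; start += pagesPerDocument) {
    // The last document may be shorter than the others.
    ranges.push({ start, end: Math.min(start + pagesPerDocument - 1, totalPages) });
  }
  return ranges;
}
```

A 25-page PDF split with `pages_per_document` set to 10 would yield three documents covering pages 1-10, 11-20, and 21-25.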

Metadata Extraction

{
  "mode": "metadata_operations",
  "operation": "read",
  "include_custom_properties": true,
  "output_format": "json"
}

Document Merging

{
  "mode": "page_operations",
  "operation": "merge",
  "merge_order": "by_filename",
  "preserve_bookmarks": true
}
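
When merging with `merge_order` set to `by_filename`, the resulting page order depends on how filenames are compared. This page does not say whether the block sorts "part10.pdf" before or after "part2.pdf", so a flow that needs a predictable order could sort its inputs before merging. The sketch below uses natural (numeric-aware) ordering; `orderByFilename` and the `filename` property are assumed names.

```javascript
// Hypothetical pre-sort: orders documents by filename using numeric-aware
// comparison, so "part2.pdf" comes before "part10.pdf".
function orderByFilename(documents) {
  const collator = new Intl.Collator("en", { numeric: true, sensitivity: "base" });
  // Copy the array so the caller's input is left untouched.
  return [...documents].sort((a, b) => collator.compare(a.filename, b.filename));
}
```

Zero-padding filenames (part01.pdf, part02.pdf, and so on) is another way to get the same order without relying on the comparison method.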

Advanced Features

Batch Processing

Process multiple PDFs in a single operation:

// Process multiple documents.
// processPDF is not defined here; it stands for your own per-document processing function.
const documents = Array.isArray(msg.payload.documents) ? msg.payload.documents : [];
const results = documents.map((doc) => processPDF(doc));

msg.payload = {
  processed_documents: results,
  total_count: documents.length,
};
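
In a batch, one unreadable file should usually not abort the remaining documents. The following sketch wraps each call in a try/catch and reports failures separately. The function `processBatch`, the `failures` field, and the `processOne` callback are illustrative names that do not come from the block itself.

```javascript
// Hypothetical batch wrapper: isolates per-document failures.
// processOne is whatever function handles a single document.
function processBatch(documents, processOne) {
  const results = [];
  const failures = [];
  documents.forEach((doc, index) => {
    try {
      results.push(processOne(doc));
    } catch (err) {
      // Record which input failed and why, then continue with the rest.
      failures.push({ index, message: err.message });
    }
  });
  return {
    processed_documents: results,
    failures,
    total_count: documents.length,
  };
}
```

The returned object keeps the `processed_documents` and `total_count` fields from the example above, so downstream steps can compare the number of results against the total to detect partial failures.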

Custom Text Processing

Apply custom text processing during extraction:

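// Custom text cleaning
const extractedText = String(msg.payload.text || "");
const cleanedText = extractedText
  // Drop symbols, but keep letters and digits in any script (\w alone is ASCII-only).
  .replace(/[^\p{L}\p{N}_\s.,!?]/gu, "")
  // Collapse whitespace after removal so no double spaces are left behind.
  .replace(/\s+/g, " ")
  .trim();

msg.payload.text = cleanedText;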

Error Handling

Handle various PDF processing errors:

// Error handling
if (msg.error) {
  switch (msg.error.code) {
    case "PASSWORD_REQUIRED":
      msg.payload = { error: "Password required", action: "request_password" };
      break;
    case "CORRUPTED_PDF":
      msg.payload = { error: "Corrupted PDF", action: "skip_document" };
      break;
    default:
      msg.payload = { error: "Processing failed", action: "retry" };
  }
}
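
The default branch above answers every unknown error with "retry", which could loop indefinitely if a document keeps failing. One way to bound this is to count attempts on the message, as sketched below. The `retry_count` field, the `decideRetry` function, and the limit of three attempts are assumptions for illustration, not part of the block.

```javascript
// Hypothetical retry guard: switches from "retry" to "skip_document"
// once the message has been retried maxRetries times.
function decideRetry(msg, maxRetries = 3) {
  const retryCount = msg.retry_count || 0;
  if (retryCount >= maxRetries) {
    return {
      ...msg,
      payload: { error: "Processing failed", action: "skip_document" },
    };
  }
  return {
    ...msg,
    retry_count: retryCount + 1,
    payload: { error: "Processing failed", action: "retry" },
  };
}
```

A fresh message comes back with `retry_count` set to 1 and the action "retry"; a message that has already been retried three times comes back with the action "skip_document".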

Performance Considerations

Large Document Handling

Memory Management: Process large PDFs in chunks
Page Limits: Set reasonable page limits for processing
Timeout Settings: Configure appropriate timeouts
Resource Monitoring: Monitor memory and CPU usage

Optimization Tips

Use page range selection to process only needed pages
Enable OCR fallback only when necessary
Cache processed results for repeated operations
Use appropriate text format for your use case
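
Caching processed results, as suggested above, needs a key that identifies both the document and the configuration used. The sketch below builds such a key and keeps results in an in-memory Map. `createResultCache`, `stableStringify`, and the use of a document identifier are assumptions; a production setup would also need eviction and invalidation when a document changes.

```javascript
// Serializes a config with sorted keys so that key order does not
// produce different cache keys for the same settings.
function stableStringify(value) {
  if (Array.isArray(value)) {
    return `[${value.map(stableStringify).join(",")}]`;
  }
  if (value && typeof value === "object") {
    const entries = Object.keys(value)
      .sort()
      .map((key) => `${JSON.stringify(key)}:${stableStringify(value[key])}`);
    return `{${entries.join(",")}}`;
  }
  return JSON.stringify(value);
}

// Hypothetical cache wrapper around any (documentId, config) => result function.
function createResultCache(processFn) {
  const cache = new Map();
  return function cachedProcess(documentId, config) {
    const key = `${documentId}|${stableStringify(config)}`;
    if (!cache.has(key)) {
      cache.set(key, processFn(documentId, config));
    }
    return cache.get(key);
  };
}
```

Wrapping a processing function with `createResultCache` means that repeated requests for the same document and settings reuse the stored result, while any change to the configuration triggers a fresh run.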

Tips

Test with Sample Documents: Verify processing with representative PDFs
Use Appropriate Modes: Select the right processing mode for your needs
Handle Errors Gracefully: Implement proper error handling
Optimize for Performance: Use page ranges and batch processing
Preserve Document Structure: Use structured text formats when needed
Monitor Resource Usage: Keep track of memory and processing time

Related Blocks

OCR - For scanned PDF text extraction
Text Processor - For post-processing extracted text
Storage Blocks - For saving processed documents
Template Matcher - For document template matching
