
PDF Processor

Process and manipulate PDF documents with various operations including text extraction, page manipulation, and metadata handling.

PDF Processor Block

The PDF Processor block performs operations on PDF documents: text extraction, page manipulation, metadata handling, security and form operations, and conversion to other formats.

Overview

Use the PDF Processor block wherever a workflow needs to read, restructure, or convert PDFs, from a single text extraction to splitting and merging multiple documents.

Configuration Options

Processing Mode

Choose the type of PDF operation to perform:

  • Text Extraction: Extract text content from PDF pages
  • Page Operations: Split, merge, rotate, or extract specific pages
  • Metadata Operations: Read or modify document metadata
  • Document Conversion: Convert PDF to other formats
  • Security Operations: Add or remove password protection
  • Form Processing: Extract or fill PDF form data

Text Extraction Options

When extracting text:

  • Page Range: Specify which pages to process (e.g., "1-5", "all", "last")
  • Text Format: Choose output format (plain text, structured text, JSON)
  • Include Coordinates: Extract text with position information
  • Language Detection: Automatically detect document language
  • OCR Fallback: Use OCR for scanned PDFs when text extraction fails
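The Page Range values listed above ("1-5", "all", "last") could be resolved to concrete page numbers along the lines of the following sketch. The helper name `parsePageRange` and the comma-separated form are assumptions made for illustration; the block's actual parsing rules may differ.

```javascript
// Hypothetical helper (not part of the block's API): resolve a page-range
// string to a sorted list of 1-based page numbers.
// Accepts "all", "last", single pages ("3"), and ranges ("1-5").
// Comma-separated lists such as "2, 4-6" are an assumed extension.
function parsePageRange(spec, pageCount) {
  const text = String(spec).trim().toLowerCase();
  if (text === "all") {
    return Array.from({ length: pageCount }, (_, i) => i + 1);
  }
  if (text === "last") {
    return pageCount > 0 ? [pageCount] : [];
  }
  const pages = new Set();
  for (const part of text.split(",")) {
    const match = /^\s*(\d+)\s*(?:-\s*(\d+)\s*)?$/.exec(part);
    if (!match) {
      throw new Error(`Invalid page range: "${part}"`);
    }
    const start = Number(match[1]);
    const end = match[2] === undefined ? start : Number(match[2]);
    if (start < 1 || end < start) {
      throw new Error(`Invalid page range: "${part}"`);
    }
    // Pages beyond the end of the document are silently ignored.
    for (let p = start; p <= Math.min(end, pageCount); p++) {
      pages.add(p);
    }
  }
  return [...pages].sort((a, b) => a - b);
}
```

For example, `parsePageRange("2, 4-6", 5)` returns `[2, 4, 5]`, because page 6 does not exist in a five-page document.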

Page Operations

For page manipulation:

  • Split Pages: Divide PDF into separate documents
  • Merge Documents: Combine multiple PDFs into one
  • Rotate Pages: Rotate pages by 90, 180, or 270 degrees
  • Extract Pages: Extract specific pages to new documents
  • Insert Pages: Add pages from other documents
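Because Rotate Pages only accepts 90, 180, or 270 degrees, the resulting page orientation is always a multiple of 90 in the range 0 to 270. A minimal sketch of that arithmetic follows; `applyRotation` is a hypothetical name, and the function assumes the page's current rotation is already 0, 90, 180, or 270.

```javascript
// Hypothetical helper: combine a page's current rotation with a requested
// rotation and normalize the result to 0, 90, 180, or 270 degrees.
// Assumes currentDegrees is itself one of those four values.
function applyRotation(currentDegrees, rotateBy) {
  if (![90, 180, 270].includes(rotateBy)) {
    throw new RangeError("rotateBy must be 90, 180, or 270");
  }
  return (currentDegrees + rotateBy) % 360;
}
```

Rotating a page that is already at 270 degrees by another 180 gives `applyRotation(270, 180)`, which evaluates to `90`.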

Metadata Configuration

  • Read Metadata: Extract document properties (title, author, creation date)
  • Update Metadata: Modify document information
  • Custom Properties: Add or modify custom metadata fields
  • Preserve Original: Keep original metadata when possible

How It Works

Input Processing

  1. Document Validation: Verifies PDF format and accessibility
  2. Security Check: Handles password-protected documents
  3. Page Analysis: Analyzes document structure and content
  4. Operation Selection: Applies the configured processing mode
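As an illustration of the Document Validation step, a first check might confirm that the input begins with the standard PDF header (`%PDF-` followed by a version number). This is only a sketch: `readPdfVersion` is a hypothetical name, and a real validator would also have to inspect the rest of the file structure.

```javascript
// Minimal sketch: return the version from a leading "%PDF-<major>.<minor>"
// header, or null if the buffer does not start with one.
// Simplification: only the very first bytes are inspected.
function readPdfVersion(buffer) {
  const header = buffer.subarray(0, 8).toString("latin1");
  const match = /^%PDF-(\d)\.(\d)/.exec(header);
  return match ? `${match[1]}.${match[2]}` : null;
}
```

A document that fails this check could be routed straight to the error path instead of being parsed.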

Text Extraction Process

  1. Page Parsing: Extracts text from PDF pages
  2. Layout Analysis: Preserves document structure and formatting
  3. Text Cleaning: Removes artifacts and normalizes text
  4. Coordinate Mapping: Maps text to page positions (if enabled)
  5. Output Formatting: Structures text according to selected format
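The final Output Formatting step can be pictured as a function that turns positioned text items into the selected format. The sketch below assumes a hypothetical intermediate shape (`{ page, items: [{ text, x, y }] }`) and hypothetical option names; neither is taken from the block's actual output.

```javascript
// Hypothetical input shape (an assumption for this sketch):
//   [{ page: 1, items: [{ text: "Hello", x: 72, y: 700 }] }]
function formatExtraction(pages, options = {}) {
  const { textFormat = "plain", includeCoordinates = false } = options;

  if (textFormat === "plain") {
    // One paragraph per page, items joined by single spaces.
    return pages
      .map((p) => p.items.map((item) => item.text).join(" "))
      .join("\n\n");
  }

  if (textFormat === "json") {
    // Keep positions only when coordinates were requested.
    const out = pages.map((p) => ({
      page: p.page,
      items: p.items.map((item) =>
        includeCoordinates
          ? { text: item.text, x: item.x, y: item.y }
          : { text: item.text }
      ),
    }));
    return JSON.stringify(out);
  }

  throw new Error(`Unsupported text format: ${textFormat}`);
}
```

Keeping coordinates optional keeps the JSON output small when position information is not needed downstream.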

Page Manipulation Process

  1. Document Loading: Loads source PDF documents
  2. Page Selection: Identifies pages to process
  3. Operation Execution: Performs the specified page operation
  4. Document Assembly: Creates new document structure
  5. Output Generation: Produces processed PDF or extracted content

Use Cases

Document Text Extraction

Extract text from PDF documents for further processing:

PDF Processor (Text Extraction) → Text Processor → LLM Query

Document Splitting

Split large PDFs into smaller documents:

PDF Processor (Split Pages) → Multiple PDF outputs → Storage

Document Merging

Combine multiple PDFs into a single document:

Multiple PDF inputs → PDF Processor (Merge) → Single PDF output

Metadata Processing

Extract and process document metadata:

PDF Processor (Read Metadata) → Change (Process metadata) → Storage

Document Conversion

Convert PDFs to other formats:

PDF Processor (Convert) → Text/Image output → Further processing

Configuration Examples

Basic Text Extraction

{
  "mode": "text_extraction",
  "page_range": "all",
  "text_format": "plain",
  "include_coordinates": false,
  "ocr_fallback": true
}

Page Splitting

{
  "mode": "page_operations",
  "operation": "split",
  "split_method": "by_page_count",
  "pages_per_document": 10
}
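To make the effect of `pages_per_document` concrete, the following sketch computes which page ranges a split by page count would produce. `planSplit` is a hypothetical helper for illustration; it does not touch any PDF data and only shows the arithmetic.

```javascript
// Hypothetical helper: list the 1-based page ranges produced when a document
// with totalPages pages is split into parts of pagesPerDocument pages each.
function planSplit(totalPages, pagesPerDocument) {
  if (!Number.isInteger(pagesPerDocument) || pagesPerDocument < 1) {
    throw new RangeError("pagesPerDocument must be a positive integer");
  }
  const ranges = [];
  for (let first = 1; first <= totalPages; first += pagesPerDocument) {
    // The last part may be shorter than the others.
    const last = Math.min(first + pagesPerDocument - 1, totalPages);
    ranges.push({ first, last });
  }
  return ranges;
}
```

With the configuration above, a 25-page document would yield three parts: pages 1-10, 11-20, and 21-25.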

Metadata Extraction

{
  "mode": "metadata_operations",
  "operation": "read",
  "include_custom_properties": true,
  "output_format": "json"
}

Document Merging

{
  "mode": "page_operations",
  "operation": "merge",
  "merge_order": "by_filename",
  "preserve_bookmarks": true
}
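When merging `by_filename`, the order depends on how names are compared. The documentation does not say whether the block sorts names lexicographically or numerically, so the sketch below simply shows one reasonable choice, a numeric-aware comparison that places `part2.pdf` before `part10.pdf`. `orderByFilename` is a hypothetical name.

```javascript
// Hypothetical helper: return a sorted copy of the filenames using a
// numeric-aware, case-insensitive comparison.
function orderByFilename(filenames) {
  return [...filenames].sort((a, b) =>
    a.localeCompare(b, "en", { numeric: true, sensitivity: "base" })
  );
}
```

If your inputs rely on a specific order, zero-padded names such as `part02.pdf` sort the same way under either comparison.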

Advanced Features

Batch Processing

Process multiple PDFs in a single operation:

// Process multiple documents.
// processPDF is a placeholder for your own per-document processing function.
const documents = Array.isArray(msg.payload.documents)
  ? msg.payload.documents
  : [];
const results = documents.map((doc) => processPDF(doc));

msg.payload = {
  processed_documents: results,
  total_count: documents.length,
};

Custom Text Processing

Apply custom text processing during extraction:

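// Custom text cleaning
// Note: \w matches only ASCII letters, digits, and underscore, so accented
// characters, hyphens, and quotes are removed as well.
const extractedText = msg.payload.text || "";
const cleanedText = extractedText
  .replace(/[^\w\s.,!?]/g, "") // drop characters outside the allowed set
  .replace(/\s+/g, " ") // then collapse the whitespace runs left behind
  .trim();

msg.payload.text = cleanedText;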

Error Handling

Handle various PDF processing errors:

// Error handling
if (msg.error) {
  switch (msg.error.code) {
    case "PASSWORD_REQUIRED":
      msg.payload = { error: "Password required", action: "request_password" };
      break;
    case "CORRUPTED_PDF":
      msg.payload = { error: "Corrupted PDF", action: "skip_document" };
      break;
    default:
      msg.payload = { error: "Processing failed", action: "retry" };
  }
}
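Since the default branch above always answers with "retry", a persistent failure could be retried indefinitely. One way to bound this, shown here as a sketch, is to track the attempt number and stop after a limit. The function name `nextAction`, the `maxAttempts` parameter, and the "give_up" action are assumptions for illustration and not part of the block.

```javascript
// Hypothetical helper: choose the follow-up action for an error code,
// retrying unknown errors only while attempts remain.
function nextAction(errorCode, attempt, maxAttempts = 3) {
  switch (errorCode) {
    case "PASSWORD_REQUIRED":
      return "request_password";
    case "CORRUPTED_PDF":
      return "skip_document";
    default:
      // attempt is 1-based: the first failure is attempt 1.
      return attempt < maxAttempts ? "retry" : "give_up";
  }
}
```

The attempt counter would need to travel with the message, for example as an extra property that the workflow increments on each retry.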

Performance Considerations

Large Document Handling

  • Memory Management: Process large PDFs in chunks
  • Page Limits: Set reasonable page limits for processing
  • Timeout Settings: Configure appropriate timeouts
  • Resource Monitoring: Monitor memory and CPU usage
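How the block exposes its timeout setting is not described here. If you wrap PDF processing in your own asynchronous code, a timeout can be sketched with `Promise.race`, as below; `withTimeout` is a hypothetical helper.

```javascript
// Hypothetical helper: reject if the given promise does not settle within
// ms milliseconds. The timer is always cleared so it cannot keep the
// process alive after the work has finished.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`Timed out after ${ms} ms`)),
      ms
    );
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}
```

Note that this only stops waiting for the result; the underlying work is not cancelled unless the processing code supports cancellation itself.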

Optimization Tips

  • Use page range selection to process only needed pages
  • Enable OCR fallback only when necessary
  • Cache processed results for repeated operations
  • Use appropriate text format for your use case

Common Issues and Solutions

Password-Protected PDFs

Issue: Cannot process password-protected documents.
Solution: Configure password handling or use the Security Operations mode.

Corrupted PDFs

Issue: Processing fails on corrupted documents.
Solution: Enable error handling and fallback options.

Large File Processing

Issue: Memory issues with large PDFs.
Solution: Use page range selection and batch processing.

Text Extraction Quality

Issue: Poor text extraction results.
Solution: Enable OCR fallback and coordinate mapping.

Tips

  • Test with Sample Documents: Verify processing with representative PDFs
  • Use Appropriate Modes: Select the right processing mode for your needs
  • Handle Errors Gracefully: Implement proper error handling
  • Optimize for Performance: Use page ranges and batch processing
  • Preserve Document Structure: Use structured text formats when needed
  • Monitor Resource Usage: Keep track of memory and processing time