PDF Processor
Process and manipulate PDF documents with various operations including text extraction, page manipulation, and metadata handling.
PDF Processor Block
The PDF Processor block is designed to handle various operations on PDF documents, including text extraction, page manipulation, metadata handling, and document transformation. It provides comprehensive PDF processing capabilities for document workflows.
Overview
Use the PDF Processor block wherever a workflow needs to read, restructure, or convert PDFs: extracting text for downstream processing, splitting or merging files, reading or updating document properties, or converting a PDF to another format.
Configuration Options
Processing Mode
Choose the type of PDF operation to perform:
- Text Extraction: Extract text content from PDF pages
- Page Operations: Split, merge, rotate, or extract specific pages
- Metadata Operations: Read or modify document metadata
- Document Conversion: Convert PDF to other formats
- Security Operations: Add or remove password protection
- Form Processing: Extract or fill PDF form data
Text Extraction Options
When extracting text:
- Page Range: Specify which pages to process (e.g., "1-5", "all", "last")
- Text Format: Choose output format (plain text, structured text, JSON)
- Include Coordinates: Extract text with position information
- Language Detection: Automatically detect document language
- OCR Fallback: Use OCR for scanned PDFs when text extraction fails
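The Page Range option can be illustrated with a small resolver. This is only a sketch: `parsePageRange` is a hypothetical helper, and the accepted forms ("all", "last", single pages, "N-M", comma-separated lists) are assumptions modeled on the examples above rather than the block's documented grammar.

```javascript
// Hypothetical helper: resolve a page range string into 1-based page numbers.
// The accepted grammar is an assumption, not the block's documented behavior.
function parsePageRange(range, totalPages) {
  const spec = String(range).trim().toLowerCase();
  if (spec === "all") {
    return Array.from({ length: totalPages }, (_, i) => i + 1);
  }
  if (spec === "last") {
    return totalPages > 0 ? [totalPages] : [];
  }
  const pages = new Set();
  for (const part of spec.split(",")) {
    const match = /^\s*(\d+)\s*(?:-\s*(\d+))?\s*$/.exec(part);
    if (!match) {
      throw new Error("Invalid page range: " + part);
    }
    const start = Number(match[1]);
    const end = match[2] === undefined ? start : Number(match[2]);
    if (start < 1 || end < start) {
      throw new Error("Invalid page range: " + part);
    }
    // Clamp to the document length so "1-500" on a 10-page file is harmless.
    for (let p = start; p <= Math.min(end, totalPages); p++) {
      pages.add(p);
    }
  }
  return [...pages].sort((a, b) => a - b);
}
```

Resolving the range up front lets later steps work with a plain list of page numbers, whatever form the user typed.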
Page Operations
For page manipulation:
- Split Pages: Divide PDF into separate documents
- Merge Documents: Combine multiple PDFs into one
- Rotate Pages: Rotate pages by 90, 180, or 270 degrees
- Extract Pages: Extract specific pages to new documents
- Insert Pages: Add pages from other documents
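As a rough illustration of the Rotate Pages option, the sketch below combines a page's existing rotation with a requested turn. `rotatePage` is a hypothetical helper, and treating rotations as cumulative is an assumption about how the block behaves.

```javascript
// Hypothetical helper: combine a page's current rotation with a requested turn.
// Only the three turns listed above are accepted.
const ALLOWED_TURNS = [90, 180, 270];

function rotatePage(currentRotation, turn) {
  if (!ALLOWED_TURNS.includes(turn)) {
    throw new Error("Rotation must be 90, 180, or 270 degrees");
  }
  // Rotations accumulate modulo 360, so 270 followed by 180 ends up at 90.
  return (((currentRotation + turn) % 360) + 360) % 360;
}
```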
Metadata Configuration
- Read Metadata: Extract document properties (title, author, creation date)
- Update Metadata: Modify document information
- Custom Properties: Add or modify custom metadata fields
- Preserve Original: Keep original metadata when possible
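One possible reading of how Update Metadata and Preserve Original interact is sketched below. `mergeMetadata` is a hypothetical helper, and the rule that existing non-empty values win when preservation is on is an assumption, since the option's exact semantics are not specified here.

```javascript
// Hypothetical helper: merge metadata updates into the original properties.
// Assumption: with preserveOriginal set, existing non-empty values are kept;
// otherwise the update overwrites them.
function mergeMetadata(original, updates, preserveOriginal) {
  const result = { ...original };
  for (const [key, value] of Object.entries(updates)) {
    const existing = result[key];
    const hasExisting =
      existing !== undefined && existing !== null && existing !== "";
    if (preserveOriginal && hasExisting) continue;
    result[key] = value;
  }
  return result;
}
```

Under this reading, custom properties that do not exist yet are always added, and only conflicting fields are affected by the flag.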
How It Works
Input Processing
- Document Validation: Verifies PDF format and accessibility
- Security Check: Handles password-protected documents
- Page Analysis: Analyzes document structure and content
- Operation Selection: Applies the configured processing mode
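A cheap pre-check in the spirit of the Document Validation step is to confirm that the input starts with the `%PDF-` signature that PDF files begin with. The helper name `looksLikePdf` is hypothetical, and this is a strict sketch that does not reflect how the block itself validates input: it says nothing about whether the rest of the file is intact.

```javascript
// Hypothetical pre-check: does the byte array start with the "%PDF-" signature?
// Works with a Node.js Buffer or any Uint8Array.
function looksLikePdf(bytes) {
  const signature = "%PDF-";
  if (!bytes || bytes.length < signature.length) return false;
  for (let i = 0; i < signature.length; i++) {
    if (bytes[i] !== signature.charCodeAt(i)) return false;
  }
  return true;
}
```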
Text Extraction Process
- Page Parsing: Extracts text from PDF pages
- Layout Analysis: Preserves document structure and formatting
- Text Cleaning: Removes artifacts and normalizes text
- Coordinate Mapping: Maps text to page positions (if enabled)
- Output Formatting: Structures text according to selected format
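The final Output Formatting step might look something like the sketch below. The fragment shape `{ page, text, x, y }` and the function name `formatExtraction` are assumptions made for illustration; the block's actual output structure may differ.

```javascript
// Hypothetical formatter for extracted text fragments.
// Assumed fragment shape: { page, text, x, y }.
function formatExtraction(fragments, textFormat, includeCoordinates) {
  if (textFormat === "plain") {
    // Plain text: one fragment per line, positions discarded.
    return fragments.map((f) => f.text).join("\n");
  }
  if (textFormat === "json") {
    const items = fragments.map((f) =>
      includeCoordinates
        ? { page: f.page, text: f.text, x: f.x, y: f.y }
        : { page: f.page, text: f.text }
    );
    return JSON.stringify(items);
  }
  throw new Error("Unsupported text format: " + textFormat);
}
```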
Page Manipulation Process
- Document Loading: Loads source PDF documents
- Page Selection: Identifies pages to process
- Operation Execution: Performs the specified page operation
- Document Assembly: Creates new document structure
- Output Generation: Produces processed PDF or extracted content
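For merges, the Document Assembly step needs a stable input order. The sketch below sorts file names with a numeric-aware comparison. `orderForMerge` is a hypothetical helper, and whether the block orders names this way is an assumption.

```javascript
// Hypothetical helper: decide merge order from file names.
// Numeric-aware comparison keeps "part2.pdf" ahead of "part10.pdf".
function orderForMerge(filenames) {
  const collator = new Intl.Collator("en", {
    numeric: true,
    sensitivity: "base",
  });
  // Copy before sorting so the caller's array is left untouched.
  return [...filenames].sort(collator.compare);
}
```

A plain string sort would place "part10.pdf" before "part2.pdf", which is rarely what a merged document should look like.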
Use Cases
Document Text Extraction
Extract text from PDF documents for further processing:
PDF Processor (Text Extraction) → Text Processor → LLM Query
Document Splitting
Split large PDFs into smaller documents:
PDF Processor (Split Pages) → Multiple PDF outputs → Storage
Document Merging
Combine multiple PDFs into a single document:
Multiple PDF inputs → PDF Processor (Merge) → Single PDF output
Metadata Processing
Extract and process document metadata:
PDF Processor (Read Metadata) → Change (Process metadata) → Storage
Document Conversion
Convert PDFs to other formats:
PDF Processor (Convert) → Text/Image output → Further processing
Configuration Examples
Basic Text Extraction
{
  "mode": "text_extraction",
  "page_range": "all",
  "text_format": "plain",
  "include_coordinates": false,
  "ocr_fallback": true
}
Page Splitting
{
  "mode": "page_operations",
  "operation": "split",
  "split_method": "by_page_count",
  "pages_per_document": 10
}
Metadata Extraction
{
  "mode": "metadata_operations",
  "operation": "read",
  "include_custom_properties": true,
  "output_format": "json"
}
Document Merging
{
  "mode": "page_operations",
  "operation": "merge",
  "merge_order": "by_filename",
  "preserve_bookmarks": true
}
Advanced Features
Batch Processing
Process multiple PDFs in a single operation:
// Process multiple documents.
// processPDF is not defined here; it stands in for your per-document operation.
var documents = msg.payload.documents || [];
var results = [];
for (var i = 0; i < documents.length; i++) {
  var result = processPDF(documents[i]);
  results.push(result);
}
msg.payload = {
  processed_documents: results,
  total_count: documents.length,
};
Custom Text Processing
Apply custom text processing during extraction:
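// Custom text cleaning
var extractedText = msg.payload.text;
var cleanedText = extractedText
  // Keep only word characters, whitespace, and basic punctuation.
  // Note: \w is ASCII-only here, so accented letters are removed as well.
  .replace(/[^\w\s.,!?]/g, "")
  // Collapse whitespace last so removed characters do not leave double spaces.
  .replace(/\s+/g, " ")
  .trim();
msg.payload.text = cleanedText;
Error Handling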
Handle various PDF processing errors:
// Error handling
if (msg.error) {
  switch (msg.error.code) {
    case "PASSWORD_REQUIRED":
      msg.payload = { error: "Password required", action: "request_password" };
      break;
    case "CORRUPTED_PDF":
      msg.payload = { error: "Corrupted PDF", action: "skip_document" };
      break;
    default:
      msg.payload = { error: "Processing failed", action: "retry" };
  }
}
Performance Considerations
Large Document Handling
- Memory Management: Process large PDFs in chunks
- Page Limits: Set reasonable page limits for processing
- Timeout Settings: Configure appropriate timeouts
- Resource Monitoring: Monitor memory and CPU usage
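Processing a large PDF in chunks can be as simple as generating a series of page range strings and running the block once per range. The sketch below assumes the "N-M" range form shown under Text Extraction Options; `chunkPageRanges` is a hypothetical helper, not part of the block.

```javascript
// Hypothetical helper: split a document into page range strings of fixed size.
function chunkPageRanges(totalPages, pagesPerChunk) {
  if (!Number.isInteger(pagesPerChunk) || pagesPerChunk < 1) {
    throw new Error("pagesPerChunk must be a positive integer");
  }
  const ranges = [];
  for (let start = 1; start <= totalPages; start += pagesPerChunk) {
    const end = Math.min(start + pagesPerChunk - 1, totalPages);
    // A single trailing page is written as "21" rather than "21-21".
    ranges.push(start === end ? String(start) : start + "-" + end);
  }
  return ranges;
}
```

For a 25-page document and a chunk size of 10, this yields "1-10", "11-20", and "21-25".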
Optimization Tips
- Use page range selection to process only needed pages
- Enable OCR fallback only when necessary
- Cache processed results for repeated operations
- Use appropriate text format for your use case
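Caching results for repeated operations could be sketched as below. The cache key (a document identifier plus the serialized configuration) and the names `createResultCache` and `getOrCompute` are assumptions for illustration; the block may offer its own caching mechanism.

```javascript
// Hypothetical in-memory cache keyed by document id plus configuration.
// Note: JSON.stringify is sensitive to key order, so build configs consistently.
function createResultCache() {
  const store = new Map();
  return {
    getOrCompute(documentId, config, compute) {
      const key = documentId + "::" + JSON.stringify(config);
      if (!store.has(key)) {
        store.set(key, compute());
      }
      return store.get(key);
    },
    size() {
      return store.size;
    },
  };
}
```

The same document processed with a different configuration gets its own entry, so a plain-text extraction never masks a later JSON extraction.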
Common Issues and Solutions
Password-Protected PDFs
Issue: Cannot process password-protected documents
Solution: Configure password handling or use security operations mode
Corrupted PDFs
Issue: Processing fails on corrupted documents
Solution: Enable error handling and fallback options
Large File Processing
Issue: Memory issues with large PDFs
Solution: Use page range selection and batch processing
Text Extraction Quality
Issue: Poor text extraction results
Solution: Enable OCR fallback and coordinate mapping
Tips
- Test with Sample Documents: Verify processing with representative PDFs
- Use Appropriate Modes: Select the right processing mode for your needs
- Handle Errors Gracefully: Implement proper error handling
- Optimize for Performance: Use page ranges and batch processing
- Preserve Document Structure: Use structured text formats when needed
- Monitor Resource Usage: Keep track of memory and processing time
Related Blocks
- OCR - For scanned PDF text extraction
- Text Processor - For post-processing extracted text
- Storage Blocks - For saving processed documents
- Template Matcher - For document template matching