Document Classifier
Classify documents (images) with optional OCR data using various machine learning algorithms.
Document Classifier Block
The Document Classifier block is designed for classifying documents (images) with optional Optical Character Recognition (OCR) data. It uses various machine learning algorithms to automatically categorize documents based on their visual content and extracted text.
Overview
The Document Classifier combines computer vision and natural language processing to classify documents into predefined categories. It can analyze both the visual appearance of documents and their textual content to make accurate classification decisions.
Configuration Options
Algorithm Selection
Choose the classification algorithm from the Choose Algorithm dropdown:
- Deep Learning Classifier: Advanced neural network-based classification
- Traditional ML Classifier: Classical machine learning algorithms (Support Vector Machines, Random Forest) for document classification
- Hybrid Classifier: Combines visual and textual features
- Custom Classifier: User-defined classification models
Input Configuration
Document Input
- Property: msg.payload.document_path
- Type: string
- Description: Path to the document image file
- Supported formats: .png, .jpg, .jpeg, .pdf, .tiff
OCR Data (Optional)
- Property: msg.payload.ocr_data
- Type: object or string
- Description: Pre-extracted OCR text data
- Format: JSON object with text and confidence scores
Classification Categories
- Property: msg.payload.categories
- Type: array
- Description: List of possible document categories
- Example: ["invoice", "contract", "receipt", "letter"]
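The input properties above can be assembled and checked before a message reaches the block. The following is a minimal sketch in Python, assuming the message is built by an upstream script; the helper name build_classifier_payload is hypothetical and not part of the block itself. It only checks the file extension against the supported formats listed above and that the categories list is non-empty.

```python
from pathlib import Path

# Extensions taken from the "Supported formats" entry above.
SUPPORTED_EXTENSIONS = {".png", ".jpg", ".jpeg", ".pdf", ".tiff"}


def build_classifier_payload(document_path, categories, ocr_data=None):
    """Return a dict shaped like msg.payload (hypothetical helper).

    Raises ValueError when the input does not match the documented properties.
    """
    suffix = Path(document_path).suffix.lower()
    if suffix not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"unsupported document format: {suffix or '(none)'}")
    if not isinstance(categories, (list, tuple)) or not categories:
        raise ValueError("categories must be a non-empty list")
    if not all(isinstance(name, str) and name for name in categories):
        raise ValueError("every category must be a non-empty string")

    payload = {"document_path": document_path, "categories": list(categories)}
    if ocr_data is not None:
        # The documentation allows either an object or a string here.
        if not isinstance(ocr_data, (dict, str)):
            raise ValueError("ocr_data must be an object or a string")
        payload["ocr_data"] = ocr_data
    return payload


message = {
    "payload": build_classifier_payload(
        "documents/invoice_001.pdf",
        ["invoice", "contract", "receipt", "letter"],
    )
}
```

Validating early keeps malformed paths or empty category lists from surfacing later as classification errors.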
Processing Options
Confidence Threshold
- Type: number
- Range: 0.0 to 1.0
- Default: 0.7
- Description: Minimum confidence score required to accept a classification
Multi-label Classification
- Type: boolean
- Default: false
- Description: Allow documents to be classified into multiple categories
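How the two processing options interact can be illustrated with a short sketch. This is not the block's implementation: the function select_categories is hypothetical, and it assumes that a category is accepted when its score is greater than or equal to the threshold and that single-label mode keeps only the top-scoring accepted category. The documentation does not state what the block emits when no category reaches the threshold; here the result is simply an empty list.

```python
def select_categories(scores, threshold=0.7, multi_label=False):
    """Filter per-category confidence scores (hypothetical helper).

    scores: dict mapping category name to a confidence in [0.0, 1.0].
    Returns a list of {"category", "confidence"} dicts, best first.
    """
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0.0 and 1.0")

    # Rank categories from most to least confident.
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    accepted = [(name, score) for name, score in ranked if score >= threshold]

    if not multi_label:
        # Single-label mode: keep at most the best accepted category.
        accepted = accepted[:1]
    return [{"category": name, "confidence": score} for name, score in accepted]


scores = {"invoice": 0.92, "financial_document": 0.85, "letter": 0.10}
single = select_categories(scores)                    # one entry: invoice
multi = select_categories(scores, multi_label=True)   # invoice, financial_document
strict = select_categories(scores, threshold=0.95)    # nothing passes
```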
Use Cases
Document Management
Automatically categorize incoming documents:
Document upload → Document Classifier → Category assignment → File organization
Invoice Processing
Classify different types of financial documents:
Invoice image → Document Classifier → Invoice type → Processing workflow
Legal Document Processing
Categorize legal documents by type:
Legal document → Document Classifier → Document type → Legal workflow
Archive Organization
Organize historical documents:
Archive scan → Document Classifier → Category → Archive structure
Common Patterns
Basic Document Classification
// Configuration
// Algorithm: Deep Learning Classifier
// Confidence Threshold: 0.8
// Input message:
{
  "payload": {
    "document_path": "documents/invoice_001.pdf",
    "categories": ["invoice", "receipt", "contract", "letter"]
  }
}
// Example flow:
// inject → Document Classifier → debug (classification results)
Classification with OCR
// Configuration
// Algorithm: Hybrid Classifier
// Multi-label: true
// Input message:
{
  "payload": {
    "document_path": "scans/document.png",
    "ocr_data": {
      "text": "INVOICE #12345...",
      "confidence": 0.95
    },
    "categories": ["invoice", "receipt", "contract"]
  }
}
// Example flow:
// OCR → Document Classifier → debug (classification with OCR)
Batch Document Processing
// Configuration
// Algorithm: Traditional ML Classifier
// Confidence Threshold: 0.6
// Process multiple documents
// Example flow:
// batch of documents → Document Classifier → individual classifications
Advanced Features
Multi-label Classification
When enabled, documents can be assigned to multiple categories:
{
  "classification_results": {
    "primary_category": "invoice",
    "confidence": 0.92,
    "all_categories": [
      {
        "category": "invoice",
        "confidence": 0.92
      },
      {
        "category": "financial_document",
        "confidence": 0.85
      }
    ]
  }
}
Confidence Scoring
Detailed confidence analysis:
{
  "classification_results": {
    "predicted_category": "contract",
    "confidence": 0.89,
    "confidence_breakdown": {
      "visual_features": 0.85,
      "textual_features": 0.92,
      "combined_score": 0.89
    },
    "alternative_categories": [
      {
        "category": "agreement",
        "confidence": 0.76
      },
      {
        "category": "legal_document",
        "confidence": 0.68
      }
    ]
  }
}
Custom Model Integration
Use pre-trained or custom classification models:
// Configuration
// Algorithm: Custom Classifier
// Model Path: "models/custom_document_classifier.pkl"
// Input message:
{
  "payload": {
    "document_path": "documents/special_doc.pdf",
    "model_path": "models/custom_classifier.pkl",
    "categories": ["type_a", "type_b", "type_c"]
  }
}
Output Structure
Classification Results
{
  "document_path": "documents/invoice_001.pdf",
  "classification_results": {
    "predicted_category": "invoice",
    "confidence": 0.94,
    "processing_time": 1.2,
    "algorithm_used": "Deep Learning Classifier",
    "features_analyzed": {
      "visual_features": true,
      "textual_features": true,
      "layout_features": true
    }
  },
  "metadata": {
    "document_size": "2.3MB",
    "page_count": 1,
    "processing_timestamp": "2024-01-15T10:30:00Z"
  }
}
Multi-label Results
{
  "document_path": "documents/mixed_doc.pdf",
  "classification_results": {
    "primary_category": "invoice",
    "confidence": 0.89,
    "all_categories": [
      {
        "category": "invoice",
        "confidence": 0.89
      },
      {
        "category": "financial_document",
        "confidence": 0.82
      },
      {
        "category": "business_document",
        "confidence": 0.75
      }
    ],
    "multi_label_enabled": true
  }
}
Algorithm Details
Deep Learning Classifier
- Best for: Complex document types with visual patterns
- Features: Convolutional neural networks for image analysis
- Accuracy: High for well-defined document types
- Training: Requires labeled training data
Traditional ML Classifier
- Best for: Simple document types with clear features
- Features: Support Vector Machines, Random Forest
- Accuracy: Good for structured documents
- Training: Faster training, less data required
Hybrid Classifier
- Best for: Documents with both visual and textual content
- Features: Combines image and text analysis
- Accuracy: Highest for mixed-content documents
- Training: Requires both visual and textual training data
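The documentation does not say how the Hybrid Classifier combines its image and text analysis. One common approach is late fusion, where each modality produces its own per-category scores and the two are blended with a weight. The sketch below illustrates that idea only; the function fuse_scores and the text_weight parameter are assumptions made for illustration, not settings of the block.

```python
def fuse_scores(visual, textual, text_weight=0.5):
    """Blend per-category scores from two modalities (illustrative late fusion).

    visual, textual: dicts mapping category name to a score in [0.0, 1.0].
    text_weight: share given to the textual score; the rest goes to visual.
    A category missing from one modality is treated as scoring 0.0 there.
    """
    if not 0.0 <= text_weight <= 1.0:
        raise ValueError("text_weight must be between 0.0 and 1.0")
    categories = set(visual) | set(textual)
    return {
        name: (1.0 - text_weight) * visual.get(name, 0.0)
        + text_weight * textual.get(name, 0.0)
        for name in categories
    }


visual_scores = {"contract": 0.85, "agreement": 0.80}
textual_scores = {"contract": 0.92, "agreement": 0.72}

combined = fuse_scores(visual_scores, textual_scores)
best_category = max(combined, key=combined.get)
```

Raising text_weight favors documents where OCR text is reliable; lowering it favors layout-driven documents such as forms with little text.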
Tips for Best Results
Document Quality
- High resolution: Use clear, high-quality document images
- Good contrast: Ensure text and images are clearly visible
- Consistent format: Use consistent document layouts when possible
Training Data
- Diverse samples: Include various examples of each document type
- Balanced dataset: Ensure equal representation of all categories
- Quality labels: Use accurate, consistent category labels
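Balance in a labeled training set can be checked with a few lines of Python before training. This is a minimal sketch, assuming the labels are available as a plain list of category names; the helper names and the tolerance value are illustrative choices, not part of the block.

```python
from collections import Counter


def label_shares(labels):
    """Return each label's share of the dataset as a fraction of the total."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def underrepresented(labels, tolerance=0.5):
    """List labels whose share is below tolerance times an equal share.

    With 3 categories an equal share is 1/3, so tolerance=0.5 flags
    every category that makes up less than 1/6 of the data.
    """
    if not labels:
        return []
    shares = label_shares(labels)
    equal_share = 1 / len(shares)
    return sorted(
        label for label, share in shares.items() if share < tolerance * equal_share
    )


training_labels = ["invoice"] * 80 + ["receipt"] * 15 + ["contract"] * 5
flagged = underrepresented(training_labels)
```

Categories that are flagged are candidates for collecting more samples before retraining.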
Configuration
- Appropriate algorithm: Choose algorithm based on document complexity
- Confidence threshold: Adjust based on accuracy requirements
- Category definition: Use clear, distinct category names
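Choosing a confidence threshold is easier with a small labeled validation set. The sketch below assumes you have recorded, for each validation document, the confidence the classifier reported and whether the predicted category was correct. The function threshold_report is a hypothetical helper that shows, for each candidate threshold, how many documents would still be classified (coverage) and how accurate those classifications are.

```python
def threshold_report(predictions, thresholds=(0.5, 0.6, 0.7, 0.8, 0.9)):
    """Summarize coverage and accuracy per candidate threshold.

    predictions: list of (confidence, is_correct) pairs from a validation set.
    Accuracy is None when no prediction reaches the threshold.
    """
    rows = []
    for threshold in thresholds:
        accepted = [correct for conf, correct in predictions if conf >= threshold]
        coverage = len(accepted) / len(predictions) if predictions else 0.0
        accuracy = sum(accepted) / len(accepted) if accepted else None
        rows.append(
            {"threshold": threshold, "coverage": coverage, "accuracy": accuracy}
        )
    return rows


validation = [(0.95, True), (0.85, True), (0.75, False), (0.65, True), (0.55, False)]
report = threshold_report(validation)
for row in report:
    print(row)
```

A higher threshold usually trades coverage for accuracy, so the right value depends on how costly a wrong category is compared with an unclassified document.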
Performance Optimization
- Batch processing: Process multiple documents together when possible
- Model caching: Cache trained models for faster processing
- Resource management: Monitor memory usage for large documents
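Model caching can be sketched with the standard library. The example assumes the model is a pickle file, as the .pkl paths elsewhere in this document suggest; whether the block loads models this way is not documented, and load_model is a hypothetical helper. Only unpickle files from sources you trust, because loading a pickle can execute arbitrary code.

```python
import pickle
from functools import lru_cache


@lru_cache(maxsize=4)
def load_model(model_path):
    """Load a pickled model once and reuse it on later calls.

    lru_cache keys on model_path, so repeated calls with the same path
    return the same in-memory object instead of re-reading the file.
    """
    with open(model_path, "rb") as handle:
        return pickle.load(handle)


# Usage (path is illustrative):
#   model = load_model("models/custom_classifier.pkl")
#   model_again = load_model("models/custom_classifier.pkl")  # served from cache
# Call load_model.cache_clear() after replacing a model file on disk.
```

The small maxsize bounds memory use when several models are in rotation; the least recently used model is dropped first.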
Common Issues and Solutions
Low Classification Accuracy
- Issue: Poor classification results
- Solution: Improve document quality, retrain model with more data
Slow Processing
- Issue: Long processing times
- Solution: Use lighter algorithms, optimize document size
Memory Issues
- Issue: Out of memory errors
- Solution: Process documents in smaller batches, reduce image resolution
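Splitting a large set of documents into smaller batches can be done with a short generator. This is a minimal sketch, assuming the documents are identified by path and that each batch is sent to the classifier separately; the batched helper and the batch size are illustrative.

```python
from itertools import islice


def batched(paths, batch_size):
    """Yield successive lists of at most batch_size paths."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(paths)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


document_paths = ["a.pdf", "b.pdf", "c.pdf", "d.pdf", "e.pdf"]
batches = list(batched(document_paths, 2))
```

Because the generator consumes its input lazily, it also works with an iterator over a directory listing, so the full list of paths never has to be held in memory.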
Category Confusion
- Issue: Documents misclassified between similar categories
- Solution: Refine category definitions, add more training examples
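To see which categories are being confused, compare predicted categories against known labels on a validation set and count the mismatched pairs. The sketch below assumes both are available as parallel lists; most_confused_pairs is a hypothetical helper.

```python
from collections import Counter


def most_confused_pairs(true_labels, predicted_labels, top_n=3):
    """Return the most frequent (true, predicted) mismatches with their counts."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    errors = Counter(
        (truth, guess)
        for truth, guess in zip(true_labels, predicted_labels)
        if truth != guess
    )
    return errors.most_common(top_n)


true_labels = ["contract", "contract", "agreement", "invoice", "contract"]
predicted = ["agreement", "agreement", "contract", "invoice", "contract"]
confused = most_confused_pairs(true_labels, predicted)
```

Pairs that appear often in both directions, such as contract and agreement here, suggest the two categories overlap and may need clearer definitions or more distinguishing training examples.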
Related Blocks
- OCR - For text extraction before classification
- Image Processor - For document preprocessing
- PDF Processor - For PDF document handling
- NLP Classifier - For text-based classification
- Document Understander - For document analysis