
Document Classifier

Classify documents (images) with optional OCR data using various machine learning algorithms.

Document Classifier Block

The Document Classifier block classifies document images, optionally using Optical Character Recognition (OCR) data supplied with them. It offers several machine learning algorithms that categorize documents based on their visual content and extracted text.

Overview

The Document Classifier combines computer vision and natural language processing to assign documents to predefined categories. It can analyze both the visual appearance of a document and its textual content when choosing a category.

Configuration Options

Algorithm Selection

Choose the classification algorithm from the Choose Algorithm dropdown:

  • Deep Learning Classifier: Neural network-based classification using convolutional networks for image analysis
  • Traditional ML Classifier: Classical algorithms such as Support Vector Machines and Random Forest
  • Hybrid Classifier: Combines visual and textual features
  • Custom Classifier: User-supplied classification models

Input Configuration

Document Input

  • Property: msg.payload.document_path
  • Type: string
  • Description: Path to the document image file
  • Supported formats: .png, .jpg, .jpeg, .pdf, .tiff

OCR Data (Optional)

  • Property: msg.payload.ocr_data
  • Type: object or string
  • Description: Pre-extracted OCR text data
  • Format: JSON object with text and confidence scores

Classification Categories

  • Property: msg.payload.categories
  • Type: array
  • Description: List of possible document categories
  • Example: ["invoice", "contract", "receipt", "letter"]
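
The input properties above can be checked before a message reaches the block. The following is a minimal sketch, not part of the block itself: `validateClassifierInput` and `SUPPORTED_EXTENSIONS` are hypothetical names, and treating `categories` as required and file extensions as case-insensitive are assumptions.

```javascript
// Hypothetical helper: checks a message against the documented input properties.
const SUPPORTED_EXTENSIONS = [".png", ".jpg", ".jpeg", ".pdf", ".tiff"];

function validateClassifierInput(msg) {
  const errors = [];
  const payload = (msg && msg.payload) || {};

  // document_path: a string ending in one of the supported formats
  const documentPath = payload.document_path;
  if (typeof documentPath !== "string" || documentPath.length === 0) {
    errors.push("document_path must be a non-empty string");
  } else {
    const lower = documentPath.toLowerCase(); // assumption: case-insensitive match
    if (!SUPPORTED_EXTENSIONS.some((ext) => lower.endsWith(ext))) {
      errors.push("document_path has an unsupported file extension");
    }
  }

  // ocr_data is optional; when present it must be an object or a string
  const ocr = payload.ocr_data;
  if (ocr !== undefined && ocr !== null) {
    const isPlainObject = typeof ocr === "object" && !Array.isArray(ocr);
    if (!isPlainObject && typeof ocr !== "string") {
      errors.push("ocr_data must be an object or a string");
    }
  }

  // categories: array of strings (requiring at least one entry is an assumption)
  const categories = payload.categories;
  const isStringArray =
    Array.isArray(categories) && categories.every((c) => typeof c === "string");
  if (!isStringArray || categories.length === 0) {
    errors.push("categories must be a non-empty array of strings");
  }

  return { valid: errors.length === 0, errors };
}
```

The function returns every problem it finds rather than stopping at the first one, which makes it easier to report all issues with a message at once.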

Processing Options

Confidence Threshold

  • Type: number
  • Range: 0.0 to 1.0
  • Default: 0.7
  • Description: Minimum confidence score for classification

Multi-label Classification

  • Type: boolean
  • Default: false
  • Description: Allow documents to be classified into multiple categories
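
How these two options interact can be illustrated on a list of per-category scores. This is a sketch under stated assumptions: `selectCategories` is a hypothetical function, the block's internal logic is not documented here, and treating a score equal to the threshold as accepted is a guess.

```javascript
// Hypothetical post-processing: applies a confidence threshold and the
// multi-label flag to per-category scores of the form { category, confidence }.
function selectCategories(scores, { threshold = 0.7, multiLabel = false } = {}) {
  // Sort a copy so the highest confidence comes first and the input is untouched
  const ranked = [...scores].sort((a, b) => b.confidence - a.confidence);

  // Keep only categories that reach the threshold (">=" is an assumption)
  const accepted = ranked.filter((s) => s.confidence >= threshold);

  // Single-label mode keeps at most the top category
  return multiLabel ? accepted : accepted.slice(0, 1);
}
```

With multi-label disabled the result holds at most one entry; an empty array means no category reached the threshold.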

Use Cases

Document Management

Automatically categorize incoming documents:

Document upload → Document Classifier → Category assignment → File organization

Invoice Processing

Classify different types of financial documents:

Invoice image → Document Classifier → Invoice type → Processing workflow

Legal Documents

Categorize legal documents by type:

Legal document → Document Classifier → Document type → Legal workflow

Archive Organization

Organize historical documents:

Archive scan → Document Classifier → Category → Archive structure

Common Patterns

Basic Document Classification

// Configuration
// Algorithm: Deep Learning Classifier
// Confidence Threshold: 0.8

// Input message:
{
  "payload": {
    "document_path": "documents/invoice_001.pdf",
    "categories": ["invoice", "receipt", "contract", "letter"]
  }
}

// Example flow:
// inject → Document Classifier → debug (classification results)

Classification with OCR

// Configuration
// Algorithm: Hybrid Classifier
// Multi-label: true

// Input message:
{
  "payload": {
    "document_path": "scans/document.png",
    "ocr_data": {
      "text": "INVOICE #12345...",
      "confidence": 0.95
    },
    "categories": ["invoice", "receipt", "contract"]
  }
}

// Example flow:
// OCR → Document Classifier → debug (classification with OCR)

Batch Document Processing

// Configuration
// Algorithm: Traditional ML Classifier
// Confidence Threshold: 0.6

// Process multiple documents
// Example flow:
// batch of documents → Document Classifier → individual classifications
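
The page does not say how a batch is handed to the block. One reading of "individual classifications" is that each document travels in its own message; the sketch below builds such messages. `buildBatchMessages` is a hypothetical helper, not part of the block.

```javascript
// Hypothetical helper: builds one input message per document path,
// using the documented payload properties.
function buildBatchMessages(documentPaths, categories) {
  return documentPaths.map((documentPath) => ({
    payload: {
      document_path: documentPath,
      categories: [...categories], // copy so messages do not share one array
    },
  }));
}
```

Each returned object has the same shape as the input messages shown in the patterns above, so the messages can be sent to the block one at a time.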

Advanced Features

Multi-label Classification

When enabled, documents can be assigned to multiple categories:

{
  "classification_results": {
    "primary_category": "invoice",
    "confidence": 0.92,
    "all_categories": [
      {
        "category": "invoice",
        "confidence": 0.92
      },
      {
        "category": "financial_document",
        "confidence": 0.85
      }
    ]
  }
}

Confidence Scoring

The output can include a per-feature confidence breakdown and a list of alternative categories:

{
  "classification_results": {
    "predicted_category": "contract",
    "confidence": 0.89,
    "confidence_breakdown": {
      "visual_features": 0.85,
      "textual_features": 0.92,
      "combined_score": 0.89
    },
    "alternative_categories": [
      {
        "category": "agreement",
        "confidence": 0.76
      },
      {
        "category": "legal_document",
        "confidence": 0.68
      }
    ]
  }
}
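
A downstream step can use the alternative categories to spot uncertain results. The sketch below flags a result when the predicted category is not clearly ahead of the best alternative. `isAmbiguous` and its `minMargin` default of 0.1 are hypothetical choices, not values defined by the block.

```javascript
// Hypothetical check on a classification_results object like the one above:
// is the predicted category clearly ahead of the best alternative?
function isAmbiguous(results, minMargin = 0.1) {
  const alternatives = results.alternative_categories || [];
  if (alternatives.length === 0) return false; // nothing to compare against

  // Highest confidence among the alternatives
  const bestAlternative = Math.max(...alternatives.map((a) => a.confidence));

  // Ambiguous when the lead over the runner-up is smaller than the margin
  return results.confidence - bestAlternative < minMargin;
}
```

Results flagged this way could be sent to manual review instead of being filed automatically.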

Custom Model Integration

Use pre-trained or custom classification models:

// Configuration
// Algorithm: Custom Classifier
// Model Path: "models/custom_classifier.pkl"

// Input message:
{
  "payload": {
    "document_path": "documents/special_doc.pdf",
    "model_path": "models/custom_classifier.pkl",
    "categories": ["type_a", "type_b", "type_c"]
  }
}

Output Structure

Classification Results

{
  "document_path": "documents/invoice_001.pdf",
  "classification_results": {
    "predicted_category": "invoice",
    "confidence": 0.94,
    "processing_time": 1.2,
    "algorithm_used": "Deep Learning Classifier",
    "features_analyzed": {
      "visual_features": true,
      "textual_features": true,
      "layout_features": true
    }
  },
  "metadata": {
    "document_size": "2.3MB",
    "page_count": 1,
    "processing_timestamp": "2024-01-15T10:30:00Z"
  }
}
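
The category assignment and file organization steps from the Document Management use case could consume this output as sketched below. The folder names, the `FOLDER_BY_CATEGORY` table, and `targetFolder` are all illustrative assumptions. The sketch also assumes the output object is passed in exactly as shown above.

```javascript
// Hypothetical routing table; the folder names are illustrative only.
const FOLDER_BY_CATEGORY = new Map([
  ["invoice", "archive/invoices"],
  ["receipt", "archive/receipts"],
  ["contract", "archive/contracts"],
]);
const REVIEW_FOLDER = "review/manual";

// Picks a destination folder from a classification output object.
function targetFolder(output, threshold = 0.7) {
  const results = output.classification_results;

  // Missing or low-confidence results go to manual review
  if (!results || results.confidence < threshold) return REVIEW_FOLDER;

  // Unknown categories also fall back to manual review
  return FOLDER_BY_CATEGORY.get(results.predicted_category) || REVIEW_FOLDER;
}
```

Falling back to a review folder keeps unrecognized or uncertain documents from being filed silently under the wrong category.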

Multi-label Results

{
  "document_path": "documents/mixed_doc.pdf",
  "classification_results": {
    "primary_category": "invoice",
    "confidence": 0.89,
    "all_categories": [
      {
        "category": "invoice",
        "confidence": 0.89
      },
      {
        "category": "financial_document",
        "confidence": 0.82
      },
      {
        "category": "business_document",
        "confidence": 0.75
      }
    ],
    "multi_label_enabled": true
  }
}
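
The single-label and multi-label outputs use different field names (`predicted_category` versus `primary_category` and `all_categories`). A consumer that must handle both could normalize them into one list, as in this sketch. `listCategories` is a hypothetical helper based only on the field names shown in the examples on this page.

```javascript
// Hypothetical helper: returns category names from either output shape.
function listCategories(results) {
  // Multi-label output: one entry per assigned category
  if (Array.isArray(results.all_categories)) {
    return results.all_categories.map((entry) => entry.category);
  }

  // Single-label output: a lone predicted (or primary) category
  const single = results.predicted_category || results.primary_category;
  return single ? [single] : [];
}
```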

Algorithm Details

Deep Learning Classifier

  • Best for: Complex document types with visual patterns
  • Features: Convolutional neural networks for image analysis
  • Accuracy: High for well-defined document types
  • Training: Requires labeled training data

Traditional ML Classifier

  • Best for: Simple document types with clear features
  • Features: Support Vector Machines, Random Forest
  • Accuracy: Good for structured documents
  • Training: Faster training, less data required

Hybrid Classifier

  • Best for: Documents with both visual and textual content
  • Features: Combines image and text analysis
  • Accuracy: Highest for mixed-content documents
  • Training: Requires both visual and textual training data

Tips for Best Results

Document Quality

  • High resolution: Use clear, high-quality document images
  • Good contrast: Ensure text and images are clearly visible
  • Consistent format: Use consistent document layouts when possible

Training Data

  • Diverse samples: Include various examples of each document type
  • Balanced dataset: Ensure equal representation of all categories
  • Quality labels: Use accurate, consistent category labels
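
Dataset balance can be checked before training by counting labels. This is a minimal sketch: `countLabels` and `imbalanceRatio` are hypothetical helpers, and what ratio counts as acceptable is left to the reader.

```javascript
// Counts how many training samples carry each label.
function countLabels(labels) {
  const counts = new Map();
  for (const label of labels) {
    counts.set(label, (counts.get(label) || 0) + 1);
  }
  return counts;
}

// Ratio of the largest class to the smallest; 1 means perfectly balanced.
// Returns null when there are no labels at all.
function imbalanceRatio(labels) {
  const values = [...countLabels(labels).values()];
  if (values.length === 0) return null;
  return Math.max(...values) / Math.min(...values);
}
```

A ratio well above 1 suggests that some categories need more examples before the dataset can be considered balanced.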

Configuration

  • Appropriate algorithm: Choose algorithm based on document complexity
  • Confidence threshold: Adjust based on accuracy requirements
  • Category definition: Use clear, distinct category names

Performance Optimization

  • Batch processing: Process multiple documents together when possible
  • Model caching: Cache trained models for faster processing
  • Resource management: Monitor memory usage for large documents

Common Issues and Solutions

Low Classification Accuracy

  • Issue: Poor classification results
  • Solution: Improve document quality, retrain model with more data

Slow Processing

  • Issue: Long processing times
  • Solution: Use lighter algorithms, optimize document size

Memory Issues

  • Issue: Out of memory errors
  • Solution: Process documents in smaller batches, reduce image resolution
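
Splitting a long list of document paths into smaller batches can be sketched as follows. `chunkDocuments` is a hypothetical helper, and a suitable batch size depends on document size and available memory.

```javascript
// Hypothetical helper: splits a list of document paths into batches
// of at most `size` items each.
function chunkDocuments(documentPaths, size) {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError("size must be a positive integer");
  }
  const batches = [];
  for (let i = 0; i < documentPaths.length; i += size) {
    batches.push(documentPaths.slice(i, i + size));
  }
  return batches;
}
```

The last batch may be shorter than the others when the number of documents is not a multiple of the batch size.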

Category Confusion

  • Issue: Documents misclassified between similar categories
  • Solution: Refine category definitions, add more training examples