Classify documents (images) with optional OCR data using various machine learning algorithms.

Document Classifier Block

The Document Classifier block classifies document images, optionally using Optical Character Recognition (OCR) data supplied with the input. It uses one of several selectable machine learning algorithms to categorize each document based on its visual content and extracted text.

Overview

The Document Classifier combines computer vision and natural language processing to classify documents into predefined categories. It can analyze both the visual appearance of a document and its textual content when making a classification decision.

Configuration Options

Algorithm Selection

Choose the classification algorithm from the Choose Algorithm dropdown:

Deep Learning Classifier: Advanced neural network-based classification
Traditional ML Classifier: Machine learning algorithms for document classification
Hybrid Classifier: Combines visual and textual features
Custom Classifier: User-defined classification models

Input Configuration

Document Input

Property: msg.payload.document_path
Type: string
Description: Path to the document file (image or PDF)
Supported formats: .png, .jpg, .jpeg, .pdf, .tiff
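
A quick check of the file extension before the message reaches the block can catch unsupported inputs early. The following is a minimal sketch; the helper name hasSupportedExtension is hypothetical, and treating extensions as case-insensitive is an assumption, not documented behavior of the block.

```javascript
// Extensions copied from the "Supported formats" line above.
const SUPPORTED_EXTENSIONS = [".png", ".jpg", ".jpeg", ".pdf", ".tiff"];

// Hypothetical helper: true when the path ends in a supported extension.
// Case-insensitive matching is an assumption.
function hasSupportedExtension(documentPath) {
  if (typeof documentPath !== "string") return false;
  const dot = documentPath.lastIndexOf(".");
  if (dot === -1) return false; // no extension at all
  const extension = documentPath.slice(dot).toLowerCase();
  return SUPPORTED_EXTENSIONS.includes(extension);
}
```

In a flow, a function node could drop or reroute messages for which this check returns false.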

OCR Data (Optional)

Property: msg.payload.ocr_data
Type: object or string
Description: Pre-extracted OCR text data
Format: JSON object with text and confidence scores
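
Because ocr_data may be an object or a string, upstream code may want to bring it into a single shape first. The sketch below assumes that a string value is the raw extracted text; that interpretation, and the helper name normalizeOcrData, are assumptions rather than documented behavior.

```javascript
// Hypothetical helper: coerce ocr_data into an object with a text field.
// Assumption: a string value is the raw OCR text.
function normalizeOcrData(ocrData) {
  if (ocrData === undefined || ocrData === null) {
    return undefined; // OCR data is optional
  }
  if (typeof ocrData === "string") {
    return { text: ocrData };
  }
  if (typeof ocrData === "object" && typeof ocrData.text === "string") {
    return ocrData; // already in the documented object form
  }
  throw new TypeError("ocr_data must be a string or an object with a text field");
}
```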

Classification Categories

Property: msg.payload.categories
Type: array
Description: List of possible document categories
Example: ["invoice", "contract", "receipt", "letter"]
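
The three input properties above can be assembled into one message in a function node. This is a minimal sketch: buildClassifierMessage is a hypothetical helper, and rejecting an empty categories array is an assumption about what makes a sensible input.

```javascript
// Hypothetical helper: build the input message from the documented properties.
function buildClassifierMessage(documentPath, categories, ocrData) {
  // Assumption: at least one category is needed for a meaningful result.
  if (!Array.isArray(categories) || categories.length === 0) {
    throw new Error("categories must be a non-empty array");
  }
  const payload = { document_path: documentPath, categories: categories };
  if (ocrData !== undefined) {
    payload.ocr_data = ocrData; // optional property, only set when provided
  }
  return { payload: payload };
}
```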

Processing Options

Confidence Threshold

Type: number
Range: 0.0 to 1.0
Default: 0.7
Description: Minimum confidence score for classification
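
How the block treats predictions below the threshold is not described here, so a downstream node may want to apply the same check explicitly. The sketch below is one way to do that against the documented output structure; meetsThreshold is a hypothetical helper.

```javascript
const DEFAULT_THRESHOLD = 0.7; // documented default

// Hypothetical helper: true when the reported confidence reaches the threshold.
function meetsThreshold(result, threshold = DEFAULT_THRESHOLD) {
  if (threshold < 0 || threshold > 1) {
    throw new RangeError("threshold must be between 0.0 and 1.0");
  }
  return result.classification_results.confidence >= threshold;
}
```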

Multi-label Classification

Type: boolean
Default: false
Description: Allow documents to be classified into multiple categories

Use Cases

Document Management

Automatically categorize incoming documents:

Document upload → Document Classifier → Category assignment → File organization
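
The last two steps of this flow could be handled by a small lookup after the classifier. The following is only a sketch: the folder names and the targetFolder helper are hypothetical, and it reads predicted_category from the single-label output structure.

```javascript
// Hypothetical mapping from category to destination folder (illustrative names).
const FOLDER_BY_CATEGORY = {
  invoice: "finance/invoices",
  receipt: "finance/receipts",
  contract: "legal/contracts",
  letter: "correspondence",
};

// Hypothetical helper: choose a folder for a classification result,
// falling back to a catch-all folder for unmapped categories.
function targetFolder(result, fallback = "unsorted") {
  const category = result.classification_results.predicted_category;
  const known = Object.prototype.hasOwnProperty.call(FOLDER_BY_CATEGORY, category);
  return known ? FOLDER_BY_CATEGORY[category] : fallback;
}
```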

Invoice Processing

Classify different types of financial documents:

Invoice image → Document Classifier → Invoice type → Processing workflow

Legal Document Processing

Categorize legal documents by type:

Legal document → Document Classifier → Document type → Legal workflow

Archive Organization

Organize historical documents:

Archive scan → Document Classifier → Category → Archive structure

Common Patterns

Basic Document Classification

// Configuration
// Algorithm: Deep Learning Classifier
// Confidence Threshold: 0.8

// Input message:
{
  "payload": {
    "document_path": "documents/invoice_001.pdf",
    "categories": ["invoice", "receipt", "contract", "letter"]
  }
}

// Example flow:
// inject → Document Classifier → debug (classification results)

Classification with OCR

// Configuration
// Algorithm: Hybrid Classifier
// Multi-label: true

// Input message:
{
  "payload": {
    "document_path": "scans/document.png",
    "ocr_data": {
      "text": "INVOICE #12345...",
      "confidence": 0.95
    },
    "categories": ["invoice", "receipt", "contract"]
  }
}

// Example flow:
// OCR → Document Classifier → debug (classification with OCR)

Batch Document Processing

// Configuration
// Algorithm: Traditional ML Classifier
// Confidence Threshold: 0.6

// Process multiple documents
// Example flow:
// batch of documents → Document Classifier → individual classifications
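
Since document_path holds a single path, one way to feed a batch is to emit one message per document. This is a minimal sketch under that assumption; toBatchMessages is a hypothetical helper and the file names in the usage note are illustrative.

```javascript
// Hypothetical helper: turn a list of paths into one input message per document.
function toBatchMessages(documentPaths, categories) {
  return documentPaths.map((documentPath) => ({
    payload: {
      document_path: documentPath,
      categories: categories.slice(), // copy so messages do not share one array
    },
  }));
}
```

For example, toBatchMessages(["scans/a.png", "scans/b.png"], ["invoice", "receipt"]) yields two messages that can be sent to the block one after another.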

Advanced Features

Multi-label Classification

When enabled, documents can be assigned to multiple categories:

{
  "classification_results": {
    "primary_category": "invoice",
    "confidence": 0.92,
    "all_categories": [
      {
        "category": "invoice",
        "confidence": 0.92
      },
      {
        "category": "financial_document",
        "confidence": 0.85
      }
    ]
  }
}
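
A downstream node can reduce the all_categories list to the labels it wants to keep. The sketch below filters by a caller-chosen cutoff; labelsAbove is a hypothetical helper, not part of the block.

```javascript
// Hypothetical helper: names of all categories whose confidence reaches the cutoff.
function labelsAbove(result, cutoff) {
  return result.classification_results.all_categories
    .filter((entry) => entry.confidence >= cutoff)
    .map((entry) => entry.category);
}
```

With the result shown above, a cutoff of 0.9 keeps only "invoice", while a cutoff of 0.8 keeps both "invoice" and "financial_document".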

Confidence Scoring

The results can include a per-feature confidence breakdown and a list of alternative categories:

{
  "classification_results": {
    "predicted_category": "contract",
    "confidence": 0.89,
    "confidence_breakdown": {
      "visual_features": 0.85,
      "textual_features": 0.92,
      "combined_score": 0.89
    },
    "alternative_categories": [
      {
        "category": "agreement",
        "confidence": 0.76
      },
      {
        "category": "legal_document",
        "confidence": 0.68
      }
    ]
  }
}
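
The alternative_categories list makes it possible to flag close calls for manual review. The following sketch compares the top confidence with the best alternative; the isAmbiguous helper and the idea of a review margin are assumptions, not features of the block.

```javascript
// Hypothetical helper: true when the best alternative is within `margin`
// of the predicted category's confidence.
function isAmbiguous(result, margin) {
  const results = result.classification_results;
  const alternatives = results.alternative_categories || [];
  if (alternatives.length === 0) return false; // nothing to compare against
  const runnerUp = Math.max(...alternatives.map((entry) => entry.confidence));
  return results.confidence - runnerUp < margin;
}
```

For the result above, the gap between "contract" (0.89) and "agreement" (0.76) is about 0.13, so a margin of 0.1 does not flag it and a margin of 0.2 does.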

Custom Model Integration

Use pre-trained or custom classification models:

// Configuration
// Algorithm: Custom Classifier
// Model Path: "models/custom_classifier.pkl"

// Input message:
{
  "payload": {
    "document_path": "documents/special_doc.pdf",
    "model_path": "models/custom_classifier.pkl",
    "categories": ["type_a", "type_b", "type_c"]
  }
}

Output Structure

Classification Results

{
  "document_path": "documents/invoice_001.pdf",
  "classification_results": {
    "predicted_category": "invoice",
    "confidence": 0.94,
    "processing_time": 1.2,
    "algorithm_used": "Deep Learning Classifier",
    "features_analyzed": {
      "visual_features": true,
      "textual_features": true,
      "layout_features": true
    }
  },
  "metadata": {
    "document_size": "2.3MB",
    "page_count": 1,
    "processing_timestamp": "2024-01-15T10:30:00Z"
  }
}
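
For logging or a debug node, the result can be condensed into one line. This is a minimal sketch that reads only fields shown in the structure above; summarize is a hypothetical helper.

```javascript
// Hypothetical helper: one-line summary of a single-label result.
function summarize(result) {
  const results = result.classification_results;
  const percent = Math.round(results.confidence * 100);
  return `${result.document_path}: ${results.predicted_category} (${percent}%, ${results.algorithm_used})`;
}
```

Applied to the example result, this returns "documents/invoice_001.pdf: invoice (94%, Deep Learning Classifier)".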

Multi-label Results

{
  "document_path": "documents/mixed_doc.pdf",
  "classification_results": {
    "primary_category": "invoice",
    "confidence": 0.89,
    "all_categories": [
      {
        "category": "invoice",
        "confidence": 0.89
      },
      {
        "category": "financial_document",
        "confidence": 0.82
      },
      {
        "category": "business_document",
        "confidence": 0.75
      }
    ],
    "multi_label_enabled": true
  }
}
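
Note that the single-label examples report the top category as predicted_category, while the multi-label examples use primary_category. A consumer that must handle both can read whichever key is present; topCategory below is a hypothetical helper sketching that.

```javascript
// Hypothetical helper: top category from either result shape.
function topCategory(result) {
  const results = result.classification_results;
  // Single-label results use predicted_category; multi-label use primary_category.
  return results.predicted_category ?? results.primary_category;
}
```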