Document Classifier Block

The Document Classifier block uses AI to assign each document to one of a set of predefined types. It analyzes document content, layout, and visual features to determine the type, so that later steps in an automated document processing workflow can act on it.

Overview

The Document Classifier uses machine learning models to identify document types such as invoices, contracts, receipts, forms, and reports. It can process both text content and visual document features, and it returns each classification with a confidence score.

Configuration Options

Input Source

Document Source: Select the message payload, a file upload, or a specific message property
Input Type:
- Document file (PDF, images)
- Extracted text
- Combined (text + visual features)

Classification Settings

Model Type: Choose the classification model
- Standard Document Types
- Custom Trained Model
- Industry-specific Models
Confidence Threshold: Minimum confidence for classification (0.0 - 1.0)
Return Top N: Number of top predictions to return
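
The interaction between Confidence Threshold and Return Top N can be sketched as follows. This is a minimal illustration, not the block's actual implementation: the function name selectPredictions and the option names confidenceThreshold and topN are hypothetical, and the order of operations (filter first, then rank and truncate) is an assumption.

```javascript
// Hypothetical helper: illustrates one plausible way the two settings combine.
// The block's real internals are not documented here.
function selectPredictions(predictions, { confidenceThreshold = 0.5, topN = 3 } = {}) {
    return predictions
        .filter((p) => p.confidence >= confidenceThreshold) // drop weak candidates
        .sort((a, b) => b.confidence - a.confidence)        // rank, highest first
        .slice(0, topN);                                    // keep at most topN
}

const candidates = [
    { type: "receipt", confidence: 0.78 },
    { type: "contract", confidence: 0.23 },
    { type: "invoice", confidence: 0.94 }
];

const selected = selectPredictions(candidates, { confidenceThreshold: 0.5, topN: 2 });
console.log(selected.map((p) => p.type));
```

Under these assumptions a high threshold can leave fewer than N predictions, or none at all, so downstream steps should handle an empty list.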

Document Categories

Common document types supported:

Financial: Invoices, receipts, bank statements
Legal: Contracts, agreements, legal documents
Forms: Applications, surveys, questionnaires
Reports: Business reports, analysis documents
Correspondence: Letters, emails, memos
Identification: IDs, passports, licenses

Output Options

Single Prediction: Return only the top prediction
Multiple Predictions: Return ranked list of predictions
Include Confidence: Include confidence scores
Include Features: Return extracted features used for classification

Input Message Format

Document File Input

{
    payload: documentBuffer,   // Buffer holding the raw file bytes
    filename: "document.pdf",
    mimetype: "application/pdf"
}
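
A message in this shape could be assembled in Node.js roughly as shown below. The helper buildDocumentMessage and the MIME_TYPES table are hypothetical. The list of image extensions is an assumption, since this page only says "PDF, images".

```javascript
const path = require("path");

// Assumed mapping: adjust to the formats your deployment actually accepts.
const MIME_TYPES = {
    ".pdf": "application/pdf",
    ".png": "image/png",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".tif": "image/tiff",
    ".tiff": "image/tiff"
};

// Hypothetical helper: wraps file bytes in the Document File Input shape.
function buildDocumentMessage(buffer, filename) {
    const ext = path.extname(filename).toLowerCase();
    const mimetype = MIME_TYPES[ext];
    if (!mimetype) {
        throw new Error(`Unsupported file extension: ${ext || "(none)"}`);
    }
    return { payload: buffer, filename, mimetype };
}

// In practice the buffer would come from fs.readFileSync or an upload handler.
const fileMessage = buildDocumentMessage(Buffer.from("%PDF-1.7"), "document.pdf");
console.log(fileMessage.mimetype);
```

Rejecting unknown extensions before the message reaches the classifier keeps "unsupported file format" errors close to the upload step.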

Text Input

{
    payload: {
        text: "Invoice from ABC Company...",
        metadata: {
            pages: 2,
            words: 350
        }
    }
}

Combined Input

{
    payload: {
        text: "Document text content",
        image: imageBuffer,    // Buffer holding the document image
        layout: layoutInfo     // layout information for the document
    }
}

Output Message Format

Single Prediction

{
    payload: {
        classification: {
            type: "invoice",
            confidence: 0.94,
            category: "financial"
        },
        input_info: {
            pages: 1,
            text_length: 245,
            has_tables: true
        }
    }
}

Multiple Predictions

{
    payload: {
        predictions: [
            {
                type: "invoice",
                confidence: 0.94,
                category: "financial"
            },
            {
                type: "receipt",
                confidence: 0.78,
                category: "financial"
            },
            {
                type: "contract",
                confidence: 0.23,
                category: "legal"
            }
        ],
        top_prediction: "invoice",
        features: {
            has_company_header: true,
            has_line_items: true,
            has_totals: true,
            layout_type: "structured"
        }
    }
}
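
Because the two output shapes differ, downstream logic may want a single accessor that works with either. The sketch below uses only the fields shown in the examples above. The function name topPrediction is hypothetical.

```javascript
// Hypothetical helper: returns the best prediction from either output shape,
// or null when the message carries no classification at all.
function topPrediction(msg) {
    const payload = (msg && msg.payload) || {};
    if (payload.classification) {
        return payload.classification; // Single Prediction shape
    }
    if (Array.isArray(payload.predictions) && payload.predictions.length > 0) {
        // Multiple Predictions shape: pick the highest confidence explicitly
        // instead of relying on the list order.
        return payload.predictions.reduce((best, p) =>
            p.confidence > best.confidence ? p : best
        );
    }
    return null;
}

const single = { payload: { classification: { type: "invoice", confidence: 0.94, category: "financial" } } };
const multiple = {
    payload: {
        predictions: [
            { type: "receipt", confidence: 0.78, category: "financial" },
            { type: "invoice", confidence: 0.94, category: "financial" }
        ]
    }
};

console.log(topPrediction(single).type, topPrediction(multiple).type);
```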

Document Types

Financial Documents

Invoices: Bills and billing documents
Receipts: Purchase receipts and proof of payment
Bank Statements: Financial account statements
Purchase Orders: Procurement documents
Credit Notes: Credit and refund documents

Legal Documents

Contracts: Legal agreements and contracts
Terms of Service: Service agreements
Privacy Policies: Data protection documents
Legal Notices: Official legal communications

Business Forms

Applications: Various application forms
Surveys: Questionnaires and feedback forms
Registration Forms: Sign-up and registration documents
Compliance Forms: Regulatory and compliance documents

Reports & Analysis

Business Reports: Analytical reports
Financial Reports: Financial analysis documents
Research Papers: Academic and research documents
Presentations: Slide presentations and summaries

Common Use Cases

Automated Document Routing

Route different document types to appropriate processing workflows:

Document Input → Document Classifier → Switch Node → Type-specific Processing

Invoice Processing System

Identify invoices for specialized processing:

File Upload → Document Classifier → Invoice Processor → Entity Extraction

Multi-type Document Processing

Handle mixed document batches:

Batch Upload → Document Classifier → Route by Type → Process Accordingly

Document Validation

Verify document types match expected categories:

Expected Type → Document Classifier → Type Validation → Process or Reject
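
The Type Validation step could look like the sketch below. It assumes the Single Prediction output shape. The function name validateDocumentType, the reason strings, and the default minimum confidence of 0.8 are illustrative choices and not part of the block.

```javascript
// Hypothetical validation step: accept a document only when the classifier
// agrees with the expected type and is confident enough.
function validateDocumentType(msg, expectedType, minConfidence = 0.8) {
    const { type, confidence } = msg.payload.classification;
    if (type !== expectedType) {
        return { accepted: false, reason: "type_mismatch", expected: expectedType, actual: type };
    }
    if (confidence < minConfidence) {
        return { accepted: false, reason: "low_confidence", expected: expectedType, actual: type };
    }
    return { accepted: true, reason: null, expected: expectedType, actual: type };
}

const classified = { payload: { classification: { type: "invoice", confidence: 0.94, category: "financial" } } };

console.log(validateDocumentType(classified, "invoice"));  // accepted
console.log(validateDocumentType(classified, "contract")); // rejected: type_mismatch
```

Returning a reason instead of a bare boolean lets the reject branch tell a wrong document apart from an uncertain one.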

Best Practices

Quality Input: Provide clear, well-scanned documents for better accuracy
Confidence Thresholds: Tune the confidence threshold to your use case, raising it where a misrouted document is costly
Fallback Handling: Send uncertain classifications to a fallback path, such as human review, instead of failing
Type-specific Routing: Use classification results to route to specialized processors
Model Selection: Choose appropriate models for your document domain

Integration Patterns

With Document Processing Pipeline

Document → Classifier → OCR (if needed) → Entity Extractor → Data Validation

With Conditional Processing

Document → Classifier → Switch → [Invoice Flow | Contract Flow | Form Flow]

With Quality Assurance

Document → Classifier → Confidence Check → Human Review (if low confidence)

With Batch Processing

Document Batch → Array Loop → Classifier → Group by Type → Process Batches
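
The Group by Type step in this pattern can be sketched as a small grouping function. It assumes each classified message uses the Single Prediction output shape. The name groupByType is hypothetical.

```javascript
// Hypothetical helper: collects classified messages into one list per type.
function groupByType(messages) {
    const groups = new Map();
    for (const msg of messages) {
        const type = msg.payload.classification.type;
        if (!groups.has(type)) {
            groups.set(type, []);
        }
        groups.get(type).push(msg);
    }
    return groups;
}

const batch = [
    { filename: "a.pdf", payload: { classification: { type: "invoice", confidence: 0.94 } } },
    { filename: "b.pdf", payload: { classification: { type: "receipt", confidence: 0.81 } } },
    { filename: "c.pdf", payload: { classification: { type: "invoice", confidence: 0.88 } } }
];

const grouped = groupByType(batch);
for (const [type, items] of grouped) {
    console.log(type, items.map((m) => m.filename));
}
```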

Confidence Score Interpretation

High Confidence (0.8 - 1.0)

Very reliable classification
Proceed with automated processing
Document clearly matches known patterns

Medium Confidence (0.5 - 0.8)

Reasonable classification
Consider additional validation
May benefit from human review

Low Confidence (0.0 - 0.5)

Uncertain classification
Recommend human review
Document may be edge case or new type
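
These bands can be expressed as a small function. The ranges above share their boundary values, so this sketch follows the Quality-based Processing flow further down and treats only scores above 0.8 as high and only scores below 0.5 as low. The function name confidenceBand is hypothetical.

```javascript
// Hypothetical helper: maps a confidence score to the bands described above.
// Boundary handling (0.8 and 0.5 count as "medium") is an assumption.
function confidenceBand(confidence) {
    if (typeof confidence !== "number" || !(confidence >= 0 && confidence <= 1)) {
        throw new RangeError(`confidence must be a number in [0, 1], got ${confidence}`);
    }
    if (confidence > 0.8) return "high";
    if (confidence >= 0.5) return "medium";
    return "low";
}

console.log(confidenceBand(0.94)); // "high"
console.log(confidenceBand(0.78)); // "medium"
console.log(confidenceBand(0.23)); // "low"
```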

Error Handling

Common Issues

Unrecognized Document Types

Document type not in training data
Poor quality or corrupted document
Mixed or composite document types

Low Confidence Scores

Unusual document layout
Poor image quality
Ambiguous document content

Processing Errors

Unsupported file format
Corrupted or encrypted documents
Oversized documents

Solutions

// Example message for a failed classification
{
    payload: {
        error: "classification_failed",
        message: "Document type could not be determined",
        confidence: 0.12,
        suggestions: ["manual_review", "image_enhancement"]
    }
}
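
A downstream step might act on such a message as sketched below. The field names error and suggestions come from the example above. The function name nextStep, the returned action names, and the alreadyEnhanced flag are hypothetical.

```javascript
// Hypothetical decision step: chooses what to do after the classifier runs.
// alreadyEnhanced guards against retrying image enhancement forever.
function nextStep(msg, alreadyEnhanced = false) {
    const payload = (msg && msg.payload) || {};
    if (!payload.error) {
        return "continue";
    }
    const suggestions = Array.isArray(payload.suggestions) ? payload.suggestions : [];
    if (suggestions.includes("image_enhancement") && !alreadyEnhanced) {
        return "enhance_and_retry";
    }
    return "manual_review";
}

const failed = {
    payload: {
        error: "classification_failed",
        message: "Document type could not be determined",
        confidence: 0.12,
        suggestions: ["manual_review", "image_enhancement"]
    }
};

console.log(nextStep(failed));       // first attempt
console.log(nextStep(failed, true)); // after one enhancement pass
```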

Flow Examples

Smart Document Router

Upload → Document Classifier → Switch:
                               ├─ invoice → Invoice Processor
                               ├─ contract → Contract Analyzer
                               ├─ receipt → Receipt Processor
                               └─ unknown → Manual Review Queue
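
The switch logic in this flow can be sketched as a lookup with a fallback. The destination names are taken from the diagram. The names ROUTES, FALLBACK_ROUTE, and routeFor are hypothetical, and the sketch assumes the Single Prediction output shape.

```javascript
// Destinations copied from the diagram above.
const ROUTES = new Map([
    ["invoice", "Invoice Processor"],
    ["contract", "Contract Analyzer"],
    ["receipt", "Receipt Processor"]
]);
const FALLBACK_ROUTE = "Manual Review Queue";

// Hypothetical helper: returns the destination for a classified message.
// Missing or unrecognized types fall through to manual review.
function routeFor(msg) {
    const classification = msg && msg.payload && msg.payload.classification;
    const type = classification ? classification.type : undefined;
    return ROUTES.get(type) || FALLBACK_ROUTE;
}

console.log(routeFor({ payload: { classification: { type: "contract", confidence: 0.91 } } }));
console.log(routeFor({ payload: { classification: { type: "memo", confidence: 0.62 } } }));
```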

Quality-based Processing

Document → Classifier → Confidence Check:
                        ├─ High (>0.8) → Auto Process
                        ├─ Medium (0.5-0.8) → Quick Review
                        └─ Low (<0.5) → Full Review

Multi-step Validation

Document → Classifier → Expected Type Check → Process if Match
                                           → Flag if Mismatch

Performance Considerations

Document Size: Larger documents take longer to classify
Model Complexity: More sophisticated models provide better accuracy but process documents more slowly
Batch Size: Process documents individually for real-time needs
Caching: Cache results for documents that are processed repeatedly
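
One way to read the caching advice is to key results on document content, so that an identical file is classified only once. The sketch below takes that approach using Node's built-in crypto module. The wrapper createCachedClassifier and the classify function it receives are hypothetical stand-ins for however the classifier is invoked in your flow.

```javascript
const crypto = require("crypto");

// Hypothetical wrapper: memoizes classifier results by SHA-256 of the bytes.
// The cache is unbounded here; a real flow would likely cap its size or
// expire entries.
function createCachedClassifier(classify) {
    const cache = new Map();
    return function classifyWithCache(buffer) {
        const key = crypto.createHash("sha256").update(buffer).digest("hex");
        if (!cache.has(key)) {
            cache.set(key, classify(buffer));
        }
        return cache.get(key);
    };
}

// Stub classifier used only to demonstrate the cache.
let classifierCalls = 0;
const stubClassify = () => {
    classifierCalls += 1;
    return { type: "invoice", confidence: 0.94 };
};

const cachedClassify = createCachedClassifier(stubClassify);
cachedClassify(Buffer.from("same document"));
cachedClassify(Buffer.from("same document"));
cachedClassify(Buffer.from("different document"));
console.log(classifierCalls);
```

If classify returns a promise, the promise itself is cached, which also deduplicates concurrent requests for the same document. A rejected promise would stay cached too, so a production version should evict failures.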

Tips

Start with broad categories and refine based on results
Use debug blocks to examine classification features
Monitor confidence distributions to tune thresholds
Combine with other blocks for comprehensive document understanding
Consider custom training for domain-specific document types
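
To monitor confidence distributions, the scores collected from classifier output can be bucketed into a simple histogram. This is a minimal sketch. The function name confidenceHistogram and the default of ten equal-width bins are illustrative choices.

```javascript
// Hypothetical helper: counts confidence scores in equal-width bins over [0, 1].
// Values outside that range (or non-numeric values) are skipped.
function confidenceHistogram(scores, binCount = 10) {
    const bins = new Array(binCount).fill(0);
    for (const score of scores) {
        if (typeof score !== "number" || !(score >= 0 && score <= 1)) {
            continue;
        }
        // A score of exactly 1.0 belongs in the last bin.
        const index = Math.min(Math.floor(score * binCount), binCount - 1);
        bins[index] += 1;
    }
    return bins;
}

const observed = [0.94, 0.78, 0.23, 0.12, 1.0];
const histogram = confidenceHistogram(observed);
histogram.forEach((count, i) => {
    console.log(`${(i / 10).toFixed(1)}-${((i + 1) / 10).toFixed(1)}: ${"#".repeat(count)}`);
});
```

A cluster of scores just below the current threshold suggests that lowering it slightly would move many documents from review into automated processing, at the cost of more misclassifications.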

Use the Document Classifier together with the OCR block for text extraction and the Entity Extractor block for data extraction workflows.
