Document Classifier

Automatically classify documents into predefined categories using AI

Document Classifier Block

The Document Classifier block automatically categorizes documents into predefined types using AI. It analyzes document content, layout, and visual features to determine the document type, which makes it a natural first step in automated document processing workflows.

Overview

The Document Classifier uses machine learning models to identify document types such as invoices, contracts, receipts, forms, reports, and more. It can process both text content and visual document features to provide accurate classification with confidence scores.

Configuration Options

Input Source

  • Document Source: Select the message payload, a file upload, or a specific message property
  • Input Type:
    • Document file (PDF, images)
    • Extracted text
    • Combined (text + visual features)

Classification Settings

  • Model Type: Choose classification model
    • Standard Document Types
    • Custom Trained Model
    • Industry-specific Models
  • Confidence Threshold: Minimum confidence for classification (0.0 - 1.0)
  • Return Top N: Number of top predictions to return
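
How the Confidence Threshold and Return Top N settings might interact can be sketched as follows. This is a minimal illustration, not the block's actual implementation: the function name `applySettings` and the option names `confidenceThreshold` and `topN` are assumptions, and the prediction objects mirror the output format documented on this page.

```javascript
// Hypothetical helper: drop predictions below a minimum confidence,
// then keep the N highest-scoring ones.
function applySettings(predictions, { confidenceThreshold = 0.5, topN = 1 } = {}) {
    return predictions
        .filter((p) => p.confidence >= confidenceThreshold) // filter() returns a new array
        .sort((a, b) => b.confidence - a.confidence)        // highest confidence first
        .slice(0, topN);
}

const raw = [
    { type: "receipt", confidence: 0.78, category: "financial" },
    { type: "invoice", confidence: 0.94, category: "financial" },
    { type: "contract", confidence: 0.23, category: "legal" }
];

// Keeps "invoice" and "receipt"; "contract" falls below the threshold.
console.log(applySettings(raw, { confidenceThreshold: 0.5, topN: 2 }));
```

A threshold that is set too high can leave the list empty, so downstream steps should be prepared for zero predictions.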

Document Categories

Common document types supported:

  • Financial: Invoices, receipts, bank statements
  • Legal: Contracts, agreements, legal documents
  • Forms: Applications, surveys, questionnaires
  • Reports: Business reports, analysis documents
  • Correspondence: Letters, emails, memos
  • Identification: IDs, passports, licenses

Output Options

  • Single Prediction: Return only the top prediction
  • Multiple Predictions: Return ranked list of predictions
  • Include Confidence: Include confidence scores
  • Include Features: Return extracted features used for classification

Input Message Format

Document File Input

{
    payload: /* document buffer */,
    filename: "document.pdf",
    mimetype: "application/pdf"
}
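
A message in this shape could be assembled along the following lines. This is only a sketch: `buildFileMessage` and the extension table are hypothetical, and the exact set of file formats the block accepts is not specified here beyond "PDF, images".

```javascript
// Assumed mapping from file extension to MIME type (a few common formats only).
const MIME_TYPES = new Map([
    ["pdf", "application/pdf"],
    ["png", "image/png"],
    ["jpg", "image/jpeg"],
    ["jpeg", "image/jpeg"],
    ["tif", "image/tiff"],
    ["tiff", "image/tiff"]
]);

// Hypothetical helper: wrap a document buffer in the file-input message shape.
function buildFileMessage(buffer, filename) {
    const dot = filename.lastIndexOf(".");
    const ext = dot === -1 ? "" : filename.slice(dot + 1).toLowerCase();
    const mimetype = MIME_TYPES.get(ext);
    if (!mimetype) {
        throw new Error(`Unsupported file extension: "${ext}"`);
    }
    return { payload: buffer, filename, mimetype };
}

const fileMsg = buildFileMessage(Buffer.from("%PDF-1.7"), "document.pdf");
console.log(fileMsg.filename, fileMsg.mimetype);
```

Rejecting unknown extensions before the message enters the flow surfaces unsupported formats early instead of as a classification error.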

Text Input

{
    payload: {
        text: "Invoice from ABC Company...",
        metadata: {
            pages: 2,
            words: 350
        }
    }
}
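
If the text comes from an upstream extraction step, the metadata can be derived rather than typed by hand. The helper below is a hypothetical sketch: `buildTextMessage` is not part of the platform, and whether the `metadata` fields are required or optional is an assumption.

```javascript
// Hypothetical helper: build the text-input message and compute the
// word count from the text itself.
function buildTextMessage(text, pages = 1) {
    const trimmed = text.trim();
    // Split on runs of whitespace; an empty string counts as zero words.
    const words = trimmed === "" ? 0 : trimmed.split(/\s+/).length;
    return { payload: { text, metadata: { pages, words } } };
}

const textMsg = buildTextMessage("Invoice from ABC Company", 2);
console.log(textMsg.payload.metadata);
```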

Combined Input

{
    payload: {
        text: "Document text content",
        image: /* document image buffer */,
        layout: /* layout information */
    }
}

Output Message Format

Single Prediction

{
    payload: {
        classification: {
            type: "invoice",
            confidence: 0.94,
            category: "financial"
        },
        input_info: {
            pages: 1,
            text_length: 245,
            has_tables: true
        }
    }
}

Multiple Predictions

{
    payload: {
        predictions: [
            {
                type: "invoice",
                confidence: 0.94,
                category: "financial"
            },
            {
                type: "receipt",
                confidence: 0.78,
                category: "financial"
            },
            {
                type: "contract",
                confidence: 0.23,
                category: "legal"
            }
        ],
        top_prediction: "invoice",
        features: {
            has_company_header: true,
            has_line_items: true,
            has_totals: true,
            layout_type: "structured"
        }
    }
}
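
Because the two output shapes differ, downstream logic may find it convenient to normalize them first. The following sketch assumes the field names shown in the two examples above; `getPredictions` itself is a hypothetical helper.

```javascript
// Hypothetical helper: return a ranked list of predictions regardless of
// whether the block emitted a single prediction or several.
function getPredictions(msg) {
    const payload = (msg && msg.payload) || {};
    if (Array.isArray(payload.predictions)) {
        // Copy before sorting so the original message is left untouched.
        return [...payload.predictions].sort((a, b) => b.confidence - a.confidence);
    }
    if (payload.classification) {
        return [payload.classification];
    }
    return [];
}

const singleMsg = {
    payload: { classification: { type: "invoice", confidence: 0.94, category: "financial" } }
};
const multiMsg = {
    payload: {
        predictions: [
            { type: "receipt", confidence: 0.78, category: "financial" },
            { type: "invoice", confidence: 0.94, category: "financial" }
        ],
        top_prediction: "invoice"
    }
};

console.log(getPredictions(singleMsg)[0].type);
console.log(getPredictions(multiMsg).map((p) => p.type));
```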

Document Types

Financial Documents

  • Invoices: Bills and billing documents
  • Receipts: Purchase receipts and proof of payment
  • Bank Statements: Financial account statements
  • Purchase Orders: Procurement documents
  • Credit Notes: Credit and refund documents

Legal Documents

  • Contracts: Legal agreements and contracts
  • Terms of Service: Service agreements
  • Privacy Policies: Data protection documents
  • Legal Notices: Official legal communications

Business Forms

  • Applications: Various application forms
  • Surveys: Questionnaires and feedback forms
  • Registration Forms: Sign-up and registration documents
  • Compliance Forms: Regulatory and compliance documents

Reports & Analysis

  • Business Reports: Analytical reports
  • Financial Reports: Financial analysis documents
  • Research Papers: Academic and research documents
  • Presentations: Slide presentations and summaries

Common Use Cases

Automated Document Routing

Route different document types to appropriate processing workflows:

Document Input → Document Classifier → Switch Node → Type-specific Processing

Invoice Processing System

Identify invoices for specialized processing:

File Upload → Document Classifier → Invoice Processor → Entity Extraction

Multi-type Document Processing

Handle mixed document batches:

Batch Upload → Document Classifier → Route by Type → Process Accordingly

Document Validation

Verify document types match expected categories:

Expected Type → Document Classifier → Type Validation → Process or Reject
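
The Type Validation step in this flow could look roughly like the sketch below. `validateType`, its `minConfidence` default, and the `reason` strings are all hypothetical. The prediction fields follow the output format documented earlier, and the predictions list is assumed to be ranked with the best match first.

```javascript
// Hypothetical check: does the top prediction match the expected type
// with enough confidence to proceed?
function validateType(msg, expectedType, minConfidence = 0.8) {
    const payload = msg.payload || {};
    const top = payload.classification || (payload.predictions || [])[0];
    if (!top) {
        return { valid: false, reason: "no_prediction" };
    }
    if (top.type !== expectedType) {
        return { valid: false, reason: "type_mismatch", actual: top.type };
    }
    if (top.confidence < minConfidence) {
        return { valid: false, reason: "low_confidence", actual: top.type };
    }
    return { valid: true, actual: top.type };
}

const classified = {
    payload: { classification: { type: "invoice", confidence: 0.94, category: "financial" } }
};
console.log(validateType(classified, "invoice"));  // valid
console.log(validateType(classified, "contract")); // rejected: type_mismatch
```

Returning a reason rather than a bare boolean lets the reject branch distinguish a wrong document from an uncertain one.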

Best Practices

  1. Quality Input: Provide clear, well-scanned documents for better accuracy
  2. Confidence Thresholds: Set appropriate confidence levels for your use case
  3. Fallback Handling: Handle uncertain classifications gracefully
  4. Type-specific Routing: Use classification results to route to specialized processors
  5. Model Selection: Choose appropriate models for your document domain

Integration Patterns

With Document Processing Pipeline

Document → Classifier → OCR (if needed) → Entity Extractor → Data Validation

With Conditional Processing

Document → Classifier → Switch → [Invoice Flow | Contract Flow | Form Flow]

With Quality Assurance

Document → Classifier → Confidence Check → Human Review (if low confidence)

With Batch Processing

Document Batch → Array Loop → Classifier → Group by Type → Process Batches
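
The Group by Type step can be illustrated with a small sketch. `groupByType` is a hypothetical helper, not a platform API, and it assumes each message carries the single-prediction output shown earlier.

```javascript
// Hypothetical helper: bucket classified messages by their predicted type.
function groupByType(messages) {
    const groups = new Map();
    for (const msg of messages) {
        const type = msg.payload.classification.type;
        if (!groups.has(type)) {
            groups.set(type, []);
        }
        groups.get(type).push(msg);
    }
    return groups;
}

const batch = [
    { filename: "a.pdf", payload: { classification: { type: "invoice", confidence: 0.94 } } },
    { filename: "b.pdf", payload: { classification: { type: "receipt", confidence: 0.88 } } },
    { filename: "c.pdf", payload: { classification: { type: "invoice", confidence: 0.91 } } }
];

for (const [type, msgs] of groupByType(batch)) {
    console.log(type, msgs.map((m) => m.filename));
}
```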

Confidence Score Interpretation

High Confidence (above 0.8)

  • Very reliable classification
  • Proceed with automated processing
  • Document clearly matches known patterns

Medium Confidence (0.5 to 0.8)

  • Reasonable classification
  • Consider additional validation
  • May benefit from human review

Low Confidence (below 0.5)

  • Uncertain classification
  • Recommend human review
  • Document may be edge case or new type
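
These bands translate directly into a small lookup. In the sketch below, `confidenceBand` is a hypothetical helper. Treating scores of exactly 0.5 and 0.8 as medium is an assumption, chosen to match the ">0.8" and "<0.5" comparisons in the Quality-based Processing flow.

```javascript
// Hypothetical helper: map a confidence score to a review band.
// Scores of exactly 0.5 and 0.8 are treated as "medium" (an assumption).
function confidenceBand(score) {
    if (typeof score !== "number" || Number.isNaN(score) || score < 0 || score > 1) {
        throw new RangeError(`Confidence must be a number between 0 and 1, got ${score}`);
    }
    if (score > 0.8) return "high";
    if (score >= 0.5) return "medium";
    return "low";
}

for (const score of [0.94, 0.78, 0.23]) {
    console.log(score, confidenceBand(score));
}
```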

Error Handling

Common Issues

Unrecognized Document Types

  • Document type not in training data
  • Poor quality or corrupted document
  • Mixed or composite document types

Low Confidence Scores

  • Unusual document layout
  • Poor image quality
  • Ambiguous document content

Processing Errors

  • Unsupported file format
  • Corrupted or encrypted documents
  • Oversized documents

Solutions

// Example error payload for a failed classification; downstream steps can branch on `error` and `suggestions`
{
    payload: {
        error: "classification_failed",
        message: "Document type could not be determined",
        confidence: 0.12,
        suggestions: ["manual_review", "image_enhancement"]
    }
}
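
A downstream step could turn such a payload into a concrete next action. The sketch below assumes the error fields shown in the example above. `handleResult`, the `status` values, and the choice of "manual_review" as the default action are hypothetical.

```javascript
// Hypothetical handler: decide what to do with a classifier result.
function handleResult(msg) {
    const payload = msg.payload || {};
    if (payload.error) {
        const suggestions = Array.isArray(payload.suggestions) ? payload.suggestions : [];
        return {
            status: "failed",
            // Follow the first suggestion if there is one, otherwise fall back to review.
            action: suggestions[0] || "manual_review",
            reason: payload.message || payload.error
        };
    }
    return { status: "ok", action: "continue" };
}

const failed = {
    payload: {
        error: "classification_failed",
        message: "Document type could not be determined",
        confidence: 0.12,
        suggestions: ["manual_review", "image_enhancement"]
    }
};
console.log(handleResult(failed));
```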

Flow Examples

Smart Document Router

Upload → Document Classifier → Switch:
                               ├─ invoice → Invoice Processor
                               ├─ contract → Contract Analyzer
                               ├─ receipt → Receipt Processor
                               └─ unknown → Manual Review Queue
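
The switch logic in this router can be approximated in code. The sketch below returns one slot per branch, with `null` for branches that do not fire, a convention borrowed from multi-output function nodes in flow-based tools. Whether this platform's Switch node works that way is an assumption, and `routeByType`, `ROUTES`, and the 0.5 default are hypothetical.

```javascript
// Output order for the known branches; the final slot is the "unknown" branch.
const ROUTES = ["invoice", "contract", "receipt"];

// Hypothetical router: place the message in the slot for its predicted type.
// Unrecognized types, low-confidence results, and error payloads go to "unknown".
function routeByType(msg, minConfidence = 0.5) {
    const outputs = new Array(ROUTES.length + 1).fill(null);
    const c = msg.payload && msg.payload.classification;
    const index = c && c.confidence >= minConfidence ? ROUTES.indexOf(c.type) : -1;
    outputs[index === -1 ? ROUTES.length : index] = msg;
    return outputs;
}

const contractMsg = {
    payload: { classification: { type: "contract", confidence: 0.91, category: "legal" } }
};
console.log(routeByType(contractMsg).map((slot) => slot !== null));
```

Sending anything that fails the confidence check to the unknown branch keeps the manual review queue as the single catch-all.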

Quality-based Processing

Document → Classifier → Confidence Check:
                        ├─ High (>0.8) → Auto Process
                        ├─ Medium (0.5-0.8) → Quick Review
                        └─ Low (<0.5) → Full Review

Multi-step Validation

Document → Classifier → Expected Type Check → Process if Match
                                           → Flag if Mismatch

Performance Considerations

  • Document Size: Larger documents take longer to classify
  • Model Complexity: More sophisticated models are generally more accurate but slower
  • Batch Size: Process documents individually for real-time needs
  • Caching: Cache results for frequently processed documents so identical inputs are not classified again
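
One way to read the caching advice is to memoize results by document content, so that byte-identical files are classified only once. The sketch below assumes a Node.js runtime. `createCachedClassifier` and the `classify` callback are hypothetical stand-ins for whatever actually invokes the block.

```javascript
const crypto = require("crypto");

// Hypothetical wrapper: cache classification results keyed by a SHA-256
// hash of the document bytes. If classify() returns a promise, the promise
// itself is cached, so a rejected promise would be cached too.
function createCachedClassifier(classify, maxEntries = 1000) {
    const cache = new Map();
    return function cachedClassify(buffer) {
        const key = crypto.createHash("sha256").update(buffer).digest("hex");
        if (!cache.has(key)) {
            if (cache.size >= maxEntries) {
                // Map preserves insertion order, so the first key is the oldest.
                cache.delete(cache.keys().next().value);
            }
            cache.set(key, classify(buffer));
        }
        return cache.get(key);
    };
}

// Stub classifier used only to demonstrate the wrapper.
let classifyCalls = 0;
const fakeClassify = () => {
    classifyCalls += 1;
    return { type: "invoice", confidence: 0.94 };
};

const cached = createCachedClassifier(fakeClassify);
cached(Buffer.from("same bytes"));
cached(Buffer.from("same bytes"));
console.log(classifyCalls); // 1: the second call was served from the cache
```

Eviction here is simple first-in, first-out; a production cache would likely also want expiry, particularly if the model can be retrained.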

Tips

  • Start with broad categories and refine based on results
  • Use debug blocks to examine classification features
  • Monitor confidence distributions to tune thresholds
  • Combine with other blocks for comprehensive document understanding
  • Consider custom training for domain-specific document types
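
Monitoring confidence distributions can be as simple as bucketing observed scores into a histogram and checking where they cluster. The sketch below is a minimal illustration. `confidenceHistogram` is a hypothetical helper, and collecting the scores (for example from logged classifier outputs) is left to the surrounding flow.

```javascript
// Hypothetical helper: count confidence scores in equal-width bins over [0, 1].
function confidenceHistogram(scores, binCount = 10) {
    const bins = new Array(binCount).fill(0);
    for (const s of scores) {
        // Ignore anything that is not a number within [0, 1].
        if (typeof s !== "number" || !(s >= 0 && s <= 1)) continue;
        // Clamp so that a score of exactly 1.0 lands in the last bin.
        const index = Math.min(Math.floor(s * binCount), binCount - 1);
        bins[index] += 1;
    }
    return bins;
}

const observed = [0.94, 0.78, 0.23, 0.12, 1.0];
const histogram = confidenceHistogram(observed, 5);
histogram.forEach((count, i) => {
    const lower = (i / 5).toFixed(1);
    const upper = ((i + 1) / 5).toFixed(1);
    console.log(`${lower}-${upper}: ${"#".repeat(count)}`);
});
```

A cluster of scores just below the current threshold is a hint that the threshold, or the model, deserves a second look.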

Combine the Document Classifier with the OCR block for text extraction and the Entity Extractor block for data extraction.