Document Question Answering

This block extracts answers to specific questions from document images. Depending on the selected mode, it returns answers with or without their location coordinates (bounding boxes) in the document.

Quick Start

To get started:

  • Choose a model option from the dropdown (output shape depends on the option)
  • Choose a trained model from the Model to use dropdown
  • Send document image via msg.payload.image_path
  • Send questions via msg.payload.questions (array of strings)
  • Optionally provide OCR data via msg.payload.words_and_bboxes
  • Receive answers in msg.payload
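In a Node-RED flow, the input message is typically assembled in a Function node placed before this block. The sketch below is a minimal, hypothetical example; the file path and question text are placeholders.

```javascript
// Build the input message for the Document Question Answering block.
// Path, questions, and OCR data below are illustrative placeholders.
function buildDocQaPayload(imagePath, questions, wordsAndBboxes) {
  const payload = {
    image_path: imagePath, // relative path on shared storage
    questions: questions,  // array of question strings
  };
  if (wordsAndBboxes) {
    payload.words_and_bboxes = wordsAndBboxes; // optional OCR data
  }
  return payload;
}

// In a Function node: msg.payload = buildDocQaPayload(...); return msg;
const msg = {
  payload: buildDocQaPayload("documents/invoice.png", [
    "What is the invoice number?",
  ]),
};
```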

Configuration

Model to use (required)

Select a pre-trained model from the dropdown menu. Models must be trained beforehand using the document question answering trainer block.

Number of Answers (optional)

Number of answers to return per question. Default: 1

Example: 1, 3, 5

Threshold (optional, for some algorithms)

Confidence threshold (0 to 1) for filtering answers. Default: 0.5

Note: Lower values return more answers with potentially lower confidence
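The effect of the threshold can be illustrated with a small filter over candidate answers. This is a sketch of the filtering behavior described above, not the block's actual implementation; the scores are invented.

```javascript
// Illustrative only: keep candidate answers whose confidence score
// meets the threshold. Mirrors the behavior described in the docs.
function filterByThreshold(candidates, threshold) {
  return candidates.filter((c) => c.score >= threshold);
}

const candidates = [
  { answer: "INV-001", score: 0.94 },
  { answer: "INV-002", score: 0.41 },
];

const strict = filterByThreshold(candidates, 0.5);  // default threshold
const lenient = filterByThreshold(candidates, 0.3); // more answers, lower confidence
```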

Common Input Format (All Algorithms)

msg.payload.image_path (string)

Relative path of the document image file on shared storage.

Example: "documents/invoice.png"

Supported formats: .png, .jpg, .jpeg (case insensitive)

msg.payload.questions (array)

Array of questions to ask about the document.

Example: ["What is the invoice number?", "What is the total amount?"]

msg.payload.words_and_bboxes (array, optional for some algorithms)

Optional OCR data containing words and bounding boxes. Can improve accuracy.

Format: [[[x1, y1, x2, y2], "word"], ...]

Example: [[[29, 23, 150, 45], "Invoice"], [[29, 50, 200, 70], "INV-001"]]
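If you produce the OCR data yourself, a quick structural check can catch malformed entries before sending. The helper below is a hypothetical validator for the `[[[x1, y1, x2, y2], "word"], ...]` format, not part of the block's API.

```javascript
// Hypothetical validator for the [[[x1, y1, x2, y2], "word"], ...] format.
function isValidWordsAndBboxes(data) {
  return (
    Array.isArray(data) &&
    data.every(
      (entry) =>
        Array.isArray(entry) &&
        entry.length === 2 &&
        Array.isArray(entry[0]) &&
        entry[0].length === 4 &&
        entry[0].every((n) => typeof n === "number") &&
        typeof entry[1] === "string"
    )
  );
}

const ok = isValidWordsAndBboxes([
  [[29, 23, 150, 45], "Invoice"],
  [[29, 50, 200, 70], "INV-001"],
]);
const bad = isValidWordsAndBboxes([["Invoice", [29, 23, 150, 45]]]);
```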

Low Infra - Returns BBox of Answers

Use this mode when you need the answer and its bounding box (for highlighting, validation, or UI overlays).

Document Question Answering configuration that returns answers with bounding boxes

Output shape (high level):

  • msg.payload = { "output": { "<question>": [ { "answer": "...", "score": 0.92, "bbox": [x1, y1, x2, y2] } ] } } — each key is the question text itself
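For highlighting or UI overlays, the pixel-space bbox can be turned into positioning data for a rendered image. The sketch below is one possible approach (converting a bbox to CSS-style absolute positioning); the bbox values are taken from the example later in this document.

```javascript
// Convert a pixel-space bbox [x1, y1, x2, y2] into a CSS overlay style
// for highlighting an answer on a rendered document image.
// `scale` adjusts for a display size different from the source image.
function bboxToOverlayStyle(bbox, scale = 1) {
  const [x1, y1, x2, y2] = bbox;
  return {
    position: "absolute",
    left: `${x1 * scale}px`,
    top: `${y1 * scale}px`,
    width: `${(x2 - x1) * scale}px`,
    height: `${(y2 - y1) * scale}px`,
  };
}

const style = bboxToOverlayStyle([470, 100, 650, 150]);
```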

Low Infra - Does not Return BBox of Answers

Use this mode when you only need answer text and confidence.

Output shape (high level):

  • msg.payload = { "output": { "<question>": [ { "answer": "...", "score": 0.92 } ] } } — each key is the question text itself
  • Some models may also include coordinates when available.
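Downstream of either mode, the per-question answer arrays can be reduced to a single best answer. A minimal sketch, assuming the output shape above (the sample data here is invented):

```javascript
// Reduce the block's output to the top-scoring answer per question.
// Assumes the { "<question>": [ { answer, score, ... } ] } shape.
function topAnswers(output) {
  const best = {};
  for (const [question, answers] of Object.entries(output)) {
    best[question] = answers.reduce((a, b) => (b.score > a.score ? b : a));
  }
  return best;
}

const sample = {
  "What is the total amount?": [
    { answer: "$500.00", score: 0.91 },
    { answer: "$50.00", score: 0.22 },
  ],
};
const best = topAnswers(sample);
```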

Moderate Infra - OCR Free

Use this mode when OCR data is not available. The block will work directly from the image.

Document Question Answering OCR-free configuration

Output shape (high level):

  • msg.payload = {"output": {"<question>": ["answer"]}} — each key is the question text itself; answers are plain strings without scores or bounding boxes

Example

Input (msg.payload)

{
  "image_path": "documents/invoice.png",
  "questions": ["What is the invoice number?", "What is the total amount?"],
  "words_and_bboxes": [
    [[29, 23, 150, 45], "Invoice"],
    [[29, 50, 200, 70], "INV-001"]
  ]
}

Output (msg.payload)

{
  "output": {
    "What is the invoice number?": [
      { "answer": "INV-001", "score": 0.94, "bbox": [470, 100, 650, 150] }
    ],
    "What is the total amount?": [
      { "answer": "$500.00", "score": 0.91, "bbox": [220, 200, 350, 250] }
    ]
  }
}

Errors

When the block fails, it raises an error. Use a Catch block in your flow to handle failures and inspect the error payload.

Common mistakes

  • Empty questions array: msg.payload.questions must contain at least one question.
  • Missing image path: msg.payload.image_path is required and must point to a file on shared storage.
  • Wrong OCR structure: If provided, msg.payload.words_and_bboxes must follow [[[x1, y1, x2, y2], "word"], ...].
  • Threshold out of range: If you set a threshold, keep it between 0 and 1.
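The checks above can be wrapped into a single pre-flight validation run before the block, sketched here as a hypothetical helper (it is not part of the block itself):

```javascript
// Hypothetical pre-flight check covering the common mistakes listed above.
// Returns a list of problems; an empty list means the payload looks sane.
function validateDocQaPayload(payload, threshold) {
  const problems = [];
  if (typeof payload.image_path !== "string" || payload.image_path === "") {
    problems.push("image_path is required");
  }
  if (!Array.isArray(payload.questions) || payload.questions.length === 0) {
    problems.push("questions must contain at least one question");
  }
  if (
    payload.words_and_bboxes !== undefined &&
    !Array.isArray(payload.words_and_bboxes)
  ) {
    problems.push("words_and_bboxes must be an array");
  }
  if (threshold !== undefined && (threshold < 0 || threshold > 1)) {
    problems.push("threshold must be between 0 and 1");
  }
  return problems;
}

const ok = validateDocQaPayload(
  { image_path: "documents/invoice.png", questions: ["What is the total?"] }
);
const bad = validateDocQaPayload({ questions: [] }, 1.5);
```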

Best Practices

  • Use clear, well-scanned document images for better answer extraction
  • Formulate specific and precise questions (e.g., "What is the invoice number?" instead of "invoice?")
  • Provide OCR data (words_and_bboxes) when available to improve accuracy
  • Use OCR-free mode when OCR data is unavailable or preprocessing is difficult
  • Start with a higher threshold (0.7-0.8) for critical applications, lower (0.3-0.5) for better coverage
  • Use answer-with-coordinates mode when you need to highlight or verify answer locations visually
  • Test with multiple questions in one call to reduce processing time
  • Always validate answers in production applications, especially for critical data extraction