LLM Context Indexer

Adds, updates, or deletes documents in vector or document stores. Supports multiple formats for RAG applications, including PDF, WORD, MD, TXT, Excel, JSON, and images.

Supported Document Types

PDF / WORD

[Screenshots: PDF/WORD document processing configuration, Parts 1 to 3]

PDF and WORD documents support layout or markdown extraction, with an optional image RAG pass.

MD / TXT

Markdown and text files with configurable chunk size and overlap.

Excel

[Screenshots: Excel unstructured mode configuration, Parts 1 and 2; Excel pandas mode configuration]

Unstructured or pandas-style processing. You can either choose a storage option to ingest the extracted content, or select "Return extraction directly" to bypass storage and receive the extracted content in the block output.

JSON

Ingest raw JSON documents (non-embedding by default) or optionally embed a chosen content field.

IMAGE

Image embedding ingestion (enabled via the Image RAG option).

Unified Collection Management

The indexer uses a unified collection approach where a single collection_name represents both text and image modalities for PDF/WORD documents. If the UI option for Image RAG is enabled, image embeddings are ingested for supported document types.

Embedding and reranking models

You can select embedding and reranking models from the options available in the editor UI. In general:

  • Larger context models tend to produce richer embeddings but can increase latency and cost.
  • If a token limit (or similar) is available, set it explicitly to match your retrieval needs.

Inputs and block configuration by document type

Common (all types unless noted)

msg.payload.document_paths string[] or string

required. PDF/WORD/MD/TXT/Excel/IMAGE: string[]. JSON: string or string[].

msg.payload.db_to_injest string

one of qdrant, mongodb, mongodb_atlas (not required for Excel "Return extraction directly").

msg.payload.collection_name string

required for storage-backed modes.

msg.payload.mongodb_uri string

required when db_to_injest is a DB-backed option.

msg.payload.mongodb_db_name / msg.payload.mongodb_db string

required when db_to_injest is a DB-backed option.
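The common fields can be checked before a message reaches the block. The following is a minimal sketch in Python, not the block's own validation: the helper name check_common_fields is hypothetical, and it covers only document_paths, db_to_injest, and collection_name for storage-backed modes.

```python
# Values documented for db_to_injest.
ALLOWED_DBS = ("qdrant", "mongodb", "mongodb_atlas")


def check_common_fields(payload: dict) -> list:
    """Return a list of problems found in the common fields (empty if none)."""
    problems = []

    paths = payload.get("document_paths")
    if isinstance(paths, str):
        # JSON mode also accepts a single string; normalise it to a list.
        paths = [paths]
    if not isinstance(paths, list) or not paths:
        problems.append("document_paths: provide at least one path")

    if payload.get("db_to_injest") not in ALLOWED_DBS:
        problems.append("db_to_injest: expected qdrant, mongodb, or mongodb_atlas")

    if not payload.get("collection_name"):
        problems.append("collection_name: required for storage-backed modes")

    return problems


example = {
    "document_paths": ["docs/policies/terms.pdf"],
    "db_to_injest": "qdrant",
    "collection_name": "policies",
}
problems = check_common_fields(example)
```

An empty list means none of the three checks found a problem. Excel "Return extraction directly" payloads should skip this check, since they do not need the storage fields.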

PDF / WORD

extraction_type string

layout_based | plain_markdown_based.

chunk_size, chunk_overlap number

for plain_markdown_based.

chunk_token_size string or number

for layout_based (integer or auto).

is_image_rag_required boolean

If true, also provide collection_name_for_image_rag and image_embedding_model_name.
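Which chunking fields apply depends on extraction_type. A small sketch of that rule follows; the helper name pdf_word_options is hypothetical, and the Image RAG fields are left out.

```python
def pdf_word_options(extraction_type, chunk_size=None, chunk_overlap=None,
                     chunk_token_size="auto"):
    """Return only the chunking fields that apply to the chosen extraction_type."""
    if extraction_type == "plain_markdown_based":
        if chunk_size is None or chunk_overlap is None:
            raise ValueError("plain_markdown_based needs chunk_size and chunk_overlap")
        return {
            "extraction_type": extraction_type,
            "chunk_size": chunk_size,
            "chunk_overlap": chunk_overlap,
        }
    if extraction_type == "layout_based":
        # chunk_token_size is either an integer or the string "auto".
        is_int = isinstance(chunk_token_size, int) and not isinstance(chunk_token_size, bool)
        if chunk_token_size != "auto" and not is_int:
            raise ValueError("chunk_token_size must be an integer or 'auto'")
        return {
            "extraction_type": extraction_type,
            "chunk_token_size": chunk_token_size,
        }
    raise ValueError(f"unknown extraction_type: {extraction_type!r}")


payload = {
    "document_paths": ["docs/policies/terms.pdf"],
    "db_to_injest": "qdrant",
    "collection_name": "policies",
    **pdf_word_options("layout_based", chunk_token_size="auto"),
}
```

Building the options through one function keeps chunk_size and chunk_overlap out of layout_based payloads, and chunk_token_size out of plain_markdown_based ones.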

MD / TXT

chunk_size, chunk_overlap number

integers.
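To make the two parameters concrete, here is a minimal sliding-window splitter. It is an illustration only: the function name chunk_text is hypothetical, it counts characters, and the indexer's actual splitter (including whether it counts characters or tokens) is not specified here.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list:
    """Split text into windows of chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


chunks = chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1)
# chunks == ["abcd", "defg", "ghij"]
```

Each chunk starts with the last character of the previous one, which is the effect of chunk_overlap=1. A larger overlap preserves more context across chunk boundaries but produces more chunks to store.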

Excel

consumption_type string

ingest_in_db | return_extraction_directly.

excel_mode string

unstructured | pandas.

db_to_injest string

If ingest_in_db: supply db_to_injest and the relevant collection/connection fields.

If return_extraction_directly: DB fields are ignored; the block returns processed content directly.
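The two consumption types need different fields. The sketch below builds an Excel payload accordingly; build_excel_payload is a hypothetical helper, and the file path and collection name are made up for illustration.

```python
def build_excel_payload(paths, excel_mode, consumption_type,
                        db_to_injest=None, collection_name=None):
    """Build an Excel payload, adding storage fields only for ingest_in_db."""
    if excel_mode not in ("unstructured", "pandas"):
        raise ValueError("excel_mode must be 'unstructured' or 'pandas'")
    if consumption_type not in ("ingest_in_db", "return_extraction_directly"):
        raise ValueError("unknown consumption_type")

    payload = {
        "document_paths": list(paths),
        "excel_mode": excel_mode,
        "consumption_type": consumption_type,
    }
    if consumption_type == "ingest_in_db":
        if not db_to_injest or not collection_name:
            raise ValueError("ingest_in_db needs db_to_injest and collection_name")
        payload["db_to_injest"] = db_to_injest
        payload["collection_name"] = collection_name
    # For return_extraction_directly the DB fields are ignored, so they are omitted.
    return payload


direct = build_excel_payload(["reports/q1.xlsx"], "pandas", "return_extraction_directly")
stored = build_excel_payload(["reports/q1.xlsx"], "unstructured", "ingest_in_db",
                             db_to_injest="qdrant", collection_name="reports")
```

Omitting the storage fields in the direct case is a design choice of this sketch; the block ignores them either way.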

JSON

json_content_field string

when missing, null, "none", or "nan", JSON is treated as non-embedding (Mongo only) and raw docs are stored.
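The rule above can be expressed as a single check. This is a sketch of the documented behavior, not the block's code: uses_embedding is a hypothetical helper, "body" is a made-up field name, and exact, case-sensitive matching of "none" and "nan" is an assumption.

```python
def uses_embedding(payload: dict) -> bool:
    """Return True when json_content_field names a field to embed."""
    field = payload.get("json_content_field")  # a missing key yields None
    return field not in (None, "none", "nan")


non_embedding = [
    uses_embedding({}),                              # missing
    uses_embedding({"json_content_field": None}),    # null
    uses_embedding({"json_content_field": "none"}),
    uses_embedding({"json_content_field": "nan"}),
]
embedding = uses_embedding({"json_content_field": "body"})
```

All four entries of non_embedding are False, so those payloads store raw documents without embeddings; only the last payload requests embedding of a content field.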

IMAGE

db_to_injest string

must be qdrant.

collection_name_for_image_rag string

required.
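The IMAGE constraints are stricter than the common ones, so a separate check can be useful. A minimal sketch, with check_image_payload as a hypothetical helper and made-up path and collection values:

```python
def check_image_payload(payload: dict) -> list:
    """Return a list of problems specific to IMAGE ingestion (empty if none)."""
    problems = []
    paths = payload.get("document_paths")
    if not isinstance(paths, list) or not paths:
        problems.append("document_paths: IMAGE expects a non-empty list of paths")
    if payload.get("db_to_injest") != "qdrant":
        problems.append("db_to_injest: must be qdrant for IMAGE")
    if not payload.get("collection_name_for_image_rag"):
        problems.append("collection_name_for_image_rag: required")
    return problems


image_problems = check_image_payload({
    "document_paths": ["figures/diagram.png"],
    "db_to_injest": "mongodb",
    "collection_name_for_image_rag": "figures",
})
```

Here image_problems contains one entry, because mongodb is not accepted for image ingestion.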

Outputs

msg.payload contains an output field with the ingestion results. Common keys in msg.payload.output:

status string

ok | error

document_type string

one of pdf, word, md, txt, excel, json, image

db string or null

qdrant | mongodb | mongodb_atlas | null (Excel direct)

failed_docs string[]

list of filenames that failed

passed_docs object

object keyed by filename with counts:

  • PDF/WORD/MD/TXT/Excel(ingest_in_db): num_chunks and, if applicable, num_images
  • IMAGE: num_images
  • JSON non-embedding: num_chunks (number of raw docs stored)

metadata_keys object

object keyed by filename; each value is a representative metadata preview (JSON-safe values only)

extractions object

Excel return_extraction_directly: object keyed by filename with the extracted content (list of chunks for unstructured, HTML for pandas)

Example

Input (msg.payload)

{
  "document_paths": ["docs/policies/terms.pdf"],
  "db_to_injest": "qdrant",
  "collection_name": "policies",
  "extraction_type": "layout_based",
  "chunk_token_size": "auto"
}

Output (msg.payload)

{
  "output": {
    "status": "ok",
    "document_type": "pdf",
    "db": "qdrant",
    "failed_docs": [],
    "passed_docs": {
      "terms.pdf": {
        "num_chunks": 42,
        "num_images": 3
      }
    }
  }
}
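A downstream step will often reduce this output to a few totals. The sketch below does that for the example output above; summarize_output is a hypothetical helper and relies only on the keys documented in the Outputs section.

```python
def summarize_output(msg_payload: dict) -> dict:
    """Collapse msg.payload.output into totals that are easy to log or branch on."""
    output = msg_payload["output"]
    passed = output.get("passed_docs", {})
    return {
        "ok": output.get("status") == "ok",
        # num_chunks and num_images are optional per document, so default to 0.
        "total_chunks": sum(counts.get("num_chunks", 0) for counts in passed.values()),
        "total_images": sum(counts.get("num_images", 0) for counts in passed.values()),
        "failed": list(output.get("failed_docs", [])),
    }


summary = summarize_output({
    "output": {
        "status": "ok",
        "document_type": "pdf",
        "db": "qdrant",
        "failed_docs": [],
        "passed_docs": {"terms.pdf": {"num_chunks": 42, "num_images": 3}},
    }
})
# summary == {"ok": True, "total_chunks": 42, "total_images": 3, "failed": []}
```

Checking the failed list in addition to status lets a flow retry only the documents that did not ingest.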

Delete Mode

[Screenshot: LLM Context Indexer, Delete documents configuration]

Select Delete documents to remove specific files using msg.payload.document_names. The system uses the unified collection approach:

  • Base collection name: Provide collection_name to identify the logical collection
  • Automatic resolution: The system automatically routes the delete request to the correct storage for that collection

Select Delete collection to remove an entire collection:

  • Base collection name: Provide collection_name to identify the logical collection
  • Complete removal: Removes all data associated with the base collection name
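A Delete documents message needs the base collection name and the files to remove. The following sketch builds such a payload; build_delete_documents_payload is a hypothetical helper, and treating document_names as a list of filename strings is an assumption, since its exact type is not stated here. Any connection fields the selected store requires would still have to be added.

```python
def build_delete_documents_payload(collection_name, document_names):
    """Build a payload for Delete documents mode (assumed shape)."""
    if not collection_name:
        raise ValueError("collection_name is required")
    # Accept a single name or any iterable of names; always emit a list.
    if isinstance(document_names, str):
        names = [document_names]
    else:
        names = list(document_names)
    if not names:
        raise ValueError("document_names must contain at least one name")
    return {"collection_name": collection_name, "document_names": names}


delete_payload = build_delete_documents_payload("policies", ["terms.pdf"])
```

Rejecting an empty name list up front avoids sending a delete request that has nothing to target.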

Errors

When the block fails, it raises an error. Use a Catch block in your flow to handle failures and inspect the error payload.

Validation failures (missing paths, invalid configuration, etc.) raise an editor error. Ensure all numeric fields are either numbers or valid {{...}} expressions.
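The numeric-field rule can be approximated with a short check. This is a sketch under assumptions: is_valid_numeric_field is a hypothetical helper, and it accepts any non-empty text wrapped in double braces as an expression, which may be looser than what the editor actually accepts. Fields that also take a keyword, such as chunk_token_size with "auto", need their own handling.

```python
import re

# Assumed shape of a template expression: non-empty text wrapped in {{ }}.
TEMPLATE = re.compile(r"\{\{.+\}\}")


def is_valid_numeric_field(value) -> bool:
    """Accept real numbers or {{...}} expressions; reject everything else."""
    if isinstance(value, bool):
        return False  # bool is a subclass of int in Python, so exclude it explicitly
    if isinstance(value, (int, float)):
        return True
    return isinstance(value, str) and TEMPLATE.fullmatch(value.strip()) is not None


checks = {
    "number": is_valid_numeric_field(512),
    "expression": is_valid_numeric_field("{{msg.payload.chunk_size}}"),
    "numeric string": is_valid_numeric_field("512"),
    "empty expression": is_valid_numeric_field("{{}}"),
}
```

Under these assumptions the first two checks pass and the last two fail, which mirrors the rule that a numeric field holds either a number or an expression, not a quoted number.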

Common mistakes

  • Missing document_paths: Provide at least one path (or JSON content depending on mode).
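  • DB fields don't match the selected mode: Storage-backed modes require collection_name and the connection fields for the selected store, IMAGE ingestion requires db_to_injest to be qdrant, and Excel "Return extraction directly" ignores storage settings.
  • Invalid DB settings: Check that the connection values (for example, mongodb_uri and the database name) and collection_name are set correctly for the store selected in db_to_injest.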

Recommendations for Best Use

  • Set explicit token limits when available: If the UI exposes a token limit, set it to match your retrieval needs to control cost and latency.
  • Prefer non-embedding JSON unless needed: Only enable embedding-based indexing when you plan to retrieve by semantic similarity.
  • Keep collections cleanly separated: Use distinct collection_name values per project or dataset. Avoid reusing collection names across unrelated ingestions to prevent data mixing and maintain clear organizational boundaries.
  • Enable Image RAG selectively: For PDFs with images, only enable image RAG when you plan to retrieve by visual context (diagrams, charts, screenshots). Otherwise, skip image embedding to save processing time and storage costs.
  • Use precise filters for non-embedding retrieval: Always pass a restrictive query/pipeline to avoid scanning full collections.
  • Choose chunk parameters conservatively: Start with moderate chunk_size and chunk_overlap values, then adjust based on retrieval quality.