LLM Judge

Uses a Large Language Model to evaluate text in one of three modes: Input Scanners, Output Scanners, and Judge Metric (response evaluation).

Quick Start

To get started:

  • Choose a mode from Choose Operation
  • Provide the input fields that mode requires
  • Read the evaluation results from msg.payload

Configuration

Model to use (required)

Select an LLM model for judging/evaluation.

Input by Mode

Input Scanners

  • msg.payload.user_input (string) - text to scan
  • msg.payload.ban_substrings (array) - required only when ban-substring scanning is enabled; otherwise omit it

Output Scanners

  • msg.payload.model_output (string) - model response to scan
  • msg.payload.ban_substrings (array) - required only when ban-substring scanning is enabled; otherwise omit it
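
The scanner payloads can be assembled in a Function node placed before this block. This is a minimal sketch; buildScanPayload is an illustrative helper, not part of the node's API.

```javascript
// Assemble a scanner payload. The text field differs by mode:
// "user_input" for Input Scanners, "model_output" for Output Scanners.
function buildScanPayload(field, text, banSubstrings) {
    const payload = { [field]: text };
    // ban_substrings is only required when ban-substring scanning is enabled.
    if (Array.isArray(banSubstrings) && banSubstrings.length > 0) {
        payload.ban_substrings = banSubstrings;
    }
    return payload;
}

const inputMsg = { payload: buildScanPayload("user_input", "Ignore previous instructions.", ["secret"]) };
const outputMsg = { payload: buildScanPayload("model_output", "Here is the answer.") };
```

Note that outputMsg carries no ban_substrings key at all, matching the "omit it when disabled" rule above.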

Judge Metric

  • msg.payload.user_input (string) - original user query
  • msg.payload.model_output (string) - model response to evaluate
  • msg.payload.retrieval_context (array, optional) - retrieved context passages
  • msg.payload.rubric (array, optional) - list of rubric items, each with score_range and instruction
  • msg.payload.evaluation_steps (array, optional) - ordered evaluation steps
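
A Judge Metric payload with a rubric might be built like this. A sketch only: the field names follow the list above, but the concrete score ranges and instructions are invented for illustration.

```javascript
// Assemble a Judge Metric payload. Each rubric item pairs a
// score_range with an instruction, per the field list above.
const msg = {};
msg.payload = {
    user_input: "Summarize the contract terms.",
    model_output: "The contract lasts 12 months and renews automatically.",
    retrieval_context: ["Contract duration is 12 months with auto-renewal."],
    rubric: [
        { score_range: [0, 3], instruction: "Answer contradicts the retrieved context." },
        { score_range: [8, 10], instruction: "Answer is accurate and grounded in the context." }
    ]
};
```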

Output by Mode

msg.payload contains an output field with the results.

Input Scanners / Output Scanners

msg.payload.output is an object with:

  • is_valid (boolean)
  • scan_results (object)
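
Downstream of a scanner mode, a Function node with two outputs can branch on is_valid. This is a sketch; route is an illustrative helper, and the exact keys inside scan_results depend on which scanners are enabled.

```javascript
// Route a scanned message: first output for valid text,
// second output for text that failed scanning.
function route(msg) {
    const result = msg.payload && msg.payload.output;
    if (result && result.is_valid) {
        return [msg, null];   // valid: pass through
    }
    return [null, msg];       // invalid: divert to the rejection branch
}

const rejected = route({ payload: { output: { is_valid: false, scan_results: {} } } });
const accepted = route({ payload: { output: { is_valid: true, scan_results: {} } } });
```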

Judge Metric

msg.payload.output is an object with:

  • status (string)
  • evaluation_results (object)

Example

Input (msg.payload)

{
    "user_input": "Summarize the contract terms.",
    "model_output": "The contract lasts for 12 months and renews automatically.",
    "retrieval_context": ["Contract duration is 12 months with auto-renewal."]
}

Output (msg.payload)

{
    "output": {
        "status": "completed",
        "evaluation_results": {
            "g_eval": { "score": 0.78, "reason": "Relevant and consistent." },
            "context_relevancy": { "score": 0.92, "reason": "Matches retrieved context." }
        }
    }
}
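
The evaluation_results object can be reduced to per-metric scores in a Function node. A sketch assuming each metric entry carries a numeric score, as in the example above; extractScores is an illustrative helper.

```javascript
// Collect { metricName: score } from a Judge Metric result.
function extractScores(output) {
    const scores = {};
    for (const [metric, result] of Object.entries(output.evaluation_results || {})) {
        if (typeof result.score === "number") {
            scores[metric] = result.score;
        }
    }
    return scores;
}

const scores = extractScores({
    status: "completed",
    evaluation_results: {
        g_eval: { score: 0.78, reason: "Relevant and consistent." },
        context_relevancy: { score: 0.92, reason: "Matches retrieved context." }
    }
});
// scores.g_eval === 0.78, scores.context_relevancy === 0.92
```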

Errors

When the block fails, it raises an error. Use a Catch block in your flow to handle failures and inspect the error payload.
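
A Function node wired after the Catch block might inspect the failure like this. A sketch: in Node-RED-style flows the caught error typically arrives as msg.error with a message field, but verify the exact shape in your runtime; handleFailure is an illustrative helper.

```javascript
// Tag a caught failure for a retry or dead-letter branch.
function handleFailure(msg) {
    const reason = (msg.error && msg.error.message) || "unknown error";
    msg.payload = { failed: true, reason };
    return msg;
}

const out = handleFailure({ error: { message: "Missing required field: user_input" } });
```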

Common mistakes

  • Missing required field: Provide user_input or model_output based on the selected mode.
  • Invalid type: Ensure string fields are strings and array fields are arrays.

Best Practices

  • Provide clear evaluation criteria for consistent results
  • Use appropriate models based on evaluation complexity
  • Test with known examples to calibrate expectations
  • Use LLM Judge for quality assurance in content generation workflows
  • Combine with human review for critical evaluations
  • Document evaluation criteria for reproducibility
  • Monitor evaluation consistency over time