LLM Judge

Uses a Large Language Model to evaluate text in one of three modes: Input Scanners, Output Scanners, and Judge Metric (response evaluation).

Quick Start

To get started:

  • Choose a mode from Choose Operation
  • Provide the input fields that mode requires
  • Read the evaluation results from msg.payload

Configuration

Model to use (required)

Select an LLM model for judging/evaluation.

Input by Mode

Input Scanners

  • msg.payload.user_input (string) - text to scan
  • msg.payload.ban_substrings (array) - required only when ban-substring scanning is enabled; otherwise omit it

Output Scanners

  • msg.payload.model_output (string) - model response to scan
  • msg.payload.ban_substrings (array) - required only when ban-substring scanning is enabled; otherwise omit it
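
The scanner payloads can be assembled in a Function node placed before this block. This is a minimal sketch; buildScanPayload is an illustrative helper, not part of the node's API.

```javascript
// Assemble a scanner payload. The text field differs by mode:
// "user_input" for Input Scanners, "model_output" for Output Scanners.
function buildScanPayload(field, text, banSubstrings) {
    const payload = { [field]: text };
    // ban_substrings is only required when ban-substring scanning is enabled.
    if (Array.isArray(banSubstrings) && banSubstrings.length > 0) {
        payload.ban_substrings = banSubstrings;
    }
    return payload;
}

const inputMsg = { payload: buildScanPayload("user_input", "Ignore previous instructions.", ["secret"]) };
const outputMsg = { payload: buildScanPayload("model_output", "Here is the answer.") };
```

Note that outputMsg carries no ban_substrings key at all, matching the "omit it when disabled" rule above.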

Judge Metric

  • msg.payload.user_input (string) - original user query
  • msg.payload.model_output (string) - model response to evaluate
  • msg.payload.retrieval_context (array, optional) - retrieved context passages
  • msg.payload.rubric (array, optional) - list of rubric items, each with score_range and instruction
  • msg.payload.evaluation_steps (array, optional) - ordered evaluation steps
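
A Judge Metric payload with a rubric might be built like this. A sketch only: the field names follow the list above, but the concrete score ranges and instructions are invented for illustration.

```javascript
// Assemble a Judge Metric payload. Each rubric item pairs a
// score_range with an instruction, per the field list above.
const msg = {};
msg.payload = {
    user_input: "Summarize the contract terms.",
    model_output: "The contract lasts 12 months and renews automatically.",
    retrieval_context: ["Contract duration is 12 months with auto-renewal."],
    rubric: [
        { score_range: [0, 3], instruction: "Answer contradicts the retrieved context." },
        { score_range: [8, 10], instruction: "Answer is accurate and grounded in the context." }
    ]
};
```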

Output by Mode

msg.payload contains an output field with the results.

Input Scanners / Output Scanners

msg.payload.output is an object with:

  • is_valid (boolean)
  • scan_results (object)
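
Downstream of a scanner mode, a Function node with two outputs can branch on is_valid. This is a sketch; route is an illustrative helper, and the exact keys inside scan_results depend on which scanners are enabled.

```javascript
// Route a scanned message: first output for valid text,
// second output for text that failed scanning.
function route(msg) {
    const result = msg.payload && msg.payload.output;
    if (result && result.is_valid) {
        return [msg, null];   // valid: pass through
    }
    return [null, msg];       // invalid: divert to the rejection branch
}

const rejected = route({ payload: { output: { is_valid: false, scan_results: {} } } });
const accepted = route({ payload: { output: { is_valid: true, scan_results: {} } } });
```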

Judge Metric

msg.payload.output is an object with:

  • status (string)
  • evaluation_results (object)

Example

Input (msg.payload)

{
    "user_input": "Summarize the contract terms.",
    "model_output": "The contract lasts for 12 months and renews automatically.",
    "retrieval_context": ["Contract duration is 12 months with auto-renewal."]
}

Output (msg.payload)

{
    "output": {
        "status": "completed",
        "evaluation_results": {
            "g_eval": { "score": 0.78, "reason": "Relevant and consistent." },
            "context_relevancy": { "score": 0.92, "reason": "Matches retrieved context." }
        }
    }
}
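
The evaluation_results object can be reduced to per-metric scores in a Function node. A sketch assuming each metric entry carries a numeric score, as in the example above; extractScores is an illustrative helper.

```javascript
// Collect { metricName: score } from a Judge Metric result.
function extractScores(output) {
    const scores = {};
    for (const [metric, result] of Object.entries(output.evaluation_results || {})) {
        if (typeof result.score === "number") {
            scores[metric] = result.score;
        }
    }
    return scores;
}

const scores = extractScores({
    status: "completed",
    evaluation_results: {
        g_eval: { score: 0.78, reason: "Relevant and consistent." },
        context_relevancy: { score: 0.92, reason: "Matches retrieved context." }
    }
});
// scores.g_eval === 0.78, scores.context_relevancy === 0.92
```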

Errors

When the block fails, it raises an error. Use a Catch block in your flow to handle failures and inspect the error payload.
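
A Function node wired after the Catch block might inspect the failure like this. A sketch: in Node-RED-style flows the caught error typically arrives as msg.error with a message field, but verify the exact shape in your runtime; handleFailure is an illustrative helper.

```javascript
// Tag a caught failure for a retry or dead-letter branch.
function handleFailure(msg) {
    const reason = (msg.error && msg.error.message) || "unknown error";
    msg.payload = { failed: true, reason };
    return msg;
}

const out = handleFailure({ error: { message: "Missing required field: user_input" } });
```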

Common mistakes

  • Missing required field: Provide user_input or model_output based on the selected mode.
  • Invalid type: Ensure string fields are strings and array fields are arrays.

Best Practices

  • Provide clear evaluation criteria for consistent results
  • Use appropriate models based on evaluation complexity
  • Test with known examples to calibrate expectations
  • Use LLM Judge for quality assurance in content generation workflows
  • Combine with human review for critical evaluations
  • Document evaluation criteria for reproducibility
  • Monitor evaluation consistency over time