AI Judge Block
The AI Judge block uses AI to evaluate whether generated content meets your quality standards. It acts as an automated quality gate, validating AI outputs against specified criteria before they proceed in your workflow.
Quality at Scale: AI Judge enables you to maintain consistent quality standards across thousands of AI-generated outputs without manual review of every response.
Key Features
- Automated Quality Control - Evaluate AI responses against custom criteria
- Multi-Model Support - Works with OpenAI, Gemini, Groq, and other providers
- Custom Evaluation Rules - Define exactly what makes a response acceptable
- Confidence Scoring - Get quantified confidence in pass/fail decisions
- Detailed Feedback - Receive explanations for why content passed or failed
- Multi-Criteria Assessment - Check multiple quality dimensions simultaneously
- Conditional Routing - Branch workflows based on evaluation results
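Conceptually, an AI judge is a single LLM call that receives your criteria plus the content, and returns a structured verdict. A minimal, provider-agnostic sketch of that pattern (the `call_model` hook and field names here are illustrative, not the block's actual API):

```python
import json

def ai_judge(call_model, instructions, response_to_judge, context=None):
    """Ask an evaluation model for a structured pass/fail verdict.

    `call_model` is any function that takes a prompt string and returns
    the model's text output (hypothetical -- wire in your provider here).
    """
    prompt = (
        "You are a strict quality judge.\n"
        f"Criteria:\n{instructions}\n\n"
        f"Content to evaluate:\n{response_to_judge}\n\n"
        + (f"Context:\n{json.dumps(context)}\n\n" if context else "")
        + 'Reply with JSON: {"is_followed": true or false, "thinking": "..."}'
    )
    raw = call_model(prompt)
    verdict = json.loads(raw)
    return bool(verdict["is_followed"]), verdict.get("thinking", "")

# Stubbed model call for demonstration only
def fake_model(prompt):
    return '{"is_followed": true, "thinking": "Polite, answers the question."}'

passed, feedback = ai_judge(fake_model, "Tone must be professional.", "Happy to help!")
```

Requesting JSON output keeps the verdict machine-readable, which is what lets a downstream Condition block branch on it.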
When to Use AI Judge
| Use Case | Description |
|---|---|
| Customer Support QA | Validate response accuracy and tone before sending |
| Content Moderation | Check for inappropriate, harmful, or off-brand content |
| Compliance Checking | Ensure responses meet regulatory requirements |
| Brand Voice Validation | Verify content aligns with brand guidelines |
| Factual Accuracy | Cross-check AI claims against source data |
| Response Completeness | Confirm all required elements are present |
| Hallucination Detection | Flag responses that may contain made-up information |
Configuration
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| Model | Select | Yes | gpt-4 | The evaluation model to use |
| Instructions | Text | Yes | - | Criteria for judging the response |
| Response to Judge | Text | Yes | - | The content to evaluate (usually a variable) |
| Context Data | Object | No | - | Additional context for evaluation |
| Scoring Mode | Select | No | Pass/Fail | Pass/Fail or Numeric Score (1-10) |
| Threshold | Number | No | 7 | Minimum score to pass (for numeric mode) |
| Require Feedback | Boolean | No | true | Generate detailed feedback |
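The same parameters might look like this in an exported workflow definition, together with the threshold rule for numeric mode (field names are illustrative and mirror the table above, not a guaranteed export format):

```python
# Hypothetical exported AI Judge block settings (names mirror the table above)
ai_judge_config = {
    "model": "gpt-4",                     # default evaluation model
    "instructions": "PASS if the reply answers the question politely.",
    "response_to_judge": "{{llmAgent.response}}",
    "context_data": {"question": "{{user.question}}"},  # optional
    "scoring_mode": "numeric",            # "pass_fail" or "numeric"
    "threshold": 7,                       # minimum 1-10 score to pass (numeric mode)
    "require_feedback": True,
}

def passes(score, config):
    """In numeric mode, a score passes when it meets the threshold."""
    return score >= config["threshold"]
```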
Setup Guide
Step 1: Add the AI Judge Block
- Open your workflow in the editor
- Navigate to Actions blocks
- Drag the AI Judge block onto your canvas
- Position it after the AI generation block you want to evaluate
Step 2: Configure Credentials
- Select your AI provider from the dropdown
- Choose credentials from your workspace settings
- Select the evaluation model (GPT-4, Claude, Gemini recommended for complex evaluations)
Step 3: Write Evaluation Instructions
Craft clear, specific criteria:
```text
Evaluate this customer support response. It PASSES if:
1. It directly answers the customer's question
2. The tone is professional and empathetic
3. It provides actionable next steps
4. It does NOT contain pricing promises or deadlines
5. It does NOT reveal internal processes

It FAILS if:
- It's evasive or doesn't address the concern
- It uses aggressive or condescending language
- It contains factual errors
- It makes unauthorized commitments
```

Step 4: Connect Response to Evaluate
- Use a variable reference to the AI output: `{{llmAgent.response}}`
- Optionally include context like the original query: `{{user.question}}`
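Variable references like `{{llmAgent.response}}` are substituted before the judge runs. The substitution behaves roughly like simple template replacement over nested fields (an illustrative sketch, not the workflow engine's actual code):

```python
import re

def render(template, variables):
    """Replace {{path.to.value}} placeholders with values from a nested dict."""
    def lookup(match):
        value = variables
        for part in match.group(1).split("."):
            value = value[part]
        return str(value)
    return re.sub(r"\{\{\s*([\w.]+)\s*\}\}", lookup, template)

prompt = render(
    "Question: {{user.question}}\nAnswer to judge: {{llmAgent.response}}",
    {"user": {"question": "How do I reset my password?"},
     "llmAgent": {"response": "Click 'Forgot password' on the login page."}},
)
```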
Step 5: Configure Response Handling
- Add a Condition block after AI Judge
- Route based on the `is_followed` boolean output
- Handle failed responses (regenerate, escalate, or flag)
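The condition-plus-retry wiring amounts to a small loop: regenerate while the judge fails, then escalate. A sketch of that control flow, with stand-in functions for the generation and judge blocks (none of these names are the product's API):

```python
def run_with_quality_gate(generate, judge, max_attempts=3):
    """Regenerate until the judge passes or attempts run out.

    `generate(feedback)` and `judge(text)` stand in for the LLM and
    AI Judge blocks; `judge` returns a (passed, feedback) pair.
    """
    feedback = None
    for attempt in range(max_attempts):
        draft = generate(feedback)
        passed, feedback = judge(draft)
        if passed:
            return draft
    return None  # escalate to a human or flag for review

# Demo with stubs: the first draft fails the tone check, the second passes
drafts = iter(["Too curt.", "Happy to help -- here are the steps."])
result = run_with_quality_gate(
    generate=lambda fb: next(drafts),
    judge=lambda d: ("Happy" in d, "OK" if "Happy" in d else "Tone too curt"),
)
```

Feeding the previous `feedback` back into `generate` is what turns a blind retry into a guided one.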
Evaluation Criteria Examples
Customer Support Quality
```text
PASS criteria:
- Addresses the customer's specific question
- Uses a warm, professional tone
- Provides clear next steps or resolution
- Offers to help further if needed
- Grammar and spelling are correct

FAIL criteria:
- Ignores the customer's actual question
- Uses jargon the customer wouldn't understand
- Sounds robotic or dismissive
- Contains factual errors about our products
- Makes promises about timelines or outcomes
```

Content Moderation
```text
Evaluate for safety and appropriateness.

PASS if the content:
- Is family-friendly
- Contains no hate speech or discrimination
- Has no violent or disturbing imagery descriptions
- Respects privacy (no personal information)
- Is truthful and not misleading

FAIL if the content contains:
- Profanity or explicit language
- Harmful instructions or advice
- Harassment or bullying
- Misinformation or fake claims
- Copyright violations
```

Brand Voice Consistency
```text
Check if this content matches our brand voice.

Our brand is: Friendly, Expert, Approachable, Innovative

PASS if:
- Uses "we" and "you" for personal connection
- Explains complex topics simply
- Avoids corporate jargon
- Includes specific examples or data
- Maintains positive, solution-oriented tone

FAIL if:
- Sounds generic or template-like
- Uses passive voice excessively
- Contains buzzwords without substance
- Feels impersonal or distant
```

Factual Accuracy Check
```text
Cross-reference this response against the provided data.

Source data: {{context.sourceData}}

PASS if:
- All numbers and statistics match the source
- Dates and timeframes are accurate
- Product names and features are correct
- No information is fabricated

FAIL if:
- Any claim cannot be verified in the source
- Numbers don't match (even small discrepancies)
- Features are described that don't exist
- The response contradicts the source data
```

Output Variables
| Variable | Type | Description |
|---|---|---|
| is_followed | boolean | True if the response meets criteria |
| thinking | string | Detailed explanation of the evaluation |
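Downstream blocks read these two fields. If you ever need to consume the judge's raw result yourself, defensive parsing might look like this (a sketch, assuming the result arrives as JSON):

```python
import json

def parse_judge_output(raw):
    """Extract is_followed / thinking, tolerating a missing feedback field."""
    data = json.loads(raw)
    return {
        "is_followed": bool(data.get("is_followed", False)),
        "thinking": data.get("thinking", ""),
    }

out = parse_judge_output('{"is_followed": false, "thinking": "Misses the question."}')
```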
Workflow Integration Patterns
Simple Pass/Fail Gate
```text
Generate Response → AI Judge → Condition
                      ├── Pass → Send to Customer
                      └── Fail → Regenerate Response
```

Retry with Feedback
```text
Generate Response → AI Judge → Condition
                      ├── Pass → Continue
                      └── Fail → Regenerate with Feedback
                                   ↓
              Include: "Previous attempt failed because: {{feedback}}"
                                   ↓
                             AI Judge (Retry)
                                   ↓
                               Condition
                                   ├── Pass → Continue
                                   └── Fail → Human In The Loop
```

Multi-Stage Evaluation
```text
Response → Content Safety Judge → Brand Voice Judge → Factual Accuracy Judge → Approved
                 ↓ Fail                 ↓ Fail                ↓ Fail
               Reject             Edit & Retry          Flag for Review
```

Model Selection Guide
| Model | Best For | Speed | Cost |
|---|---|---|---|
| GPT-4o | Complex reasoning, nuanced criteria | Medium | High |
| GPT-3.5 Turbo | Simple checks, high volume | Fast | Low |
| Claude 3 Opus | Detailed analysis, long content | Medium | High |
| Claude 3 Haiku | Quick validations | Fast | Low |
| Gemini 1.5 Pro | Large context, document analysis | Medium | Medium |
| Groq (Llama 3) | Real-time applications | Very Fast | Low |
Tip: Use faster, cheaper models for simple checks (profanity, length) and reserve powerful models for nuanced evaluations (tone, accuracy).
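That tip can be encoded as a simple routing table: map each check type to the cheapest model that handles it, with a capable fallback. The model names come from the table above; the mapping itself is only an illustration:

```python
# Illustrative mapping of check complexity to model tier
MODEL_FOR_CHECK = {
    "profanity": "claude-3-haiku",    # simple, high volume -> fast and cheap
    "length": "gpt-3.5-turbo",
    "tone": "gpt-4o",                 # nuanced criteria -> capable model
    "factual_accuracy": "claude-3-opus",
}

def pick_model(check_type, default="gpt-4o"):
    """Fall back to a capable model for unrecognized check types."""
    return MODEL_FOR_CHECK.get(check_type, default)
```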
Best Practices
- Be Specific in Criteria - Vague instructions lead to inconsistent results
- Include Examples - Show what good and bad responses look like
- Test with Edge Cases - Validate with tricky inputs before deployment
- Use Appropriate Models - Match model capability to task complexity
- Set Reasonable Thresholds - Too strict means too many false failures
- Provide Context - Include original queries and relevant data
- Log Everything - Keep records for analysis and improvement
- Iterate on Criteria - Refine based on actual pass/fail patterns
- Combine with Human Review - Use HITL for edge cases
- Monitor Performance - Track pass rates and feedback quality
Error Handling
| Error | Cause | Solution |
|---|---|---|
| Model unavailable | Selected model not accessible | Choose a different model |
| Rate limit exceeded | Too many evaluation requests | Add delays or use batch processing |
| Timeout | Response took too long | Simplify criteria or use faster model |
| Parse error | Couldn't extract evaluation result | Check instruction format |
| Context too long | Input exceeds model limit | Truncate or summarize content |
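For the rate-limit row in particular, "add delays" usually means exponential backoff around the evaluation call. A generic sketch (the `RateLimitError` type is a placeholder for whatever your provider raises):

```python
import time

class RateLimitError(Exception):
    """Placeholder for your provider's rate-limit exception."""

def evaluate_with_backoff(evaluate, max_retries=4, base_delay=1.0):
    """Retry the judge call with exponentially growing delays."""
    for attempt in range(max_retries):
        try:
            return evaluate()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))

# Demo: fail twice, then succeed (base_delay shrunk so the demo runs fast)
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RateLimitError()
    return {"is_followed": True}

result = evaluate_with_backoff(flaky, base_delay=0.001)
```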
Troubleshooting
Inconsistent Results:
- Make criteria more specific and objective
- Add examples of pass/fail responses
- Consider using a more capable model
Too Many False Failures:
- Relax overly strict criteria
- Add exception cases to instructions
- Lower the score threshold
Too Many False Passes:
- Strengthen criteria requirements
- Add specific failure conditions
- Include edge case examples
Slow Evaluations:
- Use a faster model for simple checks
- Reduce context size
- Consider parallel evaluations
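Parallel evaluations can be sketched with a thread pool, since judge calls are I/O-bound; a stub judge stands in for the real model call here:

```python
from concurrent.futures import ThreadPoolExecutor

def judge(text):
    """Stub judge: passes anything longer than 20 characters."""
    return {"text": text, "is_followed": len(text) > 20}

responses = [
    "Too short.",
    "Thanks for reaching out -- here is how to fix it.",
    "Please see the attached step-by-step guide below.",
]

# Evaluate all responses concurrently instead of one at a time;
# pool.map preserves the input order of results.
with ThreadPoolExecutor(max_workers=4) as pool:
    verdicts = list(pool.map(judge, responses))

failed = [v["text"] for v in verdicts if not v["is_followed"]]
```

Keep the pool size below your provider's rate limit, or combine this with the backoff pattern above.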
Pro Tip: Chain multiple AI Judge blocks for different quality dimensions. Fail fast on critical criteria (safety) before checking nice-to-have criteria (style).