AI Judge Block
The AI Judge block uses AI to evaluate whether generated content meets your quality standards. It acts as an automated quality gate, validating AI outputs against specified criteria before they proceed in your workflow.
Quality at Scale: AI Judge enables you to maintain consistent quality standards across thousands of AI-generated outputs without manual review of every response.
Key Features
- Automated Quality Control - Evaluate AI responses against custom criteria
- Multi-Model Support - Works with OpenAI, Gemini, Groq, and other providers
- Custom Evaluation Rules - Define exactly what makes a response acceptable
- Confidence Scoring - Get quantified confidence in pass/fail decisions
- Detailed Feedback - Receive explanations for why content passed or failed
- Multi-Criteria Assessment - Check multiple quality dimensions simultaneously
- Conditional Routing - Branch workflows based on evaluation results
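Conceptually, an AI judge is a single LLM call that receives your criteria plus the content, and returns a structured verdict. A minimal, provider-agnostic sketch of that pattern (the `call_model` hook and field names here are illustrative, not the block's actual API):

```python
import json

def ai_judge(call_model, instructions, response_to_judge, context=None):
    """Ask an evaluation model for a structured pass/fail verdict.

    `call_model` is any function that takes a prompt string and returns
    the model's text output (hypothetical -- wire in your provider here).
    """
    prompt = (
        "You are a strict quality judge.\n"
        f"Criteria:\n{instructions}\n\n"
        f"Content to evaluate:\n{response_to_judge}\n\n"
        + (f"Context:\n{json.dumps(context)}\n\n" if context else "")
        + 'Reply with JSON: {"is_followed": true or false, "thinking": "..."}'
    )
    raw = call_model(prompt)
    verdict = json.loads(raw)
    return bool(verdict["is_followed"]), verdict.get("thinking", "")

# Stubbed model call for demonstration only
def fake_model(prompt):
    return '{"is_followed": true, "thinking": "Polite, answers the question."}'

passed, feedback = ai_judge(fake_model, "Tone must be professional.", "Happy to help!")
```

Requesting JSON output keeps the verdict machine-readable, which is what lets a downstream Condition block branch on it.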
When to Use AI Judge
| Use Case | Description |
|---|---|
| Customer Support QA | Validate response accuracy and tone before sending |
| Content Moderation | Check for inappropriate, harmful, or off-brand content |
| Compliance Checking | Ensure responses meet regulatory requirements |
| Brand Voice Validation | Verify content aligns with brand guidelines |
| Factual Accuracy | Cross-check AI claims against source data |
| Response Completeness | Confirm all required elements are present |
| Hallucination Detection | Flag responses that may contain made-up information |
Configuration
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| Model | Select | Yes | gpt-4 | The evaluation model to use |
| Instructions | Text | Yes | - | Criteria for judging the response |
| Response to Judge | Text | Yes | - | The content to evaluate (usually a variable) |
| Context Data | Object | No | - | Additional context for evaluation |
| Scoring Mode | Select | No | Pass/Fail | Pass/Fail or Numeric Score (1-10) |
| Threshold | Number | No | 7 | Minimum score to pass (for numeric mode) |
| Require Feedback | Boolean | No | true | Generate detailed feedback |
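The same parameters might look like this in an exported workflow definition, together with the threshold rule for numeric mode (field names are illustrative and mirror the table above, not a guaranteed export format):

```python
# Hypothetical exported AI Judge block settings (names mirror the table above)
ai_judge_config = {
    "model": "gpt-4",                     # default evaluation model
    "instructions": "PASS if the reply answers the question politely.",
    "response_to_judge": "{{llmAgent.response}}",
    "context_data": {"question": "{{user.question}}"},  # optional
    "scoring_mode": "numeric",            # "pass_fail" or "numeric"
    "threshold": 7,                       # minimum 1-10 score to pass (numeric mode)
    "require_feedback": True,
}

def passes(score, config):
    """In numeric mode, a score passes when it meets the threshold."""
    return score >= config["threshold"]
```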
Setup Guide
Step 1: Add the AI Judge Block
- Open your workflow in the editor
- Navigate to Actions blocks
- Drag the AI Judge block onto your canvas
- Position it after the AI generation block you want to evaluate
Step 2: Configure Credentials
- Select your AI provider from the dropdown
- Choose credentials from your workspace settings
- Select the evaluation model (GPT-4, Claude, Gemini recommended for complex evaluations)
Step 3: Write Evaluation Instructions
Craft clear, specific criteria:
```text
Evaluate this customer support response. It PASSES if:
1. It directly answers the customer's question
2. The tone is professional and empathetic
3. It provides actionable next steps
4. It does NOT contain pricing promises or deadlines
5. It does NOT reveal internal processes

It FAILS if:
- It's evasive or doesn't address the concern
- It uses aggressive or condescending language
- It contains factual errors
- It makes unauthorized commitments
```

Step 4: Connect Response to Evaluate
- Use a variable reference to the AI output: `{{llmAgent.response}}`
- Optionally include context like the original query: `{{user.question}}`
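Variable references like `{{llmAgent.response}}` are substituted before the judge runs. The substitution behaves roughly like simple template replacement over nested fields (an illustrative sketch, not the workflow engine's actual code):

```python
import re

def render(template, variables):
    """Replace {{path.to.value}} placeholders with values from a nested dict."""
    def lookup(match):
        value = variables
        for part in match.group(1).split("."):
            value = value[part]
        return str(value)
    return re.sub(r"\{\{\s*([\w.]+)\s*\}\}", lookup, template)

prompt = render(
    "Question: {{user.question}}\nAnswer to judge: {{llmAgent.response}}",
    {"user": {"question": "How do I reset my password?"},
     "llmAgent": {"response": "Click 'Forgot password' on the login page."}},
)
```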
Step 5: Configure Response Handling
- Add a Condition block after AI Judge
- Route based on the `is_followed` boolean output
- Handle failed responses (regenerate, escalate, or flag)
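The condition-plus-retry wiring amounts to a small loop: regenerate while the judge fails, then escalate. A sketch of that control flow, with stand-in functions for the generation and judge blocks (none of these names are the product's API):

```python
def run_with_quality_gate(generate, judge, max_attempts=3):
    """Regenerate until the judge passes or attempts run out.

    `generate(feedback)` and `judge(text)` stand in for the LLM and
    AI Judge blocks; `judge` returns a (passed, feedback) pair.
    """
    feedback = None
    for attempt in range(max_attempts):
        draft = generate(feedback)
        passed, feedback = judge(draft)
        if passed:
            return draft
    return None  # escalate to a human or flag for review

# Demo with stubs: the first draft fails the tone check, the second passes
drafts = iter(["Too curt.", "Happy to help -- here are the steps."])
result = run_with_quality_gate(
    generate=lambda fb: next(drafts),
    judge=lambda d: ("Happy" in d, "OK" if "Happy" in d else "Tone too curt"),
)
```

Feeding the previous `feedback` back into `generate` is what turns a blind retry into a guided one.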
Evaluation Criteria Examples
Customer Support Quality
```text
PASS criteria:
- Addresses the customer's specific question
- Uses a warm, professional tone
- Provides clear next steps or resolution
- Offers to help further if needed
- Grammar and spelling are correct

FAIL criteria:
- Ignores the customer's actual question
- Uses jargon the customer wouldn't understand
- Sounds robotic or dismissive
- Contains factual errors about our products
- Makes promises about timelines or outcomes
```

Content Moderation
```text
Evaluate for safety and appropriateness.

PASS if the content:
- Is family-friendly
- Contains no hate speech or discrimination
- Has no violent or disturbing imagery descriptions
- Respects privacy (no personal information)
- Is truthful and not misleading

FAIL if the content contains:
- Profanity or explicit language
- Harmful instructions or advice
- Harassment or bullying
- Misinformation or fake claims
- Copyright violations
```

Brand Voice Consistency
```text
Check if this content matches our brand voice.

Our brand is: Friendly, Expert, Approachable, Innovative

PASS if:
- Uses "we" and "you" for personal connection
- Explains complex topics simply
- Avoids corporate jargon
- Includes specific examples or data
- Maintains positive, solution-oriented tone

FAIL if:
- Sounds generic or template-like
- Uses passive voice excessively
- Contains buzzwords without substance
- Feels impersonal or distant
```

Factual Accuracy Check
```text
Cross-reference this response against the provided data.

Source data: {{context.sourceData}}

PASS if:
- All numbers and statistics match the source
- Dates and timeframes are accurate
- Product names and features are correct
- No information is fabricated

FAIL if:
- Any claim cannot be verified in the source
- Numbers don't match (even small discrepancies)
- Features are described that don't exist
- The response contradicts the source data
```

Output Variables
| Variable | Type | Description |
|---|---|---|
| is_followed | boolean | True if the response meets criteria |
| thinking | string | Detailed explanation of the evaluation |
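Downstream blocks read these two fields. If you ever need to consume the judge's raw result yourself, defensive parsing might look like this (a sketch, assuming the result arrives as JSON):

```python
import json

def parse_judge_output(raw):
    """Extract is_followed / thinking, tolerating a missing feedback field."""
    data = json.loads(raw)
    return {
        "is_followed": bool(data.get("is_followed", False)),
        "thinking": data.get("thinking", ""),
    }

out = parse_judge_output('{"is_followed": false, "thinking": "Misses the question."}')
```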
Workflow Integration Patterns
Simple Pass/Fail Gate
```text
Generate Response → AI Judge → Condition
                      ├── Pass → Send to Customer
                      └── Fail → Regenerate Response
```

Retry with Feedback
```text
Generate Response → AI Judge → Condition
                      ├── Pass → Continue
                      └── Fail → Regenerate with Feedback
                                   ↓
              Include: "Previous attempt failed because: {{feedback}}"
                                   ↓
                             AI Judge (Retry)
                                   ↓
                               Condition
                                   ├── Pass → Continue
                                   └── Fail → Human In The Loop
```

Multi-Stage Evaluation
```text
Response → Content Safety Judge → Brand Voice Judge → Factual Accuracy Judge → Approved
                 ↓ Fail                 ↓ Fail                ↓ Fail
               Reject             Edit & Retry          Flag for Review
```

Model Selection Guide
| Model | Best For | Speed | Cost |
|---|---|---|---|
| GPT-4o | Complex reasoning, nuanced criteria | Medium | High |
| GPT-3.5 Turbo | Simple checks, high volume | Fast | Low |
| Claude 3 Opus | Detailed analysis, long content | Medium | High |
| Claude 3 Haiku | Quick validations | Fast | Low |
| Gemini 1.5 Pro | Large context, document analysis | Medium | Medium |
| Groq (Llama 3) | Real-time applications | Very Fast | Low |
Tip: Use faster, cheaper models for simple checks (profanity, length) and reserve powerful models for nuanced evaluations (tone, accuracy).
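That tip can be encoded as a simple routing table: map each check type to the cheapest model that handles it, with a capable fallback. The model names come from the table above; the mapping itself is only an illustration:

```python
# Illustrative mapping of check complexity to model tier
MODEL_FOR_CHECK = {
    "profanity": "claude-3-haiku",    # simple, high volume -> fast and cheap
    "length": "gpt-3.5-turbo",
    "tone": "gpt-4o",                 # nuanced criteria -> capable model
    "factual_accuracy": "claude-3-opus",
}

def pick_model(check_type, default="gpt-4o"):
    """Fall back to a capable model for unrecognized check types."""
    return MODEL_FOR_CHECK.get(check_type, default)
```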
Best Practices
- Be Specific in Criteria - Vague instructions lead to inconsistent results
- Include Examples - Show what good and bad responses look like
- Test with Edge Cases - Validate with tricky inputs before deployment
- Use Appropriate Models - Match model capability to task complexity
- Set Reasonable Thresholds - Too strict means too many false failures
- Provide Context - Include original queries and relevant data
- Log Everything - Keep records for analysis and improvement
- Iterate on Criteria - Refine based on actual pass/fail patterns
- Combine with Human Review - Use HITL for edge cases
- Monitor Performance - Track pass rates and feedback quality
Error Handling
| Error | Cause | Solution |
|---|---|---|
| Model unavailable | Selected model not accessible | Choose a different model |
| Rate limit exceeded | Too many evaluation requests | Add delays or use batch processing |
| Timeout | Response took too long | Simplify criteria or use faster model |
| Parse error | Couldn't extract evaluation result | Check instruction format |
| Context too long | Input exceeds model limit | Truncate or summarize content |
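For the rate-limit row in particular, "add delays" usually means exponential backoff around the evaluation call. A generic sketch (the `RateLimitError` type is a placeholder for whatever your provider raises):

```python
import time

class RateLimitError(Exception):
    """Placeholder for your provider's rate-limit exception."""

def evaluate_with_backoff(evaluate, max_retries=4, base_delay=1.0):
    """Retry the judge call with exponentially growing delays."""
    for attempt in range(max_retries):
        try:
            return evaluate()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))

# Demo: fail twice, then succeed (base_delay shrunk so the demo runs fast)
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RateLimitError()
    return {"is_followed": True}

result = evaluate_with_backoff(flaky, base_delay=0.001)
```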
Troubleshooting
Inconsistent Results:
- Make criteria more specific and objective
- Add examples of pass/fail responses
- Consider using a more capable model
Too Many False Failures:
- Relax overly strict criteria
- Add exception cases to instructions
- Lower the score threshold
Too Many False Passes:
- Strengthen criteria requirements
- Add specific failure conditions
- Include edge case examples
Slow Evaluations:
- Use a faster model for simple checks
- Reduce context size
- Consider parallel evaluations
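Parallel evaluations can be sketched with a thread pool, since judge calls are I/O-bound; a stub judge stands in for the real model call here:

```python
from concurrent.futures import ThreadPoolExecutor

def judge(text):
    """Stub judge: passes anything longer than 20 characters."""
    return {"text": text, "is_followed": len(text) > 20}

responses = [
    "Too short.",
    "Thanks for reaching out -- here is how to fix it.",
    "Please see the attached step-by-step guide below.",
]

# Evaluate all responses concurrently instead of one at a time;
# pool.map preserves the input order of results.
with ThreadPoolExecutor(max_workers=4) as pool:
    verdicts = list(pool.map(judge, responses))

failed = [v["text"] for v in verdicts if not v["is_followed"]]
```

Keep the pool size below your provider's rate limit, or combine this with the backoff pattern above.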
Pro Tip: Chain multiple AI Judge blocks for different quality dimensions. Fail fast on critical criteria (safety) before checking nice-to-have criteria (style).