
AI Judge Block

The AI Judge block uses AI to evaluate whether generated content meets your quality standards. It acts as an automated quality gate, validating AI outputs against specified criteria before they proceed in your workflow.

Quality at Scale: AI Judge enables you to maintain consistent quality standards across thousands of AI-generated outputs without manual review of every response.

Key Features

  • Automated Quality Control - Evaluate AI responses against custom criteria
  • Multi-Model Support - Works with OpenAI, Gemini, Groq, and other providers
  • Custom Evaluation Rules - Define exactly what makes a response acceptable
  • Confidence Scoring - Get quantified confidence in pass/fail decisions
  • Detailed Feedback - Receive explanations for why content passed or failed
  • Multi-Criteria Assessment - Check multiple quality dimensions simultaneously
  • Conditional Routing - Branch workflows based on evaluation results

When to Use AI Judge

| Use Case | Description |
| --- | --- |
| Customer Support QA | Validate response accuracy and tone before sending |
| Content Moderation | Check for inappropriate, harmful, or off-brand content |
| Compliance Checking | Ensure responses meet regulatory requirements |
| Brand Voice Validation | Verify content aligns with brand guidelines |
| Factual Accuracy | Cross-check AI claims against source data |
| Response Completeness | Confirm all required elements are present |
| Hallucination Detection | Flag responses that may contain made-up information |

Configuration

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| Model | Select | Yes | gpt-4 | The evaluation model to use |
| Instructions | Text | Yes | - | Criteria for judging the response |
| Response to Judge | Text | Yes | - | The content to evaluate (usually a variable) |
| Context Data | Object | No | - | Additional context for evaluation |
| Scoring Mode | Select | No | Pass/Fail | Pass/Fail or Numeric Score (1-10) |
| Threshold | Number | No | 7 | Minimum score to pass (numeric mode only) |
| Require Feedback | Boolean | No | true | Generate detailed feedback |
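To make the parameters concrete, the configuration above could be expressed as a dictionary like the following. This is only an illustrative sketch; the field names are assumptions, not the block's actual schema.

```python
# Hypothetical AI Judge block configuration (field names are illustrative).
ai_judge_config = {
    "model": "gpt-4",                # evaluation model
    "instructions": (
        "Evaluate this customer support response. It PASSES if it directly "
        "answers the customer's question and the tone is professional."
    ),
    "response_to_judge": "{{llmAgent.response}}",   # variable reference
    "context_data": {"question": "{{user.question}}"},
    "scoring_mode": "pass_fail",     # or "numeric" for a 1-10 score
    "threshold": 7,                  # only used in numeric mode
    "require_feedback": True,
}
```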

Setup Guide

Step 1: Add the AI Judge Block

  1. Open your workflow in the editor
  2. Navigate to Actions blocks
  3. Drag the AI Judge block onto your canvas
  4. Position it after the AI generation block you want to evaluate

Step 2: Configure Credentials

  1. Select your AI provider from the dropdown
  2. Choose credentials from your workspace settings
  3. Select the evaluation model (GPT-4, Claude, or Gemini are recommended for complex evaluations)

Step 3: Write Evaluation Instructions

Craft clear, specific criteria:

Evaluate this customer support response. It PASSES if:
1. It directly answers the customer's question
2. The tone is professional and empathetic
3. It provides actionable next steps
4. It does NOT contain pricing promises or deadlines
5. It does NOT reveal internal processes

It FAILS if:
- It's evasive or doesn't address the concern
- It uses aggressive or condescending language
- It contains factual errors
- It makes unauthorized commitments
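The block assembles criteria like these into an evaluation prompt internally; as a rough sketch of the idea (the function and prompt layout here are illustrative, not the block's actual behavior):

```python
def build_judge_prompt(criteria: str, response: str, context: str = "") -> str:
    """Combine evaluation criteria, optional context, and the response
    to judge into a single prompt for the evaluation model."""
    parts = [criteria]
    if context:
        parts.append(f"Context:\n{context}")
    parts.append(f"Response to evaluate:\n{response}")
    parts.append("Answer PASS or FAIL, then explain your reasoning.")
    return "\n\n".join(parts)
```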

Step 4: Connect Response to Evaluate

  1. Use a variable reference to the AI output: {{llmAgent.response}}
  2. Optionally include context like the original query: {{user.question}}
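The {{...}} references are resolved before the block runs. A minimal sketch of that substitution, assuming dot-separated paths into a nested variable store (the real engine's behavior may differ):

```python
import re

def interpolate(template: str, variables: dict) -> str:
    """Replace {{path.to.value}} references with values from a nested dict."""
    def resolve(match):
        value = variables
        for key in match.group(1).strip().split("."):
            value = value[key]
        return str(value)
    return re.sub(r"\{\{(.*?)\}\}", resolve, template)
```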

Step 5: Configure Response Handling

  1. Add a Condition block after AI Judge
  2. Route based on the judge's boolean output (aiJudge.is_followed)
  3. Handle failed responses (regenerate, escalate, or flag)
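The condition step reduces to a branch on the judge's boolean output. A sketch, assuming the `is_followed` variable from the Output Variables section and hypothetical step names:

```python
def route(judge_result: dict) -> str:
    """Decide the next workflow step from an AI Judge result.
    Assumes the judge exposes an `is_followed` boolean output;
    step names here are illustrative."""
    if judge_result.get("is_followed"):
        return "send_to_customer"
    return "regenerate_response"
```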

Evaluation Criteria Examples

Customer Support Quality

PASS criteria:
- Addresses the customer's specific question
- Uses a warm, professional tone
- Provides clear next steps or resolution
- Offers to help further if needed
- Grammar and spelling are correct

FAIL criteria:
- Ignores the customer's actual question
- Uses jargon the customer wouldn't understand
- Sounds robotic or dismissive
- Contains factual errors about our products
- Makes promises about timelines or outcomes

Content Moderation

Evaluate for safety and appropriateness.

PASS if the content:
- Is family-friendly
- Contains no hate speech or discrimination
- Has no violent or disturbing imagery descriptions
- Respects privacy (no personal information)
- Is truthful and not misleading

FAIL if the content contains:
- Profanity or explicit language
- Harmful instructions or advice
- Harassment or bullying
- Misinformation or fake claims
- Copyright violations

Brand Voice Consistency

Check if this content matches our brand voice.

Our brand is: Friendly, Expert, Approachable, Innovative

PASS if:
- Uses "we" and "you" for personal connection
- Explains complex topics simply
- Avoids corporate jargon
- Includes specific examples or data
- Maintains positive, solution-oriented tone

FAIL if:
- Sounds generic or template-like
- Uses passive voice excessively
- Contains buzzwords without substance
- Feels impersonal or distant

Factual Accuracy Check

Cross-reference this response against the provided data.

Source data: {{context.sourceData}}

PASS if:
- All numbers and statistics match the source
- Dates and timeframes are accurate
- Product names and features are correct
- No information is fabricated

FAIL if:
- Any claim cannot be verified in the source
- Numbers don't match (even small discrepancies)
- Features are described that don't exist
- The response contradicts the source data

Output Variables

| Variable | Type | Description |
| --- | --- | --- |
| is_followed | boolean | True if the response meets the criteria |
| thinking | string | Detailed explanation of the evaluation |
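Downstream steps can wrap these outputs in a small typed structure; a sketch (the raw output shape is assumed to be a dict with the keys above):

```python
from dataclasses import dataclass

@dataclass
class JudgeResult:
    """Typed view of the AI Judge block's outputs."""
    is_followed: bool   # True if the response meets the criteria
    thinking: str       # detailed explanation of the evaluation

def parse_judge_output(raw: dict) -> JudgeResult:
    """Coerce a raw output dict into a JudgeResult, with safe defaults."""
    return JudgeResult(
        is_followed=bool(raw.get("is_followed", False)),
        thinking=str(raw.get("thinking", "")),
    )
```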

Workflow Integration Patterns

Simple Pass/Fail Gate

Generate Response → AI Judge → Condition
                                  ├── Pass → Send to Customer
                                  └── Fail → Regenerate Response

Retry with Feedback

Generate Response → AI Judge → Condition
                                  ├── Pass → Continue
                                  └── Fail → Regenerate with Feedback
                                               (include: "Previous attempt failed because: {{feedback}}")
                                             → AI Judge (Retry) → Condition
                                                                    ├── Pass → Continue
                                                                    └── Fail → Human In The Loop
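The retry pattern above can be sketched as a loop that feeds the judge's explanation back into the next generation attempt. `generate` and `judge` stand in for the actual workflow blocks:

```python
def retry_with_feedback(generate, judge, prompt, max_attempts=2):
    """Regenerate a response until the judge passes it, feeding the
    judge's explanation back into the next attempt. Escalates to a
    human after max_attempts failures."""
    feedback = ""
    for _ in range(max_attempts):
        response = generate(prompt, feedback)
        result = judge(response)  # expects {"is_followed": bool, "thinking": str}
        if result["is_followed"]:
            return {"status": "passed", "response": response}
        feedback = f"Previous attempt failed because: {result['thinking']}"
    return {"status": "human_in_the_loop", "response": response}
```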

Multi-Stage Evaluation

Response → Content Safety Judge → Brand Voice Judge → Factual Accuracy Judge → Approved
              ↓ Fail                  ↓ Fail               ↓ Fail
           Reject               Edit & Retry        Flag for Review
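The fail-fast chain above can be sketched as an ordered list of judges, stopping at the first failure. Stage names and the result shape are illustrative:

```python
def multi_stage_judge(response, stages):
    """Run judges in order (e.g. safety, brand voice, factual accuracy)
    and stop at the first failure. Each stage is a (name, judge_fn) pair
    where judge_fn returns {"is_followed": bool, "thinking": str}."""
    for name, judge_fn in stages:
        result = judge_fn(response)
        if not result["is_followed"]:
            return {"approved": False, "failed_stage": name,
                    "thinking": result["thinking"]}
    return {"approved": True, "failed_stage": None, "thinking": ""}
```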

Model Selection Guide

| Model | Best For | Speed | Cost |
| --- | --- | --- | --- |
| GPT-4o | Complex reasoning, nuanced criteria | Medium | High |
| GPT-3.5 Turbo | Simple checks, high volume | Fast | Low |
| Claude 3 Opus | Detailed analysis, long content | Medium | High |
| Claude 3 Haiku | Quick validations | Fast | Low |
| Gemini 1.5 Pro | Large context, document analysis | Medium | Medium |
| Groq (Llama 3) | Real-time applications | Very Fast | Low |

Tip: Use faster, cheaper models for simple checks (profanity, length) and reserve powerful models for nuanced evaluations (tone, accuracy).
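The tip above amounts to a small routing table. A sketch, using model names from the table; the check categories are illustrative assumptions:

```python
# Illustrative mapping of check complexity to evaluation model.
SIMPLE_CHECKS = {"profanity", "length", "format"}

def pick_judge_model(check: str) -> str:
    """Route cheap, mechanical checks to a fast model and reserve
    a more capable model for nuanced evaluation (tone, accuracy)."""
    return "gpt-3.5-turbo" if check in SIMPLE_CHECKS else "gpt-4o"
```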

Best Practices

  1. Be Specific in Criteria - Vague instructions lead to inconsistent results
  2. Include Examples - Show what good and bad responses look like
  3. Test with Edge Cases - Validate with tricky inputs before deployment
  4. Use Appropriate Models - Match model capability to task complexity
  5. Set Reasonable Thresholds - Too strict means too many false failures
  6. Provide Context - Include original queries and relevant data
  7. Log Everything - Keep records for analysis and improvement
  8. Iterate on Criteria - Refine based on actual pass/fail patterns
  9. Combine with Human Review - Use HITL for edge cases
  10. Monitor Performance - Track pass rates and feedback quality

Error Handling

| Error | Cause | Solution |
| --- | --- | --- |
| Model unavailable | Selected model not accessible | Choose a different model |
| Rate limit exceeded | Too many evaluation requests | Add delays or use batch processing |
| Timeout | Response took too long | Simplify criteria or use a faster model |
| Parse error | Couldn't extract evaluation result | Check instruction format |
| Context too long | Input exceeds model limit | Truncate or summarize content |
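Two of the mitigations above, backing off after rate limits and truncating over-long input, can be sketched as a wrapper around the judge call. The judge function, error type, and limits here are placeholders:

```python
import time

def judge_with_retries(judge_fn, text, max_chars=8000, retries=3):
    """Call a judge with basic resilience: truncate over-long input
    and retry transient errors with exponential backoff.
    `judge_fn` and RuntimeError are stand-ins for the real call and
    the provider's rate-limit error."""
    text = text[:max_chars]  # guard against "context too long" errors
    delay = 1.0
    for attempt in range(retries):
        try:
            return judge_fn(text)
        except RuntimeError:
            if attempt == retries - 1:
                raise                 # out of retries; surface the error
            time.sleep(delay)
            delay *= 2                # exponential backoff
```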

Troubleshooting

Inconsistent Results:

  • Make criteria more specific and objective
  • Add examples of pass/fail responses
  • Consider using a more capable model

Too Many False Failures:

  • Relax overly strict criteria
  • Add exception cases to instructions
  • Lower the score threshold

Too Many False Passes:

  • Strengthen criteria requirements
  • Add specific failure conditions
  • Include edge case examples

Slow Evaluations:

  • Use a faster model for simple checks
  • Reduce context size
  • Consider parallel evaluations

Pro Tip: Chain multiple AI Judge blocks for different quality dimensions. Fail fast on critical criteria (safety) before checking nice-to-have criteria (style).

Indite Documentation v1.4.0