Groq Integration
What it does: Access ultra-fast AI inference using Groq's LPU (Language Processing Unit) technology for lightning-speed chat completions.
In simple terms: Groq is like having a supercharged AI engine. It runs the same models as other providers but at incredibly fast speeds, perfect for real-time applications.
When to Use This
Use Groq when you need:
- ✅ Ultra-fast response times (up to 10x faster than traditional GPUs)
- ✅ Real-time AI conversations without lag
- ✅ Cost-effective inference at scale
- ✅ Support for popular open-source models
- ✅ Low-latency applications (chat, voice, live support)
Example: Build a real-time customer support chatbot that responds instantly, or create live AI-powered voice assistants.
Key Features
- Blazing Fast: Industry-leading inference speed
- Open Source Models: Llama, Mixtral, Gemma, and more
- Low Latency: Perfect for real-time applications
- Cost-Effective: Competitive pricing for fast inference
- OpenAI-Compatible API: Easy migration from OpenAI
- High Throughput: Handle many requests simultaneously
Setup Guide
Step 1: Get Groq API Key
- Go to console.groq.com and sign up
- Navigate to API Keys section
- Click "Create API Key"
- Give it a descriptive name
- Copy and save your API key securely
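Rather than pasting the key directly into configuration, it is safer to read it from an environment variable. A minimal sketch (the variable name `GROQ_API_KEY` is a common convention, not a requirement of the block):

```python
import os

def get_groq_api_key() -> str:
    """Read the Groq API key from the environment instead of hardcoding it."""
    key = os.environ.get("GROQ_API_KEY")
    if not key:
        raise RuntimeError("GROQ_API_KEY is not set; create one at console.groq.com")
    return key
```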
Step 2: Configure the Block
Connection Settings:
- Credentials: Select your Groq credentials from the dropdown or create new ones
  - API Key: Your Groq API key
  - The system will automatically fetch available models
- Model: Select the AI model you want to use
  - llama-3.3-70b-versatile: Latest Llama 3.3 model, great for general tasks
  - llama-3.1-70b-versatile: Previous generation, still very capable
  - mixtral-8x7b-32768: Excellent for complex reasoning
  - gemma-7b-it: Compact model for faster responses
- Messages: Configure the conversation
  - System Message: Instructions that guide the AI's behavior
  - User Messages: Add user queries and context
  - Dialogue: Use conversation history from variables
- Temperature: Control response randomness (0-2)
  - 0: Deterministic, focused responses
  - 1: Balanced creativity (default)
  - 2: Maximum creativity and randomness
- Response Mapping: Save AI responses to variables
  - Map "Message content" to workflow variables
  - Access total tokens used
  - Store response for later use
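Under the hood, this configuration maps onto a standard OpenAI-style chat completion request. A minimal sketch of what the block sends, assuming Groq's documented OpenAI-compatible endpoint (`/openai/v1/chat/completions`); the helper names are illustrative, not part of the block:

```python
import json
import urllib.request

GROQ_URL = "https://api.groq.com/openai/v1/chat/completions"

def build_payload(model, system_message, user_message, temperature=1.0):
    """Assemble the OpenAI-style chat payload described in the settings above."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system_message},
            {"role": "user", "content": user_message},
        ],
        "temperature": temperature,
    }

def send(payload, api_key):
    """POST the payload to Groq (requires a valid API key and network access)."""
    req = urllib.request.Request(
        GROQ_URL,
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```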
Available Models
Llama 3 Family
llama-3.3-70b-versatile (Recommended):
- Latest and most capable Llama model
- Excellent reasoning and understanding
- Fast inference on Groq's LPU
- Use for: General chatbots, Q&A, content generation
llama-3.1-70b-versatile:
- Previous generation, proven reliability
- Great balance of speed and capability
- Use for: Most general-purpose tasks
llama-3.1-8b-instant:
- Smaller, faster variant
- Lower latency for real-time needs
- Use for: Quick responses, simple queries
Mixtral
mixtral-8x7b-32768:
- Mixture of Experts architecture
- Excellent for complex reasoning
- Large context window (32K tokens)
- Use for: Technical questions, code generation, analysis
Gemma
gemma-7b-it:
- Google's open model
- Compact and efficient
- Use for: Resource-conscious applications
Message Configuration
System Message
Define how the AI should behave:
You are a helpful AI assistant. Provide clear, accurate, and concise responses.
Focus on being direct and informative.

User Messages
Add the user's query:
- Use workflow variables: {{user_question}}
- Combine context: Answer this based on {{context}}: {{question}}
- Multi-turn: Previous answer: {{last_response}}. Follow-up: {{new_question}}
Dialogue History
Reference conversation history:
- Select a dialogue variable storing past messages
- Maintains conversation context
- Enables natural multi-turn conversations
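The `{{variable}}` substitution above can be sketched as a simple template renderer. A minimal illustration, assuming double-brace placeholders as shown in the examples (the actual workflow engine's implementation may differ):

```python
import re

def render_template(template: str, variables: dict) -> str:
    """Replace {{name}} placeholders with workflow variable values."""
    def substitute(match):
        name = match.group(1).strip()
        if name not in variables:
            raise KeyError(f"Undefined workflow variable: {name}")
        return str(variables[name])
    return re.sub(r"\{\{(.*?)\}\}", substitute, template)
```

For example, `render_template("Answer this based on {{context}}: {{question}}", vars)` fills both placeholders before the message is sent to the model.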
Common Use Cases
Real-Time Chatbot
Ultra-responsive conversational AI:
- Model: llama-3.3-70b-versatile
- System Message: You are a friendly chatbot. Respond quickly and naturally to user messages. Keep responses concise but helpful.
- User Message: {{user_input}}
- Temperature: 0.7
- Why Groq: Lightning-fast responses create smooth conversations
Live Customer Support
Instant support responses:
- Model: llama-3.1-8b-instant (for maximum speed)
- System Message: You are a customer support agent. Answer questions about our products and services. Be helpful, professional, and efficient. Product info: {{product_knowledge}}
- User Message: {{customer_question}}
- Temperature: 0.4
- Why Groq: No lag between customer question and AI response
Code Assistant
Fast programming help:
- Model: mixtral-8x7b-32768
- System Message: You are a coding assistant. Provide clear code examples with brief explanations. Focus on best practices and working solutions.
- User Message: {{code_question}}
- Temperature: 0.2
- Why Groq: Rapid code generation and explanations
Content Summarization
Quick document summaries:
- Model: llama-3.3-70b-versatile
- System Message: You are a summarization expert. Extract key points and create concise summaries. Maintain the main ideas while being brief.
- User Message: Summarize this: {{document}}
- Temperature: 0.3
- Why Groq: Process long documents quickly
Voice Assistant Backend
Power voice-enabled AI:
- Model: llama-3.1-8b-instant
- System Message: You are a voice assistant. Provide brief, natural-sounding responses optimized for text-to-speech. Avoid long lists or complex formatting.
- User Message: {{transcribed_speech}}
- Temperature: 0.6
- Why Groq: Minimal latency critical for voice interactions
Advanced Features
Stream Responses
Enable real-time streaming for even better UX:
- Responses appear word-by-word as generated
- User sees output immediately
- Perfect for chat interfaces
- Reduces perceived latency
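Streamed responses arrive as OpenAI-style server-sent events, one `data:` line per chunk with the new text in `choices[0].delta.content`. A minimal sketch of reassembling the text from such a stream (assuming the standard OpenAI-compatible SSE format Groq uses):

```python
import json

def extract_stream_text(sse_lines):
    """Reassemble incremental text from OpenAI-style server-sent-event lines."""
    parts = []
    for line in sse_lines:
        line = line.strip()
        if not line.startswith("data: "):
            continue  # skip keep-alives and blank separators
        data = line[len("data: "):]
        if data == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        delta = chunk["choices"][0].get("delta", {})
        if "content" in delta:
            parts.append(delta["content"])
    return "".join(parts)
```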
Conversation Context
Maintain Context:
- Store conversation in dialogue variable
- Include relevant history in each request
- Update history after each exchange
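The update step above can be sketched as appending each exchange to a message list and trimming old turns so the context stays small (the trimming policy here is illustrative):

```python
def append_exchange(history, user_text, assistant_text, max_messages=20):
    """Append one user/assistant exchange and trim old turns to bound context size."""
    history = history + [
        {"role": "user", "content": user_text},
        {"role": "assistant", "content": assistant_text},
    ]
    return history[-max_messages:]  # keep only the most recent messages
```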
Example:
1. User: "What's the weather in Paris?"
2. AI: "I don't have real-time weather data..."
3. User: "Then tell me about the city"
4. AI: (Knows "the city" = Paris from context)

Response Mapping
Extract data from responses:
| Field | Description |
|---|---|
| Message content | The AI-generated text response |
| Total tokens | Token count for usage tracking |
Example:
- Save response to {{ai_response}}
- Track usage with {{token_count}}
- Display {{ai_response}} to user
- Log {{token_count}} for analytics
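The two mapped fields correspond to well-known paths in the OpenAI-compatible response body. A minimal extraction sketch (assuming the standard response shape; the variable names mirror the examples above):

```python
def map_response(response: dict) -> dict:
    """Extract the fields the block exposes from an OpenAI-style Groq response."""
    return {
        "ai_response": response["choices"][0]["message"]["content"],
        "token_count": response["usage"]["total_tokens"],
    }
```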
Performance Optimization
Model Selection
For Speed (< 100ms latency):
- llama-3.1-8b-instant
- Use when: Voice apps, real-time chat, instant feedback
For Quality (still fast, ~200ms):
- llama-3.3-70b-versatile
- Use when: Complex questions, detailed answers
For Reasoning (balanced):
- mixtral-8x7b-32768
- Use when: Technical content, code, analysis
Temperature Guidelines
| Task Type | Recommended Temperature |
|---|---|
| Facts, data lookup | 0 - 0.2 |
| Customer support | 0.3 - 0.5 |
| General chat | 0.6 - 0.8 |
| Creative writing | 0.9 - 1.2 |
| Brainstorming | 1.3 - 2.0 |
Message Optimization
Keep it concise:
- Shorter prompts = faster responses
- Be direct and specific
- Remove unnecessary context
Before:
I was wondering if you could possibly help me understand
what the difference might be between machine learning and
deep learning, if that's okay?

After:
Explain the difference between machine learning and deep learning.

What You Get Back
Response includes:
- Message Content: The AI's text response
- Total Tokens: Number of tokens used
- Latency: Processing time (typically <1 second)
Tips for Success
- Leverage speed - Design workflows that benefit from fast responses
  - Real-time chat
  - Live support
  - Interactive applications
- Choose the right model - Match model to task
  - Instant models: Speed-critical applications
  - 70B models: Quality-critical applications
  - Mixtral: Technical/complex tasks
- Optimize prompts - Faster responses with better prompts
  - Be concise and specific
  - Remove fluff
  - Use clear instructions
- Monitor costs - Track token usage
  - Map token counts to variables
  - Set up usage alerts
  - Optimize prompt length
- Test streaming - Better UX for long responses
  - Enable streaming where UI supports it
  - Shows response as it's generated
  - Reduces perceived wait time
Troubleshooting
| Problem | Likely Cause | Solution |
|---|---|---|
| No models loading | Invalid API key | Verify API key at console.groq.com |
| Rate limit errors | Too many requests | Implement request queuing or upgrade plan |
| Incomplete responses | Token limit reached | Use models with larger context windows |
| Slow responses | Network issues | Check your network; Groq responses are typically under 1 second |
| Empty response | Model overload | Retry or switch to different model |
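Rate-limit errors and occasional empty responses are transient, so a retry with exponential backoff usually resolves them. A minimal sketch (the wrapper name and backoff schedule are illustrative):

```python
import time

def call_with_retries(fn, max_attempts=3, base_delay=1.0):
    """Retry a Groq call with exponential backoff on transient errors (e.g. 429s)."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the error
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
```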
Best Practices
- Design for speed - Build experiences that showcase fast inference
- Use appropriate models - Don't use 70B models when 8B will do
- Implement retries - Handle occasional rate limits gracefully
- Cache when possible - Save common queries to reduce API calls
- Monitor latency - Track response times to ensure performance
- Test different models - Find the best speed/quality balance
- Keep context minimal - Only include necessary conversation history
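The caching practice above can be sketched as a thin wrapper that memoizes completions by model and prompt, so repeated identical queries skip the API entirely (an in-memory sketch; production use would need eviction and persistence):

```python
def make_cached_completion(complete_fn):
    """Wrap a completion function with a simple in-memory cache keyed by (model, prompt)."""
    cache = {}
    def cached(model, prompt):
        key = (model, prompt)
        if key not in cache:
            cache[key] = complete_fn(model, prompt)  # only call the API on a miss
        return cache[key]
    return cached
```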
Groq vs Other Providers
Why Choose Groq:
- ⚡ Speed: Up to 10x faster than traditional GPU inference
- 💰 Cost: Competitive pricing for performance
- 🔓 Open Models: Access to leading open-source models
- 🔄 Compatibility: OpenAI-compatible API for easy migration
When to Use Alternatives:
- Need proprietary models (GPT-4, Claude)
- Require specific model fine-tuning
- Need very large context windows (>32K tokens)
Pricing
Groq offers competitive per-token pricing:
- Pay only for usage
- No minimum commitments
- Free tier available for testing
- Volume discounts for scale
Check console.groq.com/settings/billing for current rates.