Groq Integration
What it does: Access ultra-fast AI inference using Groq's LPU (Language Processing Unit) technology for lightning-speed chat completions.
In simple terms: Groq is like having a supercharged AI engine. It runs the same models as other providers but at incredibly fast speeds, perfect for real-time applications.
When to Use This
Use Groq when you need:
- ✅ Ultra-fast response times (up to 10x faster than traditional GPUs)
- ✅ Real-time AI conversations without lag
- ✅ Cost-effective inference at scale
- ✅ Support for popular open-source models
- ✅ Low-latency applications (chat, voice, live support)
Example: Build a real-time customer support chatbot that responds instantly, or create live AI-powered voice assistants.
Key Features
- Blazing Fast: Industry-leading inference speed
- Open Source Models: Llama, Mixtral, Gemma, and more
- Low Latency: Perfect for real-time applications
- Cost-Effective: Competitive pricing for fast inference
- OpenAI-Compatible API: Easy migration from OpenAI
- High Throughput: Handle many requests simultaneously
Setup Guide
Step 1: Get Groq API Key
- Go to console.groq.com and sign up
- Navigate to API Keys section
- Click "Create API Key"
- Give it a descriptive name
- Copy and save your API key securely
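Rather than pasting the key directly into configuration, it is safer to read it from an environment variable. A minimal sketch (the variable name `GROQ_API_KEY` is a common convention, not a requirement of the block):

```python
import os

def get_groq_api_key() -> str:
    """Read the Groq API key from the environment instead of hardcoding it."""
    key = os.environ.get("GROQ_API_KEY")
    if not key:
        raise RuntimeError("GROQ_API_KEY is not set; create one at console.groq.com")
    return key
```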
Step 2: Configure the Block
Connection Settings:
- Credentials: Select your Groq credentials from the dropdown or create new ones
  - API Key: Your Groq API key
  - The system will automatically fetch available models
- Model: Select the AI model you want to use
  - llama-3.3-70b-versatile: Latest Llama 3.3 model, great for general tasks
  - llama-3.1-70b-versatile: Previous generation, still very capable
  - mixtral-8x7b-32768: Excellent for complex reasoning
  - gemma-7b-it: Compact model for faster responses
- Messages: Configure the conversation
  - System Message: Instructions that guide the AI's behavior
  - User Messages: Add user queries and context
  - Dialogue: Use conversation history from variables
- Temperature: Control response randomness (0-2)
  - 0: Deterministic, focused responses
  - 1: Balanced creativity (default)
  - 2: Maximum creativity and randomness
- Response Mapping: Save AI responses to variables
  - Map "Message content" to workflow variables
  - Access total tokens used
  - Store response for later use
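Under the hood, this configuration maps onto a standard OpenAI-style chat completion request. A minimal sketch of what the block sends, assuming Groq's documented OpenAI-compatible endpoint (`/openai/v1/chat/completions`); the helper names are illustrative, not part of the block:

```python
import json
import urllib.request

GROQ_URL = "https://api.groq.com/openai/v1/chat/completions"

def build_payload(model, system_message, user_message, temperature=1.0):
    """Assemble the OpenAI-style chat payload described in the settings above."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system_message},
            {"role": "user", "content": user_message},
        ],
        "temperature": temperature,
    }

def send(payload, api_key):
    """POST the payload to Groq (requires a valid API key and network access)."""
    req = urllib.request.Request(
        GROQ_URL,
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```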
Available Models
Llama 3 Family
llama-3.3-70b-versatile (Recommended):
- Latest and most capable Llama model
- Excellent reasoning and understanding
- Fast inference on Groq's LPU
- Use for: General chatbots, Q&A, content generation
llama-3.1-70b-versatile:
- Previous generation, proven reliability
- Great balance of speed and capability
- Use for: Most general-purpose tasks
llama-3.1-8b-instant:
- Smaller, faster variant
- Lower latency for real-time needs
- Use for: Quick responses, simple queries
Mixtral
mixtral-8x7b-32768:
- Mixture of Experts architecture
- Excellent for complex reasoning
- Large context window (32K tokens)
- Use for: Technical questions, code generation, analysis
Gemma
gemma-7b-it:
- Google's open model
- Compact and efficient
- Use for: Resource-conscious applications
Message Configuration
System Message
Define how the AI should behave:
You are a helpful AI assistant. Provide clear, accurate, and concise responses.
Focus on being direct and informative.

User Messages
Add the user's query:
- Use workflow variables: {{user_question}}
- Combine context: Answer this based on {{context}}: {{question}}
- Multi-turn: Previous answer: {{last_response}}. Follow-up: {{new_question}}
Dialogue History
Reference conversation history:
- Select a dialogue variable storing past messages
- Maintains conversation context
- Enables natural multi-turn conversations
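The `{{variable}}` substitution above can be sketched as a simple template renderer. A minimal illustration, assuming double-brace placeholders as shown in the examples (the actual workflow engine's implementation may differ):

```python
import re

def render_template(template: str, variables: dict) -> str:
    """Replace {{name}} placeholders with workflow variable values."""
    def substitute(match):
        name = match.group(1).strip()
        if name not in variables:
            raise KeyError(f"Undefined workflow variable: {name}")
        return str(variables[name])
    return re.sub(r"\{\{(.*?)\}\}", substitute, template)
```

For example, `render_template("Answer this based on {{context}}: {{question}}", vars)` fills both placeholders before the message is sent to the model.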
Common Use Cases
Real-Time Chatbot
Ultra-responsive conversational AI:
- Model: llama-3.3-70b-versatile
- System Message: You are a friendly chatbot. Respond quickly and naturally to user messages. Keep responses concise but helpful.
- User Message: {{user_input}}
- Temperature: 0.7
- Why Groq: Lightning-fast responses create smooth conversations
Live Customer Support
Instant support responses:
- Model: llama-3.1-8b-instant (for maximum speed)
- System Message: You are a customer support agent. Answer questions about our products and services. Be helpful, professional, and efficient. Product info: {{product_knowledge}}
- User Message: {{customer_question}}
- Temperature: 0.4
- Why Groq: No lag between customer question and AI response
Code Assistant
Fast programming help:
- Model: mixtral-8x7b-32768
- System Message: You are a coding assistant. Provide clear code examples with brief explanations. Focus on best practices and working solutions.
- User Message: {{code_question}}
- Temperature: 0.2
- Why Groq: Rapid code generation and explanations
Content Summarization
Quick document summaries:
- Model: llama-3.3-70b-versatile
- System Message: You are a summarization expert. Extract key points and create concise summaries. Maintain the main ideas while being brief.
- User Message: Summarize this: {{document}}
- Temperature: 0.3
- Why Groq: Process long documents quickly
Voice Assistant Backend
Power voice-enabled AI:
- Model: llama-3.1-8b-instant
- System Message: You are a voice assistant. Provide brief, natural-sounding responses optimized for text-to-speech. Avoid long lists or complex formatting.
- User Message: {{transcribed_speech}}
- Temperature: 0.6
- Why Groq: Minimal latency critical for voice interactions
Advanced Features
Stream Responses
Enable real-time streaming for even better UX:
- Responses appear word-by-word as generated
- User sees output immediately
- Perfect for chat interfaces
- Reduces perceived latency
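Streamed responses arrive as OpenAI-style server-sent events, one `data:` line per chunk with the new text in `choices[0].delta.content`. A minimal sketch of reassembling the text from such a stream (assuming the standard OpenAI-compatible SSE format Groq uses):

```python
import json

def extract_stream_text(sse_lines):
    """Reassemble incremental text from OpenAI-style server-sent-event lines."""
    parts = []
    for line in sse_lines:
        line = line.strip()
        if not line.startswith("data: "):
            continue  # skip keep-alives and blank separators
        data = line[len("data: "):]
        if data == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        delta = chunk["choices"][0].get("delta", {})
        if "content" in delta:
            parts.append(delta["content"])
    return "".join(parts)
```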
Conversation Context
Maintain Context:
- Store conversation in dialogue variable
- Include relevant history in each request
- Update history after each exchange
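The update step above can be sketched as appending each exchange to a message list and trimming old turns so the context stays small (the trimming policy here is illustrative):

```python
def append_exchange(history, user_text, assistant_text, max_messages=20):
    """Append one user/assistant exchange and trim old turns to bound context size."""
    history = history + [
        {"role": "user", "content": user_text},
        {"role": "assistant", "content": assistant_text},
    ]
    return history[-max_messages:]  # keep only the most recent messages
```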
Example:
1. User: "What's the weather in Paris?"
2. AI: "I don't have real-time weather data..."
3. User: "Then tell me about the city"
4. AI: (Knows "the city" = Paris from context)

Response Mapping
Extract data from responses:
| Field | Description |
|---|---|
| Message content | The AI-generated text response |
| Total tokens | Token count for usage tracking |
Example:
- Save response to {{ai_response}}
- Track usage with {{token_count}}
- Display {{ai_response}} to user
- Log {{token_count}} for analytics
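The two mapped fields correspond to well-known paths in the OpenAI-compatible response body. A minimal extraction sketch (assuming the standard response shape; the variable names mirror the examples above):

```python
def map_response(response: dict) -> dict:
    """Extract the fields the block exposes from an OpenAI-style Groq response."""
    return {
        "ai_response": response["choices"][0]["message"]["content"],
        "token_count": response["usage"]["total_tokens"],
    }
```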
Performance Optimization
Model Selection
For Speed (< 100ms latency):
- llama-3.1-8b-instant
- Use when: Voice apps, real-time chat, instant feedback
For Quality (still fast, ~200ms):
- llama-3.3-70b-versatile
- Use when: Complex questions, detailed answers
For Reasoning (balanced):
- mixtral-8x7b-32768
- Use when: Technical content, code, analysis
Temperature Guidelines
| Task Type | Recommended Temperature |
|---|---|
| Facts, data lookup | 0 - 0.2 |
| Customer support | 0.3 - 0.5 |
| General chat | 0.6 - 0.8 |
| Creative writing | 0.9 - 1.2 |
| Brainstorming | 1.3 - 2.0 |
Message Optimization
Keep it concise:
- Shorter prompts = faster responses
- Be direct and specific
- Remove unnecessary context
Before:
I was wondering if you could possibly help me understand
what the difference might be between machine learning and
deep learning, if that's okay?

After:
Explain the difference between machine learning and deep learning.

What You Get Back
Response includes:
- Message Content: The AI's text response
- Total Tokens: Number of tokens used
- Latency: Processing time (typically <1 second)
Tips for Success
- Leverage speed - Design workflows that benefit from fast responses
  - Real-time chat
  - Live support
  - Interactive applications
- Choose the right model - Match model to task
  - Instant models: Speed-critical applications
  - 70B models: Quality-critical applications
  - Mixtral: Technical/complex tasks
- Optimize prompts - Faster responses with better prompts
  - Be concise and specific
  - Remove fluff
  - Use clear instructions
- Monitor costs - Track token usage
  - Map token counts to variables
  - Set up usage alerts
  - Optimize prompt length
- Test streaming - Better UX for long responses
  - Enable streaming where UI supports it
  - Shows response as it's generated
  - Reduces perceived wait time
Troubleshooting
| Problem | Likely Cause | Solution |
|---|---|---|
| No models loading | Invalid API key | Verify API key at console.groq.com |
| Rate limit errors | Too many requests | Implement request queuing or upgrade plan |
| Incomplete responses | Token limit reached | Use models with larger context windows |
| Slow responses | Network issues | Check your network; Groq responses are typically under 1 second |
| Empty response | Model overload | Retry or switch to different model |
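Rate-limit errors and occasional empty responses are transient, so a retry with exponential backoff usually resolves them. A minimal sketch (the wrapper name and backoff schedule are illustrative):

```python
import time

def call_with_retries(fn, max_attempts=3, base_delay=1.0):
    """Retry a Groq call with exponential backoff on transient errors (e.g. 429s)."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the error
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
```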
Best Practices
- Design for speed - Build experiences that showcase fast inference
- Use appropriate models - Don't use 70B models when 8B will do
- Implement retries - Handle occasional rate limits gracefully
- Cache when possible - Save common queries to reduce API calls
- Monitor latency - Track response times to ensure performance
- Test different models - Find the best speed/quality balance
- Keep context minimal - Only include necessary conversation history
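The caching practice above can be sketched as a thin wrapper that memoizes completions by model and prompt, so repeated identical queries skip the API entirely (an in-memory sketch; production use would need eviction and persistence):

```python
def make_cached_completion(complete_fn):
    """Wrap a completion function with a simple in-memory cache keyed by (model, prompt)."""
    cache = {}
    def cached(model, prompt):
        key = (model, prompt)
        if key not in cache:
            cache[key] = complete_fn(model, prompt)  # only call the API on a miss
        return cache[key]
    return cached
```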
Groq vs Other Providers
Why Choose Groq:
- ⚡ Speed: Up to 10x faster than traditional GPU inference
- 💰 Cost: Competitive pricing for performance
- 🔓 Open Models: Access to leading open-source models
- 🔄 Compatibility: OpenAI-compatible API for easy migration
When to Use Alternatives:
- Need proprietary models (GPT-4, Claude)
- Require specific model fine-tuning
- Need very large context windows (>32K tokens)
Pricing
Groq offers competitive per-token pricing:
- Pay only for usage
- No minimum commitments
- Free tier available for testing
- Volume discounts for scale
Check console.groq.com/settings/billing for current rates.