Troubleshooting Guide¶
Common issues and their solutions when operating Metis.
Application Errors¶
"No relevant documents found for this client"¶
Cause: The client_id doesn't match any indexed data in S3.
Fix:
1. Verify the client_id being sent: check the x-client-id header or the ?client= URL parameter
2. Verify S3 has the index: aws s3 ls s3://{bucket}/{client_id}/index/
3. If missing, run ingestion: python -m ingest.ingest --client {client_id} --source /path/to/docs
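Steps 2 and 3 can be combined into a small script. This is a minimal sketch, not part of Metis: the function names `index_exists` and `ensure_index` are hypothetical, and it relies on `aws s3 ls` exiting non-zero when the prefix contains no objects.

```shell
# Hypothetical helper: succeeds if the client's index prefix contains any objects.
index_exists() {
  # $1 = bucket name, $2 = client_id
  aws s3 ls "s3://$1/$2/index/" >/dev/null 2>&1
}

# Hypothetical helper: run ingestion only when the index is missing.
ensure_index() {
  # $1 = bucket name, $2 = client_id, $3 = path to source documents
  if index_exists "$1" "$2"; then
    echo "index present for $2"
  else
    echo "index missing for $2; running ingestion"
    python -m ingest.ingest --client "$2" --source "$3"
  fi
}
```

Example call, with placeholder values: `ensure_index my-bucket acme /path/to/docs`.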
"Invalid host" / 400 Error¶
Cause: The Host header doesn't match the configured BASE_DOMAIN.
Fix: Ensure BASE_DOMAIN env var matches your domain. For localhost development, the backend automatically accepts localhost, 127.0.0.1, and 0.0.0.0.
500 Internal Server Error¶
Check the CloudWatch logs (the commands are in the CloudWatch Logs section under Monitoring).
Common causes:
- LLM rate limit (429): Groq free tier exhausted. Wait for reset or switch to a smaller model / paid tier.
- S3 permission error: ECS task role lacks S3 read permission. Check IAM policies.
- SSM secret not found: The OPENAI_API_KEY or ADMIN_TOKEN SSM parameter doesn't exist, or the ECS execution role can't read it.
Streaming returns no tokens (empty answers)¶
Cause: Usually an LLM API error, which leaves the AnswerStreamExtractor unable to find the "answer" field in the response.
Fix:
1. Check CloudWatch logs for LLM errors (rate limit, auth failure, model not found)
2. Test the non-streaming endpoint (POST /query) — errors are more visible there
3. Verify OPENAI_API_KEY, OPENAI_BASE_URL, and OPENAI_MODEL are correct
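One way to check step 3 outside the application is to call the provider's model-listing endpoint directly. This is a sketch under assumptions: it assumes the provider is OpenAI-compatible and that OPENAI_BASE_URL already ends in the API version path, so that `/models` can be appended. The function names are hypothetical, and the check does not validate OPENAI_MODEL.

```shell
# Hypothetical helper: print the HTTP status returned by the /models endpoint.
llm_status() {
  curl -s -o /dev/null -w '%{http_code}' \
    -H "Authorization: Bearer ${OPENAI_API_KEY}" \
    "${OPENAI_BASE_URL%/}/models"
}

# Hypothetical helper: map the status code to the failure modes listed above.
explain_llm_status() {
  case "$1" in
    200)     echo "credentials accepted" ;;
    401|403) echo "auth failure: check OPENAI_API_KEY" ;;
    404)     echo "not found: check OPENAI_BASE_URL" ;;
    429)     echo "rate limited: wait for reset or change tier" ;;
    *)       echo "unexpected status $1" ;;
  esac
}
```

Run it as `explain_llm_status "$(llm_status)"` with the same environment variables the backend uses.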
Answers reference wrong client's documents¶
Cause: Client routing misconfiguration.
Fix:
1. Check if DEMO_PICKER_HOSTS includes your subdomain (it should for demo mode)
2. Check if DEFAULT_DEMO_CLIENT is set — this is the fallback when no ?client= param is provided
3. Inspect the x-client-id header in browser dev tools (Network tab)
Deployment Errors¶
"exec format error" in ECS¶
Cause: The Docker image architecture doesn't match the Fargate platform, for example an ARM64 image running on an X86_64 task, or vice versa.
Fix: Set runtime_platform.cpu_architecture in the task definition to match the architecture of the machine that built the image:
- Apple Silicon (M1/M2/M3) → ARM64 (default)
- Intel/AMD Linux → X86_64
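To see which value applies, the mapping above can be expressed as a small function. This is an illustrative sketch; `fargate_arch_for` is a hypothetical name, and it accepts both the `uname -m` spellings (arm64, aarch64, x86_64) and the Docker spelling (amd64).

```shell
# Hypothetical helper: map a machine or image architecture to the cpu_architecture value.
fargate_arch_for() {
  case "$1" in
    arm64|aarch64) echo "ARM64" ;;
    x86_64|amd64)  echo "X86_64" ;;
    *)             echo "unknown"; return 1 ;;
  esac
}
```

Use `fargate_arch_for "$(uname -m)"` for the build machine, or `fargate_arch_for "$(docker image inspect --format '{{.Architecture}}' myapp)"` for an image that is already built (`myapp` is a placeholder tag).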
ACM Certificate stuck in PENDING_VALIDATION¶
Cause: DNS validation CNAME not resolving, or CAA records blocking Amazon.
Fix:
1. Check whether your domain's authoritative DNS serves the validation CNAME.
2. If using external DNS (Vercel, Cloudflare), add the CNAME manually.
3. Check CAA records: dig CAA yourdomain.com. The result must include amazon.com (and ideally amazonaws.com, amazontrust.com, awstrust.com).
4. If the cert shows FAILED with CAA_ERROR, delete it, wait 10 minutes for cache expiry, and request it again.
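The CAA check in step 3 can be scripted. This is a minimal sketch with a hypothetical function name; it reads the output of `dig +short CAA` from stdin and treats an empty record set as "no restriction at this name". It does not walk up to parent domains, which CAA lookups also consult.

```shell
# Hypothetical helper: succeed if the CAA records on stdin permit an Amazon CA.
caa_allows_amazon() {
  records="$(cat)"
  # No CAA records at this name: nothing here blocks issuance.
  if [ -z "$records" ]; then
    return 0
  fi
  printf '%s\n' "$records" |
    grep -Eq 'issue(wild)? "(amazon\.com|amazonaws\.com|amazontrust\.com|awstrust\.com)"'
}
```

Example: `dig +short CAA yourdomain.com | caa_allows_amazon && echo "CAA ok" || echo "CAA blocks Amazon"`.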
ECS task keeps restarting¶
Check the logs for the specific error (see the CloudWatch Logs section under Monitoring).
Common causes:
- Health check failing (app not starting within the 60s start period)
- Out of memory (increase ecs_memory in terraform.tfvars)
- Missing environment variables
- S3 bucket doesn't exist
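Besides the application logs, ECS records a stop reason for each stopped task, which often distinguishes these causes. This is a sketch with a hypothetical function name; it inspects the first stopped task that `list-tasks` returns (not necessarily the most recent), and ECS keeps stopped tasks visible only for a limited time.

```shell
# Hypothetical helper: print the stop reason, container reason, and exit code
# for one stopped task of the given service.
stopped_task_reason() {
  # $1 = cluster name, $2 = service name
  task_arn="$(aws ecs list-tasks --cluster "$1" --service-name "$2" \
    --desired-status STOPPED --query 'taskArns[0]' --output text)"
  if [ -z "$task_arn" ] || [ "$task_arn" = "None" ]; then
    echo "no stopped tasks found"
    return 0
  fi
  aws ecs describe-tasks --cluster "$1" --tasks "$task_arn" \
    --query 'tasks[0].[stoppedReason,containers[0].reason,containers[0].exitCode]' \
    --output text
}
```

Example: `stopped_task_reason {app_name} {app_name}`. A container reason mentioning OutOfMemoryError, or exit code 137, points at the memory limit.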
Docker build fails with "credential-desktop" error¶
Cause: The Docker Desktop credential helper (docker-credential-desktop) is referenced by your Docker config but is not in PATH (common with Colima).
Fix: Create a clean Docker config:
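# Create a throwaway Docker config with no credential helper entry (credsStore)
mkdir -p /tmp/docker-cfg
echo '{"auths": {}, "currentContext": "colima"}' > /tmp/docker-cfg/config.json
# Copy the existing contexts so the colima context still resolves
cp -r ~/.docker/contexts /tmp/docker-cfg/
# Point this build at the clean config
DOCKER_CONFIG=/tmp/docker-cfg docker build -t myapp .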
Terraform apply hangs¶
Cause: Usually waiting for ACM certificate validation (can take 10+ minutes if DNS is slow).
Fix: Check cert status separately:
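A minimal sketch of that check using the AWS CLI. The function name is hypothetical and the certificate ARN is a placeholder you supply; run it in the region where the certificate was requested.

```shell
# Hypothetical helper: print the certificate status
# (for example PENDING_VALIDATION, ISSUED, or FAILED).
cert_status() {
  # $1 = certificate ARN
  aws acm describe-certificate --certificate-arn "$1" \
    --query 'Certificate.Status' --output text
}
```

Example: `cert_status "$CERT_ARN"`, where `CERT_ARN` holds the ARN of the certificate Terraform is waiting on.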
If FAILED, see the ACM troubleshooting section above.
Performance Issues¶
Slow first query after deployment¶
Cause: Cold start — downloading indexes from S3 + loading embedding model + loading reranker model.
Fix: Set WARMUP_CLIENTS to pre-load important clients on startup. The first load takes 5-10 seconds; subsequent queries take under 2 seconds.
Answers are generic / low quality¶
Possible causes:
1. Too few documents: Good cross-document synthesis needs at least 10-20 documents with overlapping topics
2. Documents too large without structure: Add headings/sections so the chunker can segment them properly
3. Wrong model: The 8B model is good for summarization but may miss nuance. Try the 70B model for better reasoning.
4. Reranking disabled: Set RAG_RERANKING_ENABLED=true for substantially better source selection
Conflict detection not firing¶
Cause: The conditional gate only triggers when sources span different ages or types.
Fix: Ensure your corpus includes both current AND legacy documents covering the same topic. The files must be in directories that trigger legacy classification (e.g., old_docs/, archive/).
Monitoring¶
CloudWatch Logs¶
# Tail live logs
aws logs tail /ecs/{app_name} --follow --format short
# Search for errors in the last hour (GNU date; on macOS/BSD use: date -v-1H +%s000)
aws logs filter-log-events --log-group-name /ecs/{app_name} \
  --filter-pattern "ERROR" --start-time "$(date -d '1 hour ago' +%s000)"
ECS Service Health¶
# Check deployment status
aws ecs describe-services --cluster {app_name} --services {app_name} \
--query 'services[0].deployments[0].[rolloutState,runningCount,desiredCount]'
# Check recent events
aws ecs describe-services --cluster {app_name} --services {app_name} \
--query 'services[0].events[0:5].[createdAt,message]' --output table
Cost Monitoring¶
- ECS Fargate: ~$30-40/month for 1 vCPU / 3 GB task running 24/7
- S3: Negligible (indexes are typically <100 MB total)
- ALB: ~$16/month base + per-request charges
- CloudWatch: ~$0.50/month for log storage
- LLM API: Depends on usage. Groq 8B: ~$0.05 per 1M tokens. Budget ~$1-5/month for demo use.
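The Fargate estimate can be reproduced from per-hour rates. This is a sketch with a hypothetical function name. The rates in the example call are illustrative assumptions (roughly the on-demand Linux/x86 rates in one US region at the time of writing); check the current AWS pricing page for your region and architecture.

```shell
# Hypothetical helper: monthly cost of one always-on Fargate task.
fargate_monthly_cost() {
  # $1 = vCPU count, $2 = memory in GB,
  # $3 = USD per vCPU-hour, $4 = USD per GB-hour, $5 = hours per month
  awk -v v="$1" -v g="$2" -v vr="$3" -v gr="$4" -v h="$5" \
    'BEGIN { printf "%.2f\n", (v * vr + g * gr) * h }'
}

# 1 vCPU / 3 GB running 730 hours at the assumed rates
fargate_monthly_cost 1 3 0.04048 0.004445 730   # prints 39.28
```

ARM64 tasks are billed at lower rates, which is consistent with the lower end of the range quoted above.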