Troubleshooting Guide

Common issues and their solutions when operating Metis.

Application Errors

"No relevant documents found for this client"

Cause: The client_id doesn't match any indexed data in S3.

Fix:

1. Verify the client_id being sent: check the x-client-id header or the URL ?client= param.
2. Verify S3 has the index: aws s3 ls s3://{bucket}/{client_id}/index/
3. If missing, run ingestion: python -m ingest.ingest --client {client_id} --source /path/to/docs
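Step 2 can also be scripted. A minimal sketch, assuming `s3` is a boto3 S3 client (for example `boto3.client("s3")`) and that indexes live under `{client_id}/index/` as in the command above. `index_exists` is a hypothetical helper name, not part of Metis.

```python
def index_exists(s3, bucket: str, client_id: str) -> bool:
    """Return True if at least one object exists under {client_id}/index/.

    `s3` is assumed to be a boto3 S3 client; it is passed in so the
    check can be exercised without AWS credentials.
    """
    prefix = f"{client_id}/index/"
    # One key is enough to prove the prefix is not empty.
    response = s3.list_objects_v2(Bucket=bucket, Prefix=prefix, MaxKeys=1)
    return response.get("KeyCount", 0) > 0
```

If this returns False for the client_id your requests carry, run the ingestion command from step 3.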

"Invalid host" / 400 Error

Cause: The Host header doesn't match the configured BASE_DOMAIN.

Fix: Ensure BASE_DOMAIN env var matches your domain. For localhost development, the backend automatically accepts localhost, 127.0.0.1, and 0.0.0.0.
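To reason about which Host values pass, the check can be pictured roughly as below. This is a sketch, not the backend's actual code: the function name is hypothetical, and accepting subdomains of BASE_DOMAIN is an assumption.

```python
LOCAL_HOSTS = {"localhost", "127.0.0.1", "0.0.0.0"}


def is_allowed_host(host_header: str, base_domain: str) -> bool:
    """Approximate the Host check: local hosts, BASE_DOMAIN, or a subdomain of it."""
    # Drop an optional ":port" suffix and normalize case (IPv6 literals are not handled).
    host = host_header.strip().lower().split(":", 1)[0]
    base = base_domain.strip().lower()
    if host in LOCAL_HOSTS:
        return True
    if not base:
        return False
    # Assumption: subdomains of BASE_DOMAIN are accepted as well.
    return host == base or host.endswith("." + base)
```

Under these assumptions, a value such as `evil-example.com` is rejected for `BASE_DOMAIN=example.com`, because the suffix match requires a leading dot.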

500 Internal Server Error

Check CloudWatch logs:

aws logs tail /ecs/{app_name} --since 5m --format short

Common causes:

- LLM rate limit (429): Groq free tier exhausted. Wait for the reset, or switch to a smaller model or a paid tier.
- S3 permission error: The ECS task role lacks S3 read permission. Check IAM policies.
- SSM secret not found: The OPENAI_API_KEY or ADMIN_TOKEN SSM parameter doesn't exist, or the ECS execution role can't read it.
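When the log output is long, a rough triage script can count how often each cause appears. The patterns below are assumptions about what the log lines contain (`AccessDenied` and `ParameterNotFound` are the AWS error codes for S3 and SSM respectively), not Metis's exact messages.

```python
import re

# Assumed signatures for the three causes listed above; adjust to your logs.
CAUSE_PATTERNS = {
    "LLM rate limit": re.compile(r"\b429\b|rate.?limit", re.IGNORECASE),
    "S3 permission error": re.compile(r"AccessDenied", re.IGNORECASE),
    "SSM secret not found": re.compile(r"ParameterNotFound", re.IGNORECASE),
}


def classify(lines):
    """Count log lines matching each known cause."""
    counts = {cause: 0 for cause in CAUSE_PATTERNS}
    for line in lines:
        for cause, pattern in CAUSE_PATTERNS.items():
            if pattern.search(line):
                counts[cause] += 1
    return counts
```

One way to use it is to save the output of the `aws logs tail` command above to a file and pass the open file object to `classify`.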

Streaming returns no tokens (empty answers)

Cause: Usually an LLM API error. The AnswerStreamExtractor can't find the "answer" field in the response.

Fix:

1. Check CloudWatch logs for LLM errors (rate limit, auth failure, model not found).
2. Test the non-streaming endpoint (POST /query), where errors are more visible.
3. Verify that OPENAI_API_KEY, OPENAI_BASE_URL, and OPENAI_MODEL are correct.
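For step 3, a quick first pass is to confirm the three variables are set at all. The helper below is a hypothetical sketch; it only detects missing or blank values and cannot tell whether a key or model name is valid.

```python
import os

REQUIRED_LLM_VARS = ("OPENAI_API_KEY", "OPENAI_BASE_URL", "OPENAI_MODEL")


def missing_llm_settings(env=None):
    """Return the names of required LLM variables that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_LLM_VARS if not env.get(name, "").strip()]
```

Run it inside the container (or with the same environment as the task) and treat any returned name as a configuration gap to fix before retesting.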

Answers reference wrong client's documents

Cause: Client routing misconfiguration.

Fix:

1. Check whether DEMO_PICKER_HOSTS includes your subdomain (it should for demo mode).
2. Check whether DEFAULT_DEMO_CLIENT is set. This is the fallback when no ?client= param is provided.
3. Inspect the x-client-id header in browser dev tools (Network tab).

Deployment Errors

"exec format error" in ECS

Cause: The Docker image architecture doesn't match the Fargate platform (an ARM64 image on an x86 task, or vice versa).

Fix: The task definition sets runtime_platform.cpu_architecture. Ensure it matches your build machine:

- Apple Silicon (M1/M2/M3) → ARM64 (default)
- Intel/AMD Linux → X86_64
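If you are unsure which value applies, the build machine can report its own architecture. A small sketch using Python's standard `platform` module; the function name is hypothetical.

```python
import platform


def fargate_cpu_architecture(machine=None) -> str:
    """Map a machine type string to the matching cpu_architecture value."""
    machine = (machine or platform.machine()).lower()
    if machine in {"arm64", "aarch64"}:
        return "ARM64"
    if machine in {"x86_64", "amd64"}:
        return "X86_64"
    raise ValueError(f"Unrecognized machine type: {machine!r}")
```

This reflects the host only. If the image is built for a different target, for example with `docker build --platform`, match the task definition to the image's platform instead.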

ACM Certificate stuck in PENDING_VALIDATION

Cause: DNS validation CNAME not resolving, or CAA records blocking Amazon.

Fix:

1. Check whether your domain's DNS serves the validation CNAME (queried here through a public resolver):

dig CNAME _xxx.yourdomain.com @8.8.8.8

2. If using external DNS (Vercel, Cloudflare), add the CNAME manually.
3. Check CAA records with dig CAA yourdomain.com. If any records exist, they must include amazon.com (and ideally amazonaws.com, amazontrust.com, awstrust.com).
4. If the cert shows FAILED with CAA_ERROR, delete it, wait 10 minutes for cache expiry, and request again.
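The CAA check in step 3 can be automated against the output of `dig +short CAA yourdomain.com`. The sketch below is deliberately simplified: it looks only at `issue` records and ignores `issuewild`, so it does not cover wildcard certificates. The function name is hypothetical.

```python
AMAZON_CA_DOMAINS = {"amazon.com", "amazontrust.com", "awstrust.com", "amazonaws.com"}


def caa_allows_amazon(dig_lines) -> bool:
    """Check `dig +short CAA` lines such as '0 issue "amazon.com"'.

    Returns True when no `issue` record restricts issuance, or when at
    least one `issue` record names an Amazon CA domain.
    """
    issuers = []
    for line in dig_lines:
        parts = line.split(None, 2)
        if len(parts) != 3:
            continue  # skip blank or unexpected lines
        _flags, tag, value = parts
        if tag.lower() != "issue":
            continue  # issuewild, iodef, etc. are out of scope for this sketch
        # Remove surrounding quotes and any parameters after ';'.
        issuer = value.strip().strip('"').split(";", 1)[0].strip().lower()
        issuers.append(issuer)
    if not issuers:
        return True  # no issue records means no CA is excluded
    return any(issuer in AMAZON_CA_DOMAINS for issuer in issuers)
```

CAA records are inherited from parent domains, so when a subdomain returns nothing, run the same check against the parent domain too.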

ECS task keeps restarting

Check logs for the specific error:

aws logs tail /ecs/{app_name} --since 15m --format short | tail -30

Common causes:

- Health check failing (app not starting within the 60s start period)
- Out of memory (increase ecs_memory in terraform.tfvars)
- Missing environment variables
- S3 bucket doesn't exist

Docker build fails with "credential-desktop" error

Cause: Docker Desktop credential helper not in PATH (common with Colima).

Fix: Create a clean Docker config:

mkdir -p /tmp/docker-cfg
echo '{"auths": {}, "currentContext": "colima"}' > /tmp/docker-cfg/config.json
cp -r ~/.docker/contexts /tmp/docker-cfg/
DOCKER_CONFIG=/tmp/docker-cfg docker build -t myapp .

Terraform apply hangs

Cause: Usually waiting for ACM certificate validation (can take 10+ minutes if DNS is slow).

Fix: Check cert status separately:

aws acm list-certificates --query 'CertificateSummaryList[?contains(DomainName, `yourdomain`)]'

If the status is FAILED, see "ACM Certificate stuck in PENDING_VALIDATION" above.
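For a single certificate, the status, failure reason, and validation CNAMEs can be read in one call. A sketch assuming `acm` is a boto3 ACM client (`boto3.client("acm")`) and that you already have the certificate ARN; `certificate_summary` is a hypothetical helper name.

```python
def certificate_summary(acm, certificate_arn: str) -> dict:
    """Summarize an ACM certificate's validation state.

    `acm` is assumed to be a boto3 ACM client, passed in so the function
    can be tested without AWS access.
    """
    cert = acm.describe_certificate(CertificateArn=certificate_arn)["Certificate"]
    # ResourceRecord is only present for DNS-validated domains.
    cnames = [
        (option["ResourceRecord"]["Name"], option["ResourceRecord"]["Value"])
        for option in cert.get("DomainValidationOptions", [])
        if "ResourceRecord" in option
    ]
    return {
        "status": cert["Status"],
        "failure_reason": cert.get("FailureReason"),
        "validation_cnames": cnames,
    }
```

The returned CNAME names are the records to look up with `dig` when validation stays pending.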

Performance Issues

Slow first query after deployment

Cause: Cold start. The first query has to download indexes from S3 and load both the embedding model and the reranker model.

Fix: Set WARMUP_CLIENTS to pre-load important clients on startup. First load takes 5-10 seconds; subsequent queries are <2 seconds.

Answers are generic / low quality

Possible causes:

1. Too few documents: Need 10-20+ documents with overlapping topics for good cross-document synthesis.
2. Documents too large without structure: Add headings/sections so the chunker can segment them properly.
3. Wrong model: The 8B model is good for summarization but may miss nuance. Try the 70B model for better reasoning.
4. Reranking disabled: Enable RAG_RERANKING_ENABLED=true for dramatically better source selection.

Conflict detection not firing

Cause: The conditional gate only triggers when sources span different ages or types.

Fix: Ensure your corpus includes both current AND legacy documents covering the same topic. The files must be in directories that trigger legacy classification (e.g., old_docs/, archive/).
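To check whether a set of retrieved sources would plausibly pass the gate, the age condition can be approximated as below. This is an illustration only: the directory names come from the examples above, the real classifier may use a different list, and the "types" half of the gate is not modeled. Both function names are hypothetical.

```python
from pathlib import PurePosixPath

# Taken from the examples above; the actual legacy directory list may differ.
LEGACY_DIRS = {"old_docs", "archive"}


def is_legacy(path: str) -> bool:
    """Treat a file as legacy if any parent directory is a legacy directory."""
    parent_parts = PurePosixPath(path).parts[:-1]
    return any(part in LEGACY_DIRS for part in parent_parts)


def spans_ages(source_paths) -> bool:
    """True when the sources include both legacy and current documents."""
    ages = {is_legacy(path) for path in source_paths}
    return ages == {True, False}
```

If `spans_ages` is False for the sources behind an answer, the corpus layout is the first thing to review.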

Monitoring

CloudWatch Logs

# Tail live logs
aws logs tail /ecs/{app_name} --follow --format short

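# Search for errors in the last hour (--start-time is in epoch milliseconds;
# the arithmetic form works with both GNU and BSD/macOS date)
aws logs filter-log-events --log-group-name /ecs/{app_name} \
  --filter-pattern "ERROR" --start-time $(( ($(date +%s) - 3600) * 1000 ))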

ECS Service Health

# Check deployment status
aws ecs describe-services --cluster {app_name} --services {app_name} \
  --query 'services[0].deployments[0].[rolloutState,runningCount,desiredCount]'

# Check recent events
aws ecs describe-services --cluster {app_name} --services {app_name} \
  --query 'services[0].events[0:5].[createdAt,message]' --output table
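The same deployment check can be done from a script, for example in a post-deploy step. A sketch assuming `ecs` is a boto3 ECS client (`boto3.client("ecs")`); it reads the same fields as the first CLI query above, and `rollout_summary` is a hypothetical helper name.

```python
def rollout_summary(ecs, cluster: str, service: str) -> dict:
    """Summarize the first listed deployment of an ECS service.

    `ecs` is assumed to be a boto3 ECS client. Mirrors the CLI query
    services[0].deployments[0].[rolloutState,runningCount,desiredCount].
    """
    response = ecs.describe_services(cluster=cluster, services=[service])
    services = response.get("services", [])
    if not services:
        raise LookupError(f"Service {service!r} not found in cluster {cluster!r}")
    deployments = services[0].get("deployments", [])
    if not deployments:
        raise LookupError(f"Service {service!r} has no deployments")
    latest = deployments[0]
    state = latest.get("rolloutState")
    running = latest["runningCount"]
    desired = latest["desiredCount"]
    return {
        "rolloutState": state,
        "runningCount": running,
        "desiredCount": desired,
        # Settled means the rollout finished and all desired tasks are running.
        "settled": state == "COMPLETED" and running == desired,
    }
```

A result that stays at `settled: False` with a running count below the desired count usually points back to the "ECS task keeps restarting" section.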

Cost Monitoring

ECS Fargate: ~$30-40/month for 1 vCPU / 3 GB task running 24/7
S3: Negligible (indexes are typically <100 MB total)
ALB: ~$16/month base + per-request charges
CloudWatch: ~$0.50/month for log storage
LLM API: Depends on usage. Groq 8B: ~$0.05 per 1M tokens. Budget ~$1-5/month for demo use.
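The figures above can be combined into a rough monthly estimate. In the sketch below, the per-hour Fargate rates are illustrative assumptions for on-demand x86 Linux pricing and vary by region and architecture, so check current AWS pricing before relying on them. The ALB, CloudWatch, and LLM figures are the ones listed above, and per-request ALB charges are not included.

```python
HOURS_PER_MONTH = 730  # average hours in a month


def monthly_cost(
    vcpu=1.0,
    memory_gb=3.0,
    vcpu_hour_usd=0.04048,   # assumed on-demand rate, not authoritative
    gb_hour_usd=0.004445,    # assumed on-demand rate, not authoritative
    alb_base_usd=16.0,
    logs_usd=0.50,
    llm_tokens=0,
    llm_usd_per_million=0.05,
):
    """Estimate monthly cost in USD for one always-on task plus fixed charges."""
    fargate = HOURS_PER_MONTH * (vcpu * vcpu_hour_usd + memory_gb * gb_hour_usd)
    llm = llm_tokens / 1_000_000 * llm_usd_per_million
    total = fargate + alb_base_usd + logs_usd + llm
    return {
        "fargate": round(fargate, 2),
        "llm": round(llm, 2),
        "total": round(total, 2),
    }
```

With the assumed rates, the Fargate portion for 1 vCPU / 3 GB lands near the top of the $30-40 range quoted above.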