Operational procedures for every failure scenario — from routine to catastrophic
This runbook covers every failure scenario for The Trinity Beast Infrastructure. Each scenario includes severity, detection method, impact, step-by-step recovery, and verification. All commands use the KCC or direct API calls.
Start with bash scripts/kcc.sh health to assess the current state before taking any action.
| Scenario | Severity | Auto-Recovery? | RTO |
|---|---|---|---|
| Single container failure | Low | Yes — ECS auto-replaces | 2-3 min |
| Full ECS cluster outage | Critical | Partial — ECS attempts recovery | 5-10 min |
| Aurora failover | High | Yes — automatic failover | 30-60 sec |
| Aurora data corruption | Critical | No — manual PITR required | 15-30 min |
| ElastiCache failure | Medium | Partial — app degrades gracefully | 5-10 min |
| All WebSocket feeds lost | Medium | Yes — auto-reconnect + REST fallback | 1-2 min |
| SQS queue backup | Low | Yes — self-draining | Minutes |
| DNS failure | Critical | No — manual intervention | 5-15 min |
| CloudFront / website down | Medium | Partial — CloudFront auto-heals | 5-10 min |
| Lambda failure | Low | Yes — retries automatically | Immediate |
| WAF blocking legitimate traffic | High | No — manual rule adjustment | 5 min |
| Admin key compromise | Critical | No — manual rotation | 2 min |
| Full region failure | Critical | No — requires multi-region | Hours |
Detection: bash scripts/kcc.sh health shows 2/3 nodes. CloudWatch alarm TrinityBeast-API-5xx-Spike may fire.
Impact: Minimal. ALB routes traffic to remaining 2 containers. Each container has independent WebSocket feeds.
Auto-Recovery: ECS detects unhealthy task and launches replacement within 2-3 minutes.
bash scripts/kcc.sh health
aws ecs update-service --cluster trinity-beast-fargate-cluster \
--service <affected-service> --force-new-deployment --region us-east-2
bash scripts/kcc.sh health
bash scripts/kcc.sh force-deploy
Detection: bash scripts/kcc.sh health shows 0/3 nodes. Both /health endpoints return errors.
Impact: Complete API outage. No price requests served. No admin endpoints available.
for svc in trinity-beast-main-service trinity-beast-mirror-service trinity-beast-lrs-service; do
echo "=== $svc ==="
aws ecs describe-services --cluster trinity-beast-fargate-cluster \
--services $svc --region us-east-2 \
--query "services[0].{Status:status,Running:runningCount,Desired:desiredCount,Events:events[:3]}" \
--output json
done
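The per-service counts printed by the loop above can be reduced to a single word per service, which makes a full outage (0 of 3 running) easy to tell apart from a partial one. The following is a minimal sketch, not part of the KCC: the function names service_state and check_service are hypothetical, and check_service assumes the AWS CLI is installed and authenticated.

```shell
# Hypothetical helper: reduce running/desired task counts to a one-word state.
service_state() {
  # $1 = running count, $2 = desired count
  for n in "$1" "$2"; do
    case $n in
      ''|*[!0-9]*) echo "unknown"; return 0 ;;  # missing or non-numeric input
    esac
  done
  if [ "$2" -gt 0 ] && [ "$1" -eq 0 ]; then
    echo "down"
  elif [ "$1" -lt "$2" ]; then
    echo "degraded"
  else
    echo "ok"
  fi
}

# Hypothetical wrapper: fetch the two counts for one service and classify them.
check_service() {
  # $1 = ECS service name; requires the AWS CLI and valid credentials.
  counts=$(aws ecs describe-services \
    --cluster trinity-beast-fargate-cluster --services "$1" --region us-east-2 \
    --query "services[0].[runningCount,desiredCount]" --output text) || counts=""
  # Word-splitting is intentional: text output is "running<TAB>desired".
  set -- "$1" $counts
  echo "$1: $(service_state "$2" "$3")"
}
```

Calling check_service trinity-beast-main-service prints the service name followed by ok, degraded, down, or unknown (the last when the CLI call returns nothing usable).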
bash scripts/kcc.sh deploy-ecs (rebuilds and pushes a fresh image)
aws ecr describe-images --repository-name trinity-beast-lpo-server \
--region us-east-2 --query "sort_by(imageDetails,&imagePushedAt)[-5:]" --output table
bash scripts/kcc.sh health
bash scripts/kcc.sh force-deploy
bash scripts/kcc.sh verify
Detection: Transient 5xx errors. Admin SQL queries fail briefly. CloudWatch RDS events.
Impact: 30-60 second disruption. Price requests still served from cache. New API key lookups and usage logging fail temporarily.
Auto-Recovery: Aurora handles failover automatically. Connection strings use the cluster endpoint which resolves to the new primary.
aws rds describe-db-clusters --db-cluster-identifier trinity-beast-aurora-cluster \
--region us-east-2 --query "DBClusters[0].{Status:Status,Endpoint:Endpoint}" --output table
curl -s -X POST -H "X-Admin-Key: $ADMIN_KEY" -H "Content-Type: application/json" \
-d '{"query":"SELECT 1 as alive","mode":"read"}' \
https://api.cpmp-site.org/admin/sql
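Because the failover window is 30-60 seconds, the SELECT 1 probe above may need several tries before it succeeds. A minimal sketch of a bounded retry follows; retry_until_ok and db_alive are hypothetical helper names, and the sketch assumes /admin/sql returns a non-2xx status while the database is unreachable, which the source does not confirm.

```shell
# Hypothetical helper: run a command until it succeeds or attempts run out.
# usage: retry_until_ok <max_attempts> <delay_seconds> <command> [args...]
retry_until_ok() {
  max=$1
  delay=$2
  shift 2
  attempt=1
  while :; do
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -ge "$max" ]; then
      echo "giving up after $attempt attempts" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}

# The SELECT 1 probe from this runbook, wrapped so that HTTP errors count as
# failures (curl -f). Assumes ADMIN_KEY is set in the environment.
db_alive() {
  curl -sf -X POST -H "X-Admin-Key: $ADMIN_KEY" -H "Content-Type: application/json" \
    -d '{"query":"SELECT 1 as alive","mode":"read"}' \
    https://api.cpmp-site.org/admin/sql >/dev/null
}
```

For example, retry_until_ok 12 5 db_alive probes every 5 seconds for up to about a minute, which covers the stated failover window.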
bash scripts/kcc.sh deploy-ecs
bash scripts/kcc.sh health
Detection: Incorrect data in API responses. Admin SQL queries return unexpected results.
Impact: Potentially serving incorrect prices, wrong API key data, or corrupted usage logs.
aws rds restore-db-cluster-to-point-in-time \
--source-db-cluster-identifier trinity-beast-aurora-cluster \
--db-cluster-identifier trinity-beast-aurora-recovery \
--restore-to-time "2026-05-02T12:00:00Z" \
--region us-east-2
aws rds create-db-instance \
--db-instance-identifier trinity-beast-aurora-recovery-instance \
--db-cluster-identifier trinity-beast-aurora-recovery \
--db-instance-class db.serverless \
--engine aurora-postgresql \
--region us-east-2
aws secretsmanager update-secret --secret-id trinity-beast-secrets \
--secret-string '{"host":"NEW_ENDPOINT","port":"5432",...}' \
--region us-east-2
bash scripts/kcc.sh deploy-ecs
Detection: bash scripts/kcc.sh daily shows Valkey offline. Cache hit rates drop. Latency increases.
Impact: Degraded performance. Price requests fall through to REST APIs. API key lookups fall through to Aurora. System continues to function — ElastiCache is a cache, not a primary store.
aws elasticache describe-cache-clusters --cache-cluster-id trinity-beast-cache-001 \
--region us-east-2 --output table
# Force deploy params (reloads app config to ElastiCache)
bash scripts/kcc.sh force-deploy
# Trigger nightly sync manually to repopulate API keys and usage data
aws ecs run-task --cluster trinity-beast-fargate-cluster \
--task-definition trinity-beast-sync-job \
--launch-type FARGATE \
--network-configuration "awsvpcConfiguration={subnets=[subnet-xxx],securityGroups=[sg-xxx],assignPublicIp=ENABLED}" \
--region us-east-2
bash scripts/kcc.sh feed-status
Detection: bash scripts/kcc.sh feed-status shows all feeds disconnected. Stale counts increase.
Impact: Prices served from ElastiCache (may become stale). REST fallback activates for cache misses. Kraken batch prewarm continues independently.
Auto-Recovery: Each WebSocket feed has automatic reconnection with exponential backoff. Typically reconnects within 30-60 seconds.
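For reference, capped exponential backoff doubles the wait after each failed attempt until it reaches a ceiling. The sketch below illustrates the general technique only; the 1-second base and 60-second cap are assumed values, not the settings the feeds actually use, and backoff_delay is a hypothetical name.

```shell
# Hypothetical sketch of capped exponential backoff:
#   delay = min(cap, base * 2^attempt)
# The defaults (base 1s, cap 60s) are illustrative assumptions.
backoff_delay() {
  # $1 = attempt number (0-based), $2 = base seconds, $3 = cap seconds
  attempt=$1
  delay=${2:-1}
  cap=${3:-60}
  i=0
  # Double once per attempt, stopping early once the cap is reached.
  while [ "$i" -lt "$attempt" ] && [ "$delay" -lt "$cap" ]; do
    delay=$((delay * 2))
    i=$((i + 1))
  done
  if [ "$delay" -gt "$cap" ]; then
    delay=$cap
  fi
  echo "$delay"
}
```

With the assumed defaults, attempts 0 through 6 yield waits of 1, 2, 4, 8, 16, 32, and 60 seconds.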
bash scripts/kcc.sh feed-status
bash scripts/kcc.sh deploy-ecs
bash scripts/kcc.sh feed-status (all 6 feeds should show Connected)
Detection: bash scripts/kcc.sh daily shows SQS pending > 0. CloudWatch SQS metrics show growing queue.
Impact: Zero impact on price serving. Usage logs are delayed but not lost (SQS retains messages for 4 days). Price requests continue at full speed.
aws sqs get-queue-attributes \
--queue-url https://sqs.us-east-2.amazonaws.com/211998422884/trinity-beast-queued-usage-logs \
--attribute-names ApproximateNumberOfMessages ApproximateNumberOfMessagesNotVisible \
--region us-east-2
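One way to confirm that the queue is self-draining is to compare two backlog readings taken a minute or so apart. The helpers below are a hypothetical sketch: queue_backlog adds the two attributes requested above (visible plus in-flight messages), and queue_trend compares an earlier reading with a later one.

```shell
# Hypothetical helper: total backlog = visible + in-flight messages.
queue_backlog() {
  # $1 = ApproximateNumberOfMessages, $2 = ApproximateNumberOfMessagesNotVisible
  echo $(( $1 + $2 ))
}

# Hypothetical helper: classify the change between two backlog readings.
queue_trend() {
  # $1 = earlier backlog, $2 = later backlog
  if [ "$2" -lt "$1" ]; then
    echo "draining"
  elif [ "$2" -gt "$1" ]; then
    echo "growing"
  else
    echo "steady"
  fi
}
```

A result of growing on repeated checks suggests the consumer is not keeping up and manual action is warranted; both SQS attributes are approximate, so a single steady reading is not conclusive.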
bash scripts/kcc.sh deploy-ecs
Detection: dig api.cpmp-site.org returns no results. All endpoints unreachable by domain name.
Impact: Complete outage for all clients using domain names. Direct IP access may still work.
dig api.cpmp-site.org
aws route53 list-resource-record-sets \
--hosted-zone-id <ZONE_ID> \
--query "ResourceRecordSets[?Name=='api.cpmp-site.org.']"
bash scripts/kcc.sh health
Detection: Website returns errors. CloudFront distribution shows issues.
Impact: Website down. API is unaffected (served by ALB, not CloudFront). TBCC may be inaccessible.
aws cloudfront get-distribution --id E110PRKEIYQVLL \
--query "Distribution.{Status:Status,DomainName:DomainName}" --output table
aws s3 ls s3://trinity-beast-website-east2/ --region us-east-2 | head -5
aws cloudfront create-invalidation --distribution-id E110PRKEIYQVLL --paths "/*" --region us-east-1
bash scripts/kcc.sh deploy-site cpmp-redesign/
Detection: Stripe webhooks failing. Post-checkout receipts not sent. CloudWatch Lambda errors.
Impact: Customers don't receive receipts. LRS add-on activation delayed. Subscriptions still process (Stripe handles payment).
aws lambda get-function --function-name trinity-beast-receipt \
--region us-east-2 --query "Configuration.{State:State,LastModified:LastModified}" --output table
aws logs filter-log-events --log-group-name "/aws/lambda/trinity-beast-receipt" \
--start-time $(python3 -c "import time; print(int((time.time()-3600)*1000))") \
--filter-pattern "ERROR" --limit 10 --region us-east-2 --query "events[*].message" --output text
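The --start-time value above shells out to python3 to get "one hour ago" in epoch milliseconds. If Python is not available on the machine, shell arithmetic gives the same result; the sketch below assumes a date that supports +%s (GNU and BSD both do), and epoch_ms_ago is a hypothetical helper name.

```shell
# Hypothetical helper: epoch milliseconds for "N seconds before now".
epoch_ms_ago() {
  # $1 = seconds to subtract from the current time
  echo $(( ($(date +%s) - $1) * 1000 ))
}
```

It can be dropped into the command as --start-time "$(epoch_ms_ago 3600)".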
Detection: Customers report 403 errors. bash scripts/kcc.sh security shows high block rate.
bash scripts/kcc.sh security
aws wafv2 get-sampled-requests --web-acl-arn <ARN> \
--rule-metric-name <RULE> --scope REGIONAL \
--time-window StartTime=$(date -u -v-1H +%Y-%m-%dT%H:%M:%SZ),EndTime=$(date -u +%Y-%m-%dT%H:%M:%SZ) \
--max-items 10 --region us-east-2
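The date -u -v-1H form used above is BSD syntax: it works on macOS but not with GNU date on most Linux hosts. The following sketch tries the GNU form first and falls back to the BSD form; utc_hours_ago is a hypothetical helper, and other date implementations (BusyBox, for example) may support neither form.

```shell
# Hypothetical helper: ISO-8601 UTC timestamp for "N hours before now".
utc_hours_ago() {
  # $1 = number of hours to go back
  # GNU date (most Linux systems) understands relative strings via -d.
  if date -u -d "$1 hours ago" +%Y-%m-%dT%H:%M:%SZ 2>/dev/null; then
    return 0
  fi
  # BSD date (macOS) adjusts the time with -v instead.
  date -u -v-"$1"H +%Y-%m-%dT%H:%M:%SZ
}
```

The window argument then becomes --time-window StartTime=$(utc_hours_ago 1),EndTime=$(date -u +%Y-%m-%dT%H:%M:%SZ) on either platform.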
Detection: Unexpected admin API calls in logs. Unauthorized configuration changes.
Impact: Attacker can read config, query database, modify parameters, flush caches.
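Rotation needs a fresh key value to substitute for the NEW_KEY_HERE placeholder. A minimal sketch for generating one from the kernel's random source follows; the 64-character hex format is an assumption (the source does not state which key formats the API accepts), and generate_admin_key is a hypothetical name.

```shell
# Hypothetical helper: print a random 64-character lowercase hex string.
# Assumes /dev/urandom plus the common head, od, and tr utilities.
generate_admin_key() {
  # 32 random bytes, hex-encoded; tr strips od's spacing and line breaks.
  head -c 32 /dev/urandom | od -An -tx1 | tr -d ' \n'
  echo
}
```

Capturing the value once, for example NEW_KEY=$(generate_admin_key), keeps the same key available for the bootstrap call and for each file that stores it.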
curl -s "https://api.cpmp-site.org/admin/bootstrap-key?key=NEW_KEY_HERE"
Update the key in every place it is stored:
.kiro/steering/kiro-command-center.md
scripts/kcc.sh (ADMIN_KEY variable)
cpmp-redesign/admin/trinity-beast-command-center.html
bash scripts/kcc.sh force-deploy
Detection: All services unreachable. AWS Health Dashboard shows regional issue.
Impact: Complete outage. No API, no website (CloudFront may serve cached pages), no database.
bash scripts/kcc.sh verify
bash scripts/kcc.sh force-deploy
bash scripts/kcc.sh sync-check
Future mitigation: Consider Aurora Global Database (cross-region read replica) and a standby ECS cluster in us-east-1 for critical-path failover.
| Resource | Location |
|---|---|
| KCC Health Check | bash scripts/kcc.sh health |
| KCC Full Verification | bash scripts/kcc.sh verify |
| KCC Force Deploy | bash scripts/kcc.sh force-deploy |
| KCC Security Dashboard | bash scripts/kcc.sh security |
| AWS Health Dashboard | health.aws.amazon.com |
| AWS Console (us-east-2) | us-east-2.console.aws.amazon.com |
| TBCC (Browser Console) | cpmp-site.org/admin/trinity-beast-command-center.html |
| Stripe Dashboard | dashboard.stripe.com |
| Admin API Key | Stored in .kiro/steering/kiro-command-center.md and scripts/kcc.sh |
| AWS Secrets | trinity-beast-secrets in Secrets Manager (us-east-2) |