The Trinity Beast — Disaster Recovery Runbook

Operational procedures for every failure scenario — from routine to catastrophic

Version: 1.0 | Updated: May 2026 | Classification: Operational — Internal Use

Overview & Quick Reference

This runbook covers every failure scenario for The Trinity Beast infrastructure. Each scenario includes severity, detection method, impact, step-by-step recovery, and verification. All commands use the KCC (scripts/kcc.sh) or direct AWS CLI and curl calls.

First Response — Always: Run bash scripts/kcc.sh health to assess the current state before taking any action.
Scenario                        | Severity | Auto-Recovery?                        | RTO
--------------------------------|----------|---------------------------------------|----------
Single container failure        | Low      | Yes — ECS auto-replaces               | 2-3 min
Full ECS cluster outage         | Critical | Partial — ECS attempts recovery       | 5-10 min
Aurora failover                 | High     | Yes — automatic failover              | 30-60 sec
Aurora data corruption          | Critical | No — manual PITR required             | 15-30 min
ElastiCache failure             | Medium   | Partial — app degrades gracefully     | 5-10 min
All WebSocket feeds lost        | Medium   | Yes — auto-reconnect + REST fallback  | 1-2 min
SQS queue backup                | Low      | Yes — self-draining                   | Minutes
DNS failure                     | Critical | No — manual intervention              | 5-15 min
CloudFront / website down       | Medium   | Partial — CloudFront auto-heals       | 5-10 min
Lambda failure                  | Low      | Yes — retries automatically           | Immediate
WAF blocking legitimate traffic | High     | No — manual rule adjustment           | 5 min
Admin key compromise            | Critical | No — manual rotation                  | 2 min
Full region failure             | Critical | No — requires multi-region            | Hours

Scenario 1: ECS Container Failure

Severity: Low. One of 3 containers stops responding.

Detection: bash scripts/kcc.sh health shows 2/3 nodes. CloudWatch alarm TrinityBeast-API-5xx-Spike may fire.
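To check the alarm state directly (a quick sketch using the alarm name above):
aws cloudwatch describe-alarms --alarm-names TrinityBeast-API-5xx-Spike \
  --region us-east-2 --query "MetricAlarms[0].StateValue" --output text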

Impact: Minimal. ALB routes traffic to remaining 2 containers. Each container has independent WebSocket feeds.

Auto-Recovery: ECS detects unhealthy task and launches replacement within 2-3 minutes.
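To watch the replacement land, the recent service events show task stop/start activity (substitute the affected service name):
aws ecs describe-services --cluster trinity-beast-fargate-cluster \
  --services <affected-service> --region us-east-2 \
  --query "services[0].events[:5].message" --output text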

1. Verify: bash scripts/kcc.sh health
2. If not auto-recovering, force deploy:
aws ecs update-service --cluster trinity-beast-fargate-cluster \
  --service <affected-service> --force-new-deployment --region us-east-2
3. Wait 40 seconds, verify: bash scripts/kcc.sh health
4. Force deploy params to sync the new container: bash scripts/kcc.sh force-deploy

Scenario 2: Full ECS Cluster Outage

Severity: Critical. All 3 services are down.

Detection: bash scripts/kcc.sh health shows 0/3 nodes. Both /health endpoints return errors.

Impact: Complete API outage. No price requests served. No admin endpoints available.

1. Check ECS service status:
for svc in trinity-beast-main-service trinity-beast-mirror-service trinity-beast-lrs-service; do
  echo "=== $svc ==="
  aws ecs describe-services --cluster trinity-beast-fargate-cluster \
    --services $svc --region us-east-2 \
    --query "services[0].{Status:status,Running:runningCount,Desired:desiredCount,Events:events[:3]}" \
    --output json
done
2. Force redeploy all 3: bash scripts/kcc.sh deploy-ecs (rebuilds and pushes fresh image)
3. If ECR image is the problem, check last known good image:
aws ecr describe-images --repository-name trinity-beast-lpo-server \
  --region us-east-2 --query "sort_by(imageDetails,&imagePushedAt)[-5:]" --output table
4. Wait 60 seconds, verify: bash scripts/kcc.sh health
5. Sync params: bash scripts/kcc.sh force-deploy
6. Full verification: bash scripts/kcc.sh verify

Scenario 3: Aurora Database Failover

Severity: High. Aurora Serverless v2 automatic failover.

Detection: Transient 5xx errors. Admin SQL queries fail briefly. CloudWatch RDS events.

Impact: 30-60 second disruption. Price requests still served from cache. New API key lookups and usage logging fail temporarily.

Auto-Recovery: Aurora handles failover automatically. Connection strings use the cluster endpoint which resolves to the new primary.
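To confirm which instance is the writer after failover (a minimal check):
aws rds describe-db-clusters --db-cluster-identifier trinity-beast-aurora-cluster \
  --region us-east-2 \
  --query "DBClusters[0].DBClusterMembers[?IsClusterWriter].DBInstanceIdentifier" \
  --output text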

1. Check Aurora status:
aws rds describe-db-clusters --db-cluster-identifier trinity-beast-aurora-cluster \
  --region us-east-2 --query "DBClusters[0].{Status:Status,Endpoint:Endpoint}" --output table
2. Verify connectivity via admin SQL:
curl -s -X POST -H "X-Admin-Key: $ADMIN_KEY" -H "Content-Type: application/json" \
  -d '{"query":"SELECT 1 as alive","mode":"read"}' \
  https://api.cpmp-site.org/admin/sql
3. If connections are stale, force deploy to reset: bash scripts/kcc.sh deploy-ecs
4. Verify: bash scripts/kcc.sh health

Scenario 4: Aurora Data Corruption

Severity: Critical. Data integrity issue requiring point-in-time recovery.

Detection: Incorrect data in API responses. Admin SQL queries return unexpected results.

Impact: Potentially serving incorrect prices, wrong API key data, or corrupted usage logs.

CAUTION: Point-in-time recovery creates a NEW cluster. You must update the connection string in Secrets Manager and redeploy.
1. Identify the last known good time (check CloudWatch logs, admin SQL)
2. Initiate PITR:
aws rds restore-db-cluster-to-point-in-time \
  --source-db-cluster-identifier trinity-beast-aurora-cluster \
  --db-cluster-identifier trinity-beast-aurora-recovery \
  --restore-to-time "2026-05-02T12:00:00Z" \
  --region us-east-2
3. Create a DB instance in the new cluster, then wait for the cluster to become available (see the poll loop after this list):
aws rds create-db-instance \
  --db-instance-identifier trinity-beast-aurora-recovery-instance \
  --db-cluster-identifier trinity-beast-aurora-recovery \
  --db-instance-class db.serverless \
  --engine aurora-postgresql \
  --region us-east-2
4. Update Secrets Manager with new endpoint:
aws secretsmanager update-secret --secret-id trinity-beast-secrets \
  --secret-string '{"host":"NEW_ENDPOINT","port":"5432",...}' \
  --region us-east-2
5. Redeploy ECS to pick up new connection: bash scripts/kcc.sh deploy-ecs
6. Verify data integrity via admin SQL
7. Delete old cluster once confirmed
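The restored cluster takes several minutes to become available; a simple poll between steps 3 and 4 avoids updating Secrets Manager too early:
until [ "$(aws rds describe-db-clusters \
  --db-cluster-identifier trinity-beast-aurora-recovery \
  --region us-east-2 --query "DBClusters[0].Status" --output text)" = "available" ]; do
  echo "Waiting for recovery cluster..."; sleep 30
done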

Scenario 5: ElastiCache Failure / Flush

Severity: Medium. ElastiCache node failure or accidental flush.

Detection: bash scripts/kcc.sh daily shows Valkey offline. Cache hit rates drop. Latency increases.
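To confirm the hit-rate drop from CloudWatch (a sketch; CacheHits is a standard ElastiCache metric, and the BSD-style date syntax matches the WAF commands later in this runbook):
aws cloudwatch get-metric-statistics --namespace AWS/ElastiCache \
  --metric-name CacheHits --dimensions Name=CacheClusterId,Value=trinity-beast-cache-001 \
  --start-time $(date -u -v-1H +%Y-%m-%dT%H:%M:%SZ) --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
  --period 300 --statistics Sum --region us-east-2 --output table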

Impact: Degraded performance. Price requests fall through to REST APIs. API key lookups fall through to Aurora. System continues to function — ElastiCache is a cache, not a primary store.

1. Check ElastiCache status:
aws elasticache describe-cache-clusters --cache-cluster-id trinity-beast-cache-001 \
  --region us-east-2 --output table
2. If node is down, wait for auto-recovery (ElastiCache replaces failed nodes)
3. Once back, repopulate caches:
# Force deploy params (reloads app config to ElastiCache)
bash scripts/kcc.sh force-deploy

# Trigger nightly sync manually to repopulate API keys and usage data
aws ecs run-task --cluster trinity-beast-fargate-cluster \
  --task-definition trinity-beast-sync-job \
  --launch-type FARGATE \
  --network-configuration "awsvpcConfiguration={subnets=[subnet-xxx],securityGroups=[sg-xxx],assignPublicIp=ENABLED}" \
  --region us-east-2
4. WebSocket feeds will repopulate price cache within seconds of container restart
5. Verify: bash scripts/kcc.sh feed-status

Scenario 6: WebSocket Feed Loss (All Exchanges)

Severity: Medium. All 6 exchange WebSocket connections lost.

Detection: bash scripts/kcc.sh feed-status shows all feeds disconnected. Stale counts increase.

Impact: Prices served from ElastiCache (may become stale). REST fallback activates for cache misses. Kraken batch prewarm continues independently.

Auto-Recovery: Each WebSocket feed has automatic reconnection with exponential backoff. Typically reconnects within 30-60 seconds.

1. Check feed status: bash scripts/kcc.sh feed-status
2. If feeds don't auto-reconnect within 2 minutes, force redeploy:
bash scripts/kcc.sh deploy-ecs
3. After containers restart, feeds reconnect automatically
4. Verify: bash scripts/kcc.sh feed-status — all 6 should show Connected
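To automate step 4's check, a hypothetical poll (the output format of kcc.sh feed-status is an assumption; adjust the grep pattern to the script's actual output):
# Assumes feed-status prints one "Connected" line per feed
until [ "$(bash scripts/kcc.sh feed-status | grep -c 'Connected')" -eq 6 ]; do
  sleep 15
done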

Scenario 7: SQS Queue Backup / Failure

LowUsage log queue backing up or SQS service issue

Detection: bash scripts/kcc.sh daily shows SQS pending > 0. CloudWatch SQS metrics show growing queue.

Impact: Zero impact on price serving. Usage logs are delayed but not lost (SQS retains messages for 4 days). Price requests continue at full speed.

1. Check queue depth:
aws sqs get-queue-attributes \
  --queue-url https://sqs.us-east-2.amazonaws.com/211998422884/trinity-beast-queued-usage-logs \
  --attribute-names ApproximateNumberOfMessages ApproximateNumberOfMessagesNotVisible \
  --region us-east-2
2. If consumer is stuck, redeploy ECS to restart consumers: bash scripts/kcc.sh deploy-ecs
3. Queue will self-drain as consumers process backlog
4. Monitor until queue returns to 0
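To watch the backlog drain without re-running the command by hand:
watch -n 30 "aws sqs get-queue-attributes \
  --queue-url https://sqs.us-east-2.amazonaws.com/211998422884/trinity-beast-queued-usage-logs \
  --attribute-names ApproximateNumberOfMessages \
  --query 'Attributes.ApproximateNumberOfMessages' --output text --region us-east-2"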

Scenario 8: DNS / Route 53 Failure

Severity: Critical. api.cpmp-site.org or lrs.cpmp-site.org not resolving.

Detection: dig api.cpmp-site.org returns no results. All endpoints unreachable by domain name.

Impact: Complete outage for all clients using domain names. Direct IP access may still work.

1. Verify DNS resolution: dig api.cpmp-site.org
2. Check Route 53 hosted zone:
aws route53 list-resource-record-sets \
  --hosted-zone-id <ZONE_ID> \
  --query "ResourceRecordSets[?Name=='api.cpmp-site.org.']"
3. If records are missing, recreate A/AAAA alias records pointing to the ALB (see the sketch after this list)
4. DNS propagation takes 60-300 seconds
5. Verify: bash scripts/kcc.sh health
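A hedged sketch of step 3's record recreation; the ALB DNS name and its canonical hosted zone ID are placeholders (look them up with aws elbv2 describe-load-balancers):
aws route53 change-resource-record-sets --hosted-zone-id <ZONE_ID> \
  --change-batch '{"Changes":[{"Action":"UPSERT","ResourceRecordSet":{
    "Name":"api.cpmp-site.org","Type":"A",
    "AliasTarget":{"HostedZoneId":"<ALB_HOSTED_ZONE_ID>",
                   "DNSName":"<ALB_DNS_NAME>","EvaluateTargetHealth":false}}}]}'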

Scenario 9: CloudFront / Website Down

Severity: Medium. cpmp-site.org website not loading.

Detection: Website returns errors. CloudFront distribution shows issues.

Impact: Website down. API is unaffected (served by ALB, not CloudFront). TBCC may be inaccessible.

1. Check CloudFront distribution:
aws cloudfront get-distribution --id E110PRKEIYQVLL \
  --query "Distribution.{Status:Status,DomainName:DomainName}" --output table
2. If S3 origin is the issue, check bucket:
aws s3 ls s3://trinity-beast-website-east2/ --region us-east-2 | head -5
3. Invalidate CloudFront cache: aws cloudfront create-invalidation --distribution-id E110PRKEIYQVLL --paths "/*" --region us-east-1
4. If files are missing, redeploy: bash scripts/kcc.sh deploy-site cpmp-redesign/
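The invalidation in step 3 returns an Id; to confirm it completes:
aws cloudfront get-invalidation --distribution-id E110PRKEIYQVLL \
  --id <INVALIDATION_ID> --query "Invalidation.Status" --output text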

Scenario 10: Lambda Failure (Receipt Processing)

Severity: Low. trinity-beast-receipt Lambda failing.

Detection: Stripe webhooks failing. Post-checkout receipts not sent. CloudWatch Lambda errors.

Impact: Customers don't receive receipts. LRS add-on activation delayed. Subscriptions still process (Stripe handles payment).

1. Check Lambda status:
aws lambda get-function --function-name trinity-beast-receipt \
  --region us-east-2 --query "Configuration.{State:State,LastModified:LastModified}" --output table
2. Check recent errors:
aws logs filter-log-events --log-group-name "/aws/lambda/trinity-beast-receipt" \
  --start-time $(python3 -c "import time; print(int((time.time()-3600)*1000))") \
  --filter-pattern "ERROR" --limit 10 --region us-east-2 --query "events[*].message" --output text
3. If code issue, redeploy Lambda from trinity-beast-receipt-lambda/
4. Stripe will retry failed webhooks automatically for up to 3 days

Scenario 11: WAF Blocking Legitimate Traffic

Severity: High. WAF rules blocking valid API requests.

Detection: Customers report 403 errors. bash scripts/kcc.sh security shows high block rate.

1. Check WAF metrics: bash scripts/kcc.sh security
2. Identify which rule is blocking:
aws wafv2 get-sampled-requests --web-acl-arn <ARN> \
  --rule-metric-name <RULE> --scope REGIONAL \
  --time-window StartTime=$(date -u -v-1H +%Y-%m-%dT%H:%M:%SZ),EndTime=$(date -u +%Y-%m-%dT%H:%M:%SZ) \
  --max-items 10 --region us-east-2
3. If rate limit is too aggressive, adjust via AWS Console or CLI
4. For emergency: temporarily set rule action to COUNT instead of BLOCK
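Flipping a rule to COUNT via the CLI means rewriting the rule set with the current lock token; a hedged sketch (ACL name, ID, and metric name are placeholders):
# Fetch the current ACL and its lock token
aws wafv2 get-web-acl --name <ACL_NAME> --scope REGIONAL --id <ACL_ID> \
  --region us-east-2 > acl.json
# Edit the offending rule's JSON: set "Action": {"Count": {}}
# (for managed rule groups, set "OverrideAction": {"Count": {}} instead)
aws wafv2 update-web-acl --name <ACL_NAME> --scope REGIONAL --id <ACL_ID> \
  --default-action Allow={} --rules file://rules.json \
  --visibility-config SampledRequestsEnabled=true,CloudWatchMetricsEnabled=true,MetricName=<METRIC> \
  --lock-token $(jq -r '.LockToken' acl.json) --region us-east-2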

Scenario 12: Admin Key Compromise

Severity: Critical. Admin API key exposed or compromised.

Detection: Unexpected admin API calls in logs. Unauthorized configuration changes.

Impact: Attacker can read config, query database, modify parameters, flush caches.

IMMEDIATE ACTION REQUIRED — Rotate the key within minutes.
1. Generate a new admin key (any strong random string; see the one-liner after this list)
2. Update in Aurora:
curl -s "https://api.cpmp-site.org/admin/bootstrap-key?key=NEW_KEY_HERE"
3. Update KCC steering file: .kiro/steering/kiro-command-center.md
4. Update KCC script: scripts/kcc.sh (ADMIN_KEY variable)
5. Update TBCC: cpmp-redesign/admin/trinity-beast-command-center.html
6. Force deploy params to propagate: bash scripts/kcc.sh force-deploy
7. Review CloudWatch logs for unauthorized access during the exposure window
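For step 1, any cryptographically strong random string works; for example:
openssl rand -hex 32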

Scenario 13: Full Region Failure (us-east-2)

CriticalAWS us-east-2 (Ohio) region is down

Detection: All services unreachable. AWS Health Dashboard shows regional issue.

Impact: Complete outage. No API, no website (CloudFront may serve cached pages), no database.

Current posture: The Trinity Beast is single-region (us-east-2). Full region failure requires waiting for AWS to restore the region. Multi-region failover is a future enhancement.
1. Confirm the regional outage on the AWS Health Dashboard
2. CloudFront (global) may continue serving cached website pages
3. Communicate the outage to subscribers if prolonged
4. Once the region recovers, run full verification: bash scripts/kcc.sh verify
5. Force deploy params: bash scripts/kcc.sh force-deploy
6. Check sync job: bash scripts/kcc.sh sync-check

Future mitigation: Consider Aurora Global Database (cross-region read replica) and a standby ECS cluster in us-east-1 for critical-path failover.

Emergency Contacts & Resources

Resource                | Location
------------------------|---------
KCC Health Check        | bash scripts/kcc.sh health
KCC Full Verification   | bash scripts/kcc.sh verify
KCC Force Deploy        | bash scripts/kcc.sh force-deploy
KCC Security Dashboard  | bash scripts/kcc.sh security
AWS Health Dashboard    | health.aws.amazon.com
AWS Console (us-east-2) | us-east-2.console.aws.amazon.com
TBCC (Browser Console)  | cpmp-site.org/admin/trinity-beast-command-center.html
Stripe Dashboard        | dashboard.stripe.com
Admin API Key           | Stored in .kiro/steering/kiro-command-center.md and scripts/kcc.sh
AWS Secrets             | trinity-beast-secrets in Secrets Manager (us-east-2)