The Trinity Beast — Disaster Recovery Runbook

Operational procedures for every failure scenario — from routine to catastrophic

Version: 1.0 | Updated: May 2026 | Classification: Operational — Internal Use

Overview & Quick Reference

This runbook covers every failure scenario for The Trinity Beast infrastructure. Each scenario includes severity, detection method, impact, step-by-step recovery, and verification. All commands use the KCC (scripts/kcc.sh) or direct AWS CLI and curl calls.

First Response — Always: Run bash scripts/kcc.sh health to assess the current state before taking any action.
Scenario                        | Severity | Auto-Recovery?                        | RTO
--------------------------------|----------|---------------------------------------|----------
Single container failure        | Low      | Yes — ECS auto-replaces               | 2-3 min
Full ECS cluster outage         | Critical | Partial — ECS attempts recovery       | 5-10 min
Aurora failover                 | High     | Yes — automatic failover              | 30-60 sec
Aurora data corruption          | Critical | No — manual PITR required             | 15-30 min
ElastiCache failure             | Medium   | Partial — app degrades gracefully     | 5-10 min
All WebSocket feeds lost        | Medium   | Yes — auto-reconnect + REST fallback  | 1-2 min
SQS queue backup                | Low      | Yes — self-draining                   | Minutes
DNS failure                     | Critical | No — manual intervention              | 5-15 min
CloudFront / website down       | Medium   | Partial — CloudFront auto-heals       | 5-10 min
Lambda failure                  | Low      | Yes — retries automatically           | Immediate
WAF blocking legitimate traffic | High     | No — manual rule adjustment           | 5 min
Admin key compromise            | Critical | No — manual rotation                  | 2 min
Full region failure             | Critical | No — requires multi-region            | Hours

Scenario 1: ECS Container Failure

Severity: Low. One of 3 containers stops responding.

Detection: bash scripts/kcc.sh health shows 2/3 nodes. CloudWatch alarm TrinityBeast-API-5xx-Spike may fire.
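To check the alarm state directly (a quick sketch using the alarm name above):
aws cloudwatch describe-alarms --alarm-names TrinityBeast-API-5xx-Spike \
  --region us-east-2 --query "MetricAlarms[0].StateValue" --output text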

Impact: Minimal. ALB routes traffic to remaining 2 containers. Each container has independent WebSocket feeds.

Auto-Recovery: ECS detects unhealthy task and launches replacement within 2-3 minutes.
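To watch the replacement land, the recent service events show task stop/start activity (substitute the affected service name):
aws ecs describe-services --cluster trinity-beast-fargate-cluster \
  --services <affected-service> --region us-east-2 \
  --query "services[0].events[:5].message" --output text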

1. Verify: bash scripts/kcc.sh health
2. If not auto-recovering, force deploy:
aws ecs update-service --cluster trinity-beast-fargate-cluster \
  --service <affected-service> --force-new-deployment --region us-east-2
3. Wait 40 seconds, verify: bash scripts/kcc.sh health
4. Force deploy params to sync the new container: bash scripts/kcc.sh force-deploy

Scenario 2: Full ECS Cluster Outage

Severity: Critical. All 3 services are down.

Detection: bash scripts/kcc.sh health shows 0/3 nodes. Both /health endpoints return errors.

Impact: Complete API outage. No price requests served. No admin endpoints available.

1. Check ECS service status:
for svc in trinity-beast-main-service trinity-beast-mirror-service trinity-beast-lrs-service; do
  echo "=== $svc ==="
  aws ecs describe-services --cluster trinity-beast-fargate-cluster \
    --services $svc --region us-east-2 \
    --query "services[0].{Status:status,Running:runningCount,Desired:desiredCount,Events:events[:3]}" \
    --output json
done
2. Force redeploy all 3: bash scripts/kcc.sh deploy-ecs (rebuilds and pushes fresh image)
3. If ECR image is the problem, check last known good image:
aws ecr describe-images --repository-name trinity-beast-lpo-server \
  --region us-east-2 --query "sort_by(imageDetails,&imagePushedAt)[-5:]" --output table
4. Wait 60 seconds, verify: bash scripts/kcc.sh health
5. Sync params: bash scripts/kcc.sh force-deploy
6. Full verification: bash scripts/kcc.sh verify

Scenario 3: Aurora Database Failover

Severity: High. Aurora Serverless v2 automatic failover.

Detection: Transient 5xx errors. Admin SQL queries fail briefly. CloudWatch RDS events.

Impact: 30-60 second disruption. Price requests still served from cache. New API key lookups and usage logging fail temporarily.

Auto-Recovery: Aurora handles failover automatically. Connection strings use the cluster endpoint which resolves to the new primary.
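To confirm which instance is the writer after failover (a minimal check):
aws rds describe-db-clusters --db-cluster-identifier trinity-beast-aurora-cluster \
  --region us-east-2 \
  --query "DBClusters[0].DBClusterMembers[?IsClusterWriter].DBInstanceIdentifier" \
  --output text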

1. Check Aurora status:
aws rds describe-db-clusters --db-cluster-identifier trinity-beast-aurora-cluster \
  --region us-east-2 --query "DBClusters[0].{Status:Status,Endpoint:Endpoint}" --output table
2. Verify connectivity via admin SQL:
curl -s -X POST -H "X-Admin-Key: $ADMIN_KEY" -H "Content-Type: application/json" \
  -d '{"query":"SELECT 1 as alive","mode":"read"}' \
  https://api.cpmp-site.org/admin/sql
3. If connections are stale, force deploy to reset: bash scripts/kcc.sh deploy-ecs
4. Verify: bash scripts/kcc.sh health

Scenario 4: Aurora Data Corruption

Severity: Critical. Data integrity issue requiring point-in-time recovery.

Detection: Incorrect data in API responses. Admin SQL queries return unexpected results.

Impact: Potentially serving incorrect prices, wrong API key data, or corrupted usage logs.

CAUTION: Point-in-time recovery creates a NEW cluster. You must update the connection string in Secrets Manager and redeploy.
1. Identify the last known good time (check CloudWatch logs, admin SQL)
2. Initiate PITR:
aws rds restore-db-cluster-to-point-in-time \
  --source-db-cluster-identifier trinity-beast-aurora-cluster \
  --db-cluster-identifier trinity-beast-aurora-recovery \
  --restore-to-time "2026-05-02T12:00:00Z" \
  --region us-east-2
3. Create a DB instance in the new cluster, then wait for the cluster to become available (see the poll loop after this list):
aws rds create-db-instance \
  --db-instance-identifier trinity-beast-aurora-recovery-instance \
  --db-cluster-identifier trinity-beast-aurora-recovery \
  --db-instance-class db.serverless \
  --engine aurora-postgresql \
  --region us-east-2
4. Update Secrets Manager with new endpoint:
aws secretsmanager update-secret --secret-id trinity-beast-secrets \
  --secret-string '{"host":"NEW_ENDPOINT","port":"5432",...}' \
  --region us-east-2
5. Redeploy ECS to pick up new connection: bash scripts/kcc.sh deploy-ecs
6. Verify data integrity via admin SQL
7. Delete old cluster once confirmed
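The restored cluster takes several minutes to become available; a simple poll between steps 3 and 4 avoids updating Secrets Manager too early:
until [ "$(aws rds describe-db-clusters \
  --db-cluster-identifier trinity-beast-aurora-recovery \
  --region us-east-2 --query "DBClusters[0].Status" --output text)" = "available" ]; do
  echo "Waiting for recovery cluster..."; sleep 30
done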

Scenario 5: ElastiCache Failure / Flush

Severity: Medium. ElastiCache node failure or accidental flush.

Detection: bash scripts/kcc.sh daily shows Valkey offline. Cache hit rates drop. Latency increases.
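To confirm the hit-rate drop from CloudWatch (a sketch; CacheHits is a standard ElastiCache metric, and the BSD-style date syntax matches the WAF commands later in this runbook):
aws cloudwatch get-metric-statistics --namespace AWS/ElastiCache \
  --metric-name CacheHits --dimensions Name=CacheClusterId,Value=trinity-beast-cache-001 \
  --start-time $(date -u -v-1H +%Y-%m-%dT%H:%M:%SZ) --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
  --period 300 --statistics Sum --region us-east-2 --output table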

Impact: Degraded performance. Price requests fall through to REST APIs. API key lookups fall through to Aurora. System continues to function — ElastiCache is a cache, not a primary store.

1. Check ElastiCache status:
aws elasticache describe-cache-clusters --cache-cluster-id trinity-beast-cache-001 \
  --region us-east-2 --output table
2. If node is down, wait for auto-recovery (ElastiCache replaces failed nodes)
3. Once back, repopulate caches:
# Force deploy params (reloads app config to ElastiCache)
bash scripts/kcc.sh force-deploy

# Trigger nightly sync manually to repopulate API keys and usage data
aws ecs run-task --cluster trinity-beast-fargate-cluster \
  --task-definition trinity-beast-sync-job \
  --launch-type FARGATE \
  --network-configuration "awsvpcConfiguration={subnets=[subnet-xxx],securityGroups=[sg-xxx],assignPublicIp=ENABLED}" \
  --region us-east-2
4. WebSocket feeds will repopulate price cache within seconds of container restart
5. Verify: bash scripts/kcc.sh feed-status

Scenario 6: WebSocket Feed Loss (All Exchanges)

Severity: Medium. All 6 exchange WebSocket connections lost.

Detection: bash scripts/kcc.sh feed-status shows all feeds disconnected. Stale counts increase.

Impact: Prices served from ElastiCache (may become stale). REST fallback activates for cache misses. Kraken batch prewarm continues independently.

Auto-Recovery: Each WebSocket feed has automatic reconnection with exponential backoff. Typically reconnects within 30-60 seconds.

1. Check feed status: bash scripts/kcc.sh feed-status
2. If feeds don't auto-reconnect within 2 minutes, force redeploy:
bash scripts/kcc.sh deploy-ecs
3. After containers restart, feeds reconnect automatically
4. Verify: bash scripts/kcc.sh feed-status — all 6 should show Connected
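To automate step 4's check, a hypothetical poll (the output format of kcc.sh feed-status is an assumption; adjust the grep pattern to the script's actual output):
# Assumes feed-status prints one "Connected" line per feed
until [ "$(bash scripts/kcc.sh feed-status | grep -c 'Connected')" -eq 6 ]; do
  sleep 15
done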

Scenario 7: SQS Queue Backup / Failure

LowUsage log queue backing up or SQS service issue

Detection: bash scripts/kcc.sh daily shows SQS pending > 0. CloudWatch SQS metrics show growing queue.

Impact: Zero impact on price serving. Usage logs are delayed but not lost (SQS retains messages for 4 days). Price requests continue at full speed.

1. Check queue depth:
aws sqs get-queue-attributes \
  --queue-url https://sqs.us-east-2.amazonaws.com/211998422884/trinity-beast-queued-usage-logs \
  --attribute-names ApproximateNumberOfMessages ApproximateNumberOfMessagesNotVisible \
  --region us-east-2
2. If consumer is stuck, redeploy ECS to restart consumers: bash scripts/kcc.sh deploy-ecs
3. Queue will self-drain as consumers process backlog
4. Monitor until queue returns to 0
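To watch the backlog drain without re-running the command by hand:
watch -n 30 "aws sqs get-queue-attributes \
  --queue-url https://sqs.us-east-2.amazonaws.com/211998422884/trinity-beast-queued-usage-logs \
  --attribute-names ApproximateNumberOfMessages \
  --query 'Attributes.ApproximateNumberOfMessages' --output text --region us-east-2"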

Scenario 8: DNS / Route 53 Failure

Severity: Critical. api.cpmp-site.org or lrs.cpmp-site.org not resolving.

Detection: dig api.cpmp-site.org returns no results. All endpoints unreachable by domain name.

Impact: Complete outage for all clients using domain names. Direct IP access may still work.

1. Verify DNS resolution: dig api.cpmp-site.org
2. Check Route 53 hosted zone:
aws route53 list-resource-record-sets \
  --hosted-zone-id <ZONE_ID> \
  --query "ResourceRecordSets[?Name=='api.cpmp-site.org.']"
3. If records are missing, recreate A/AAAA alias records pointing to the ALB (see the sketch after this list)
4. DNS propagation takes 60-300 seconds
5. Verify: bash scripts/kcc.sh health
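A hedged sketch of step 3's record recreation; the ALB DNS name and its canonical hosted zone ID are placeholders (look them up with aws elbv2 describe-load-balancers):
aws route53 change-resource-record-sets --hosted-zone-id <ZONE_ID> \
  --change-batch '{"Changes":[{"Action":"UPSERT","ResourceRecordSet":{
    "Name":"api.cpmp-site.org","Type":"A",
    "AliasTarget":{"HostedZoneId":"<ALB_HOSTED_ZONE_ID>",
                   "DNSName":"<ALB_DNS_NAME>","EvaluateTargetHealth":false}}}]}'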

Scenario 9: CloudFront / Website Down

Severity: Medium. cpmp-site.org website not loading.

Detection: Website returns errors. CloudFront distribution shows issues.

Impact: Website down. API is unaffected (served by ALB, not CloudFront). TBCC may be inaccessible.

1. Check CloudFront distribution:
aws cloudfront get-distribution --id E110PRKEIYQVLL \
  --query "Distribution.{Status:Status,DomainName:DomainName}" --output table
2. If S3 origin is the issue, check bucket:
aws s3 ls s3://trinity-beast-website-east2/ --region us-east-2 | head -5
3. Invalidate CloudFront cache: aws cloudfront create-invalidation --distribution-id E110PRKEIYQVLL --paths "/*" --region us-east-1
4. If files are missing, redeploy: bash scripts/kcc.sh deploy-site cpmp-redesign/
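The invalidation in step 3 returns an Id; to confirm it completes:
aws cloudfront get-invalidation --distribution-id E110PRKEIYQVLL \
  --id <INVALIDATION_ID> --query "Invalidation.Status" --output text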

Scenario 10: Lambda Failure (Receipt Processing)

Severity: Low. trinity-beast-receipt Lambda failing.

Detection: Stripe webhooks failing. Post-checkout receipts not sent. CloudWatch Lambda errors.

Impact: Customers don't receive receipts. LRS add-on activation delayed. Subscriptions still process (Stripe handles payment).

1. Check Lambda status:
aws lambda get-function --function-name trinity-beast-receipt \
  --region us-east-2 --query "Configuration.{State:State,LastModified:LastModified}" --output table
2. Check recent errors:
aws logs filter-log-events --log-group-name "/aws/lambda/trinity-beast-receipt" \
  --start-time $(python3 -c "import time; print(int((time.time()-3600)*1000))") \
  --filter-pattern "ERROR" --limit 10 --region us-east-2 --query "events[*].message" --output text
3. If code issue, redeploy Lambda from trinity-beast-receipt-lambda/
4. Stripe will retry failed webhooks automatically for up to 3 days

Scenario 11: WAF Blocking Legitimate Traffic

Severity: High. WAF rules blocking valid API requests.

Detection: Customers report 403 errors. bash scripts/kcc.sh security shows high block rate.

1. Check WAF metrics: bash scripts/kcc.sh security
2. Identify which rule is blocking:
aws wafv2 get-sampled-requests --web-acl-arn <ARN> \
  --rule-metric-name <RULE> --scope REGIONAL \
  --time-window StartTime=$(date -u -v-1H +%Y-%m-%dT%H:%M:%SZ),EndTime=$(date -u +%Y-%m-%dT%H:%M:%SZ) \
  --max-items 10 --region us-east-2
3. If rate limit is too aggressive, adjust via AWS Console or CLI
4. For emergency: temporarily set rule action to COUNT instead of BLOCK
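Flipping a rule to COUNT via the CLI means rewriting the rule set with the current lock token; a hedged sketch (ACL name, ID, and metric name are placeholders):
# Fetch the current ACL and its lock token
aws wafv2 get-web-acl --name <ACL_NAME> --scope REGIONAL --id <ACL_ID> \
  --region us-east-2 > acl.json
# Edit the offending rule's JSON: set "Action": {"Count": {}}
# (for managed rule groups, set "OverrideAction": {"Count": {}} instead)
aws wafv2 update-web-acl --name <ACL_NAME> --scope REGIONAL --id <ACL_ID> \
  --default-action Allow={} --rules file://rules.json \
  --visibility-config SampledRequestsEnabled=true,CloudWatchMetricsEnabled=true,MetricName=<METRIC> \
  --lock-token $(jq -r '.LockToken' acl.json) --region us-east-2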

Scenario 12: Admin Key Compromise

Severity: Critical. Admin API key exposed or compromised.

Detection: Unexpected admin API calls in logs. Unauthorized configuration changes.

Impact: Attacker can read config, query database, modify parameters, flush caches.

IMMEDIATE ACTION REQUIRED — Rotate the key within minutes.
1. Generate a new admin key (any strong random string; see the one-liner after this list)
2. Update in Aurora:
curl -s "https://api.cpmp-site.org/admin/bootstrap-key?key=NEW_KEY_HERE"
3. Update KCC steering file: .kiro/steering/kiro-command-center.md
4. Update KCC script: scripts/kcc.sh (ADMIN_KEY variable)
5. Update TBCC: cpmp-redesign/admin/trinity-beast-command-center.html
6. Force deploy params to propagate: bash scripts/kcc.sh force-deploy
7. Review CloudWatch logs for unauthorized access during the exposure window
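For step 1, any cryptographically strong random string works; for example:
openssl rand -hex 32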

Scenario 13: Full Region Failure (us-east-2)

CriticalAWS us-east-2 (Ohio) region is down

Detection: All services unreachable. AWS Health Dashboard shows regional issue.

Impact: Complete outage. No API, no website (CloudFront may serve cached pages), no database.

Current posture: The Trinity Beast is single-region (us-east-2). Full region failure requires waiting for AWS to restore the region. Multi-region failover is a future enhancement.
1. Confirm the regional outage on the AWS Health Dashboard
2. CloudFront (global) may continue serving cached website pages
3. Communicate the outage to subscribers if prolonged
4. Once the region recovers, run full verification: bash scripts/kcc.sh verify
5. Force deploy params: bash scripts/kcc.sh force-deploy
6. Check sync job: bash scripts/kcc.sh sync-check

Future mitigation: Consider Aurora Global Database (cross-region read replica) and a standby ECS cluster in us-east-1 for critical-path failover.

Emergency Contacts & Resources

Resource                | Location
------------------------|---------
KCC Health Check        | bash scripts/kcc.sh health
KCC Full Verification   | bash scripts/kcc.sh verify
KCC Force Deploy        | bash scripts/kcc.sh force-deploy
KCC Security Dashboard  | bash scripts/kcc.sh security
AWS Health Dashboard    | health.aws.amazon.com
AWS Console (us-east-2) | us-east-2.console.aws.amazon.com
TBCC (Browser Console)  | cpmp-site.org/admin/trinity-beast-command-center.html
Stripe Dashboard        | dashboard.stripe.com
Admin API Key           | Stored in .kiro/steering/kiro-command-center.md and scripts/kcc.sh
AWS Secrets             | trinity-beast-secrets in Secrets Manager (us-east-2)