The Trinity Beast Infrastructure — CloudWatch Dashboard & Alarm Notifications

1. Overview

The Trinity Beast Infrastructure (TBI) uses Amazon CloudWatch as its centralized monitoring and alerting platform. This guide documents every dashboard, alarm, log group, and notification channel deployed across the system.

4 CloudWatch dashboards for operational and cost visibility
14 alarms covering ECS, Aurora, ElastiCache, ALB, NLB, and S3
SNS topic delivers alerts via email and SMS simultaneously
10 log groups with 30-day retention across all services
Custom metrics published to TrinityBeast/LPO and TrinityBeast/LRS namespaces

2. Dashboards

Four CloudWatch dashboards provide layered visibility — from real-time application metrics to executive cost summaries.

Dashboard	Purpose
`Trinity-Beast-Application-Dashboard`	Primary ops dashboard — LPO, LRS, AWS infra, Lambda, logs
`Trinity-Beast-Master-Dashboard`	Comprehensive view across all services
`Trinity-Beast-Cost-Detailed-Dashboard`	Detailed cost breakdown by service
`Trinity-Beast-Cost-Executive-Dashboard`	Executive cost summary

3. Application Dashboard — Widget Reference

The Trinity-Beast-Application-Dashboard is the primary operational dashboard. It contains 31 widgets organized into six sections covering every layer of the stack.

LPO Section

LPO Widgets (7 widgets)

Widget	Type
LPO Requests (per minute)	Metric — line graph
Cache Hit Rate (%)	Metric — gauge / number
Avg Latency (ms)	Metric — line graph
Cache Hits vs Misses	Metric — stacked area
Requests by Asset	Metric — bar chart
Requests by Source (Exchange)	Metric — bar chart
Errors & Source Failovers	Metric — line graph
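The Cache Hit Rate (%) widget can be derived from the two cache counters with CloudWatch metric math. The sketch below shows one way such a widget could be expressed in a dashboard body. It is not the deployed definition: the region, the grid placement, and the assumption that `CacheHits` and `CacheMisses` are published without dimensions are all illustrative.

```python
import json

REGION = "us-east-1"  # assumption: the guide does not state the region


def cache_hit_rate_widget(region: str = REGION) -> dict:
    """Build a single-value widget that derives hit rate from two custom metrics."""
    return {
        "type": "metric",
        "x": 0, "y": 0, "width": 6, "height": 6,  # hypothetical grid placement
        "properties": {
            "view": "singleValue",
            "title": "Cache Hit Rate (%)",
            "region": region,
            "stat": "Sum",
            "period": 60,
            "metrics": [
                # Metric math: hits as a percentage of all cache lookups.
                [{"expression": "100 * m1 / (m1 + m2)",
                  "label": "Cache Hit Rate (%)", "id": "e1"}],
                # The inputs are hidden so only the derived value is displayed.
                ["TrinityBeast/LPO", "CacheHits", {"id": "m1", "visible": False}],
                ["TrinityBeast/LPO", "CacheMisses", {"id": "m2", "visible": False}],
            ],
        },
    }


def dashboard_body(widgets: list) -> str:
    """Serialize widgets into the JSON string used as a dashboard body."""
    return json.dumps({"widgets": widgets})


body = dashboard_body([cache_hit_rate_widget()])
```

The resulting string is the kind of value that could be passed as `DashboardBody` to the CloudWatch `PutDashboard` API.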

LRS Section

LRS Widgets (4 widgets)

Widget	Type
LRS Total Requests	Metric — line graph
LRS Avg Latency (ms)	Metric — line graph
LRS Output Format Usage	Metric — bar chart
LRS Errors	Metric — line graph

AWS Infrastructure Section

Infrastructure Widgets (6 widgets)

Widget	Type
ECS CPU Utilization (%)	Metric — line graph
ECS Memory Utilization (%)	Metric — line graph
ALB Response Time & Errors	Metric — line graph
ElastiCache CPU & Cache Hit Rate	Metric — line graph
ElastiCache Memory Usage (%)	Metric — gauge / number
Aurora Serverless Capacity (ACU)	Metric — line graph

Container Logs Section

Log Widgets (4 widgets)

Widget	Type
LPO — Main Service Logs	Log query
LRS — Report Service Logs	Log query
Mirror Service Logs	Log query
Sync Job Logs	Log query

Lambda Section

Lambda Widgets (7 widgets)

Widget	Type
Lambda Invocations	Metric — line graph
Lambda Errors	Metric — line graph
Lambda Duration (ms)	Metric — line graph
Throttles & Concurrency	Metric — line graph
Receipts by Handler Type	Log widget
Recent Receipts — Handler Detail	Log widget
Receipt Lambda Logs	Log query

CloudTrail & VPC Section

Audit & Network Widgets (3 widgets)

Widget	Type
CloudTrail — Errors & Access Denied	Log query
CloudTrail — ECS & Infrastructure Changes	Log query
VPC Flow Logs — Rejected Traffic (Trinity VPC)	Log query
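Log query widgets like the ones above are backed by CloudWatch Logs Insights queries. The sketch below shows how the rejected-traffic widget might be built. The query text, region, and widget placement are assumptions, not the deployed definition. The log group name comes from the log group table in this guide, and the field names are those Logs Insights discovers for default-format VPC Flow Logs.

```python
REGION = "us-east-1"  # assumption: the guide does not state the region

# Hypothetical query: most recent rejected flows, newest first.
REJECTED_TRAFFIC_QUERY = (
    "fields @timestamp, srcAddr, dstAddr, dstPort, protocol "
    '| filter action = "REJECT" '
    "| sort @timestamp desc "
    "| limit 50"
)


def log_widget(title: str, log_group: str, query: str, region: str = REGION) -> dict:
    """Build a dashboard log widget that runs a Logs Insights query on one log group."""
    return {
        "type": "log",
        "x": 0, "y": 0, "width": 24, "height": 6,  # hypothetical grid placement
        "properties": {
            "title": title,
            "region": region,
            "view": "table",
            # Dashboard log widgets prefix the query with its SOURCE log group.
            "query": f"SOURCE '{log_group}' | {query}",
        },
    }


widget = log_widget(
    "VPC Flow Logs - Rejected Traffic (Trinity VPC)",
    "/aws/vpc/trinity-beast-flowlogs",
    REJECTED_TRAFFIC_QUERY,
)
```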

4. Cost Dashboards

Two dedicated cost dashboards provide financial visibility into the Trinity Beast Infrastructure spend.

Trinity-Beast-Cost-Detailed-Dashboard

Shows a per-service cost breakdown including ECS Fargate, Aurora Serverless, ElastiCache, Lambda, S3, CloudWatch, NAT Gateway, and data transfer. Each service is displayed with daily and monthly cost trends, making it easy to identify which component is driving spend.

Trinity-Beast-Cost-Executive-Dashboard

Provides a high-level monthly cost summary with total spend, month-over-month trends, and projected costs. Designed for stakeholders who need a quick financial snapshot without per-service granularity.

5. CloudWatch Alarms

14 alarms monitor critical infrastructure metrics. All alarms publish to the Trinity-Beast-Critical-Alerts SNS topic, triggering both email and SMS notifications simultaneously.

Load Balancers (2 Alarms)

ALB & NLB Health (state: OK)

Alarm Name	Metric	Namespace	Threshold	Period	Eval Periods	State
`Trinity-Beast-ALB-UnhealthyTargets`	UnHealthyHostCount	AWS/ApplicationELB	>= 1	60s	3	OK
`Trinity-Beast-NLB-UnhealthyTargets`	UnHealthyHostCount	AWS/NetworkELB	>= 1	60s	3	OK

ECS Services (6 Alarms)

ECS CPU & Task Count (state: OK)

Alarm Name	Metric	Namespace	Threshold	Period	Eval Periods	State	Notes
`Trinity-Beast-ECS-CPU-High`	CPUUtilization	AWS/ECS (main-service)	> 80%	300s	2	OK	—
`Trinity-Beast-ECS-CPU-High-Mirror`	CPUUtilization	AWS/ECS (mirror-service)	> 80%	300s	2	OK	—
`Trinity-Beast-ECS-CPU-High-LRS`	CPUUtilization	AWS/ECS (lrs-service)	> 80%	300s	2	OK	—
`Trinity-Beast-Main-Service-Count-Low`	RunningTaskCount	ECS/ContainerInsights (main)	< 1	300s	2	OK	TreatMissing: breaching
`Trinity-Beast-Mirror-Service-Count-Low`	RunningTaskCount	ECS/ContainerInsights (mirror)	< 1	300s	2	OK	TreatMissing: breaching
`Trinity-Beast-LRS-Service-Count-Low`	RunningTaskCount	ECS/ContainerInsights (lrs)	< 1	300s	2	OK	TreatMissing: breaching
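As an illustration of how a row in this table maps onto alarm parameters, the sketch below builds the keyword arguments for a Service-Count-Low alarm in the shape accepted by the CloudWatch `PutMetricAlarm` API. The account ID and region in the topic ARN, the `Average` statistic, and the cluster and service names are assumptions. The cluster name is inferred from the Container Insights log group path and the service name from the Namespace column.

```python
# Hypothetical ARN: the account ID and region are placeholders.
TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:Trinity-Beast-Critical-Alerts"


def service_count_low_alarm(alarm_name: str, cluster: str, service: str,
                            topic_arn: str = TOPIC_ARN) -> dict:
    """Build PutMetricAlarm parameters for a 'running task count < 1' alarm."""
    return {
        "AlarmName": alarm_name,
        "Namespace": "ECS/ContainerInsights",
        "MetricName": "RunningTaskCount",
        "Dimensions": [
            {"Name": "ClusterName", "Value": cluster},
            {"Name": "ServiceName", "Value": service},
        ],
        "Statistic": "Average",          # assumption: the table does not list the statistic
        "Period": 300,                   # seconds, from the Period column
        "EvaluationPeriods": 2,          # from the Eval Periods column
        "Threshold": 1.0,
        "ComparisonOperator": "LessThanThreshold",
        "TreatMissingData": "breaching",  # missing datapoints count as a breach
        "AlarmActions": [topic_arn],
    }


params = service_count_low_alarm(
    "Trinity-Beast-Main-Service-Count-Low",
    cluster="trinity-beast-fargate-cluster",  # assumed cluster name
    service="main-service",                   # assumed service name
)
```

With boto3 installed and credentials configured, these parameters could be applied with `boto3.client("cloudwatch").put_metric_alarm(**params)`.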

Aurora (2 Alarms)

Aurora Serverless v2 (state: OK)

Alarm Name	Metric	Namespace	Threshold	Period	Eval Periods	State
`Trinity-Beast-Aurora-CPU-High`	CPUUtilization	AWS/RDS (trinity-beast-aurora-cluster)	> 80%	300s	2	OK
`Trinity-Beast-Aurora-Connections-High`	DatabaseConnections	AWS/RDS (trinity-beast-aurora-cluster)	> 80	300s	2	OK

ElastiCache (3 Alarms)

ElastiCache for Valkey (state: mixed)

Alarm Name	Metric	Namespace	Threshold	Period	Eval Periods	State
`Trinity-Beast-ElastiCache-CPU-High`	CPUUtilization	AWS/ElastiCache	> 80%	300s	2	OK
`Trinity-Beast-ElastiCache-Memory-High`	DatabaseMemoryUsagePercentage	AWS/ElastiCache	> 85%	300s	2	OK/ALARM
`Trinity-Beast-ElastiCache-Connections-High`	CurrConnections	AWS/ElastiCache	> 1000	300s	2	OK

S3 (1 Alarm)

S3 Bucket Size (state: OK)

Alarm Name	Metric	Namespace	Threshold	Period	Eval Periods	State
`Trinity-Beast-S3-Size-Unusual-Growth`	BucketSizeBytes	AWS/S3	> 10 GB	86400s	1	OK
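The Period and Eval Periods columns together set how long a breach must persist before an alarm can enter ALARM state. The arithmetic below uses values from the alarm tables. It gives a lower bound only, since metric delivery delay adds to it in practice.

```python
# (period in seconds, evaluation periods), taken from the alarm tables above.
ALARM_WINDOWS = {
    "Trinity-Beast-ALB-UnhealthyTargets": (60, 3),
    "Trinity-Beast-ECS-CPU-High": (300, 2),
    "Trinity-Beast-Aurora-CPU-High": (300, 2),
    "Trinity-Beast-S3-Size-Unusual-Growth": (86400, 1),
}


def min_detection_seconds(period: int, evaluation_periods: int) -> int:
    """Shortest sustained breach that can move the alarm to ALARM."""
    return period * evaluation_periods


detection = {
    name: min_detection_seconds(period, periods)
    for name, (period, periods) in ALARM_WINDOWS.items()
}

for name, seconds in detection.items():
    print(f"{name}: {seconds // 60} min")
```

On these settings, a load balancer health problem can alert after about 3 minutes, while the CPU alarms need roughly 10 minutes of sustained load and the S3 alarm evaluates once per day.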

6. SNS Notifications

All 14 CloudWatch alarms route to a single SNS topic that delivers alerts through two channels simultaneously.

Trinity-Beast-Critical-Alerts SNS Topic

Topic Name: Trinity-Beast-Critical-Alerts
Subscriptions: 2
Alarms Attached: 14
Delivery: Simultaneous

Protocol	Endpoint	Status
Email	`Admin@CPMP-Site.org`	Confirmed
SMS	`+16156128200`	Confirmed

Behavior: When any of the 14 alarms transitions to ALARM state, both email and SMS notifications fire simultaneously. There is no escalation chain — both channels receive every alert.
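The fan-out described above amounts to one SNS subscription per channel on the same topic. A minimal sketch of the request parameters, in the shape accepted by the SNS `Subscribe` API, follows. The topic ARN and both endpoints are placeholders, not the deployed values.

```python
# Hypothetical ARN: the account ID and region are placeholders.
TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:Trinity-Beast-Critical-Alerts"

# Placeholder endpoints; substitute the real email address and phone number.
CHANNELS = [
    ("email", "ops@example.com"),
    ("sms", "+15555550100"),
]


def subscription_requests(topic_arn: str, channels: list) -> list:
    """Build one Subscribe request per (protocol, endpoint) pair on a single topic."""
    return [
        {"TopicArn": topic_arn, "Protocol": protocol, "Endpoint": endpoint}
        for protocol, endpoint in channels
    ]


requests = subscription_requests(TOPIC_ARN, CHANNELS)
```

Each dictionary could be passed to `boto3.client("sns").subscribe(**request)`. Email subscriptions stay pending until the recipient confirms them, which is why the table above tracks a Status column.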

7. CloudWatch Log Groups

10 log groups capture output from every service layer. All groups are configured with a 30-day retention policy.

Log Group	Retention	Source
`/aws/ecs/trinity-beast`	30 days	All 3 ECS services (LPO, Mirror, LRS)
`/aws/ecs/trinity-beast-sync`	30 days	Nightly sync job
`/ecs/trinity-beast-lpo`	30 days	Legacy LPO logs
`/ecs/trinity-beast-main-task-container-def`	30 days	Legacy main task logs
`/aws/lambda/trinity-beast-receipt`	30 days	Receipt Lambda
`/aws/vpc/trinity-beast-flowlogs`	30 days	VPC Flow Logs
`/aws/cloudtrail/trinity-beast`	30 days	CloudTrail audit logs
`/aws/codebuild/trinity-beast-build`	30 days	CodeBuild logs
`/aws/ecs/containerinsights/trinity-beast-fargate-cluster/performance`	30 days	Container Insights
`RDSOSMetrics`	30 days	Aurora OS metrics
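A retention policy is set per log group, so it can drift when new groups are created. The sketch below checks a list of log group records, in the shape returned by the CloudWatch Logs `DescribeLogGroups` API, against the 30-day policy. The sample records are invented for illustration and do not reflect the actual state of these groups.

```python
EXPECTED_RETENTION_DAYS = 30


def retention_drift(log_groups: list, expected: int = EXPECTED_RETENTION_DAYS) -> dict:
    """Map log group name to its retention for every group that deviates.

    A value of None means no retention is set, so events never expire.
    """
    return {
        group["logGroupName"]: group.get("retentionInDays")
        for group in log_groups
        if group.get("retentionInDays") != expected
    }


# Hypothetical sample input, not real account data.
sample = [
    {"logGroupName": "/aws/ecs/trinity-beast", "retentionInDays": 30},
    {"logGroupName": "/aws/lambda/trinity-beast-receipt"},  # retention unset
    {"logGroupName": "RDSOSMetrics", "retentionInDays": 7},
]

drift = retention_drift(sample)
```

Any group reported here could then be corrected with the `PutRetentionPolicy` API, for example `put_retention_policy(logGroupName=name, retentionInDays=30)` in boto3.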

8. Custom Metrics (TrinityBeast Namespace)

The application publishes custom metrics to two CloudWatch namespaces, providing business-level observability beyond standard AWS metrics.

TrinityBeast/LPO Custom Namespace

Metrics published by the Live Price Oracle service:

Metric	Description
`Requests`	Total LPO requests received
`CacheHits`	Requests served from ElastiCache
`CacheMisses`	Requests requiring upstream source fetch
`Errors`	Failed requests (all error types)
`SourceFailovers`	Times a primary source failed and secondary was used
`AvgLatency`	Average response time in milliseconds

TrinityBeast/LRS Custom Namespace

Metrics published by the Live Report Service:

Metric	Description
`Requests`	Total LRS report requests
`AvgLatency`	Average report generation time in milliseconds
`Errors`	Failed report generations
`MonthlyLimitExceeded`	Requests rejected due to monthly quota
`DailyLimitExceeded`	Requests rejected due to daily quota
`AddOnRequests`	Requests using add-on quota beyond base plan
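For reference, the sketch below shows how one batch of LPO metrics could be assembled in the shape accepted by the CloudWatch `PutMetricData` API. The services' actual publishing code is not shown in this guide, so the function name, the units, and the absence of dimensions are assumptions.

```python
from datetime import datetime, timezone


def lpo_metric_payload(requests, cache_hits, cache_misses, errors,
                       source_failovers, avg_latency_ms, now=None) -> dict:
    """Build PutMetricData parameters for one LPO reporting interval."""
    timestamp = now or datetime.now(timezone.utc)

    def datum(name, value, unit):
        return {"MetricName": name, "Timestamp": timestamp,
                "Value": float(value), "Unit": unit}

    return {
        "Namespace": "TrinityBeast/LPO",
        "MetricData": [
            datum("Requests", requests, "Count"),
            datum("CacheHits", cache_hits, "Count"),
            datum("CacheMisses", cache_misses, "Count"),
            datum("Errors", errors, "Count"),
            datum("SourceFailovers", source_failovers, "Count"),
            datum("AvgLatency", avg_latency_ms, "Milliseconds"),
        ],
    }


# Hypothetical values for a single interval.
payload = lpo_metric_payload(120, 100, 20, 1, 0, 8.5)
```

The payload could be sent with `boto3.client("cloudwatch").put_metric_data(**payload)`. An LRS equivalent would use the `TrinityBeast/LRS` namespace and the metric names from its table.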

9. Alarm Response Playbook

When an alarm fires, use the following runbooks to diagnose and resolve the issue. Each category includes the most common root causes and recommended actions.

ALB/NLB Unhealthy Targets (Critical)

Alarms: Trinity-Beast-ALB-UnhealthyTargets, Trinity-Beast-NLB-UnhealthyTargets

What it means: One or more ECS tasks are failing health checks from the load balancer.

Check ECS service health in the console — are tasks running or in a crash loop?
Review container logs in /aws/ecs/trinity-beast for startup errors or OOM kills
Verify target group health check path and expected response code
Check if a recent deployment introduced a breaking change
If tasks are running but unhealthy, check application health endpoint directly

ECS CPU High (Warning)

Alarms: Trinity-Beast-ECS-CPU-High, Trinity-Beast-ECS-CPU-High-Mirror, Trinity-Beast-ECS-CPU-High-LRS

What it means: An ECS service has exceeded 80% CPU for two consecutive 5-minute periods.

Check for a traffic spike — correlate with LPO/LRS request metrics on the Application Dashboard
Consider scaling the service — increase desired task count or adjust auto-scaling thresholds
Check for runaway goroutines or infinite loops in recent deployments
Review Container Insights for per-task CPU breakdown
If sustained, evaluate whether the task CPU allocation (vCPU) needs to be increased

Service Count Low (Critical)

Alarms: Trinity-Beast-Main-Service-Count-Low, Trinity-Beast-Mirror-Service-Count-Low, Trinity-Beast-LRS-Service-Count-Low

What it means: The running task count for a service has dropped below 1, typically because its container crashed. These alarms set TreatMissingData to breaching, so missing data also triggers the alarm.

Check ECS service events for task stopped reasons (OOM, exit code, health check failure)
Review container logs for the last running task — look for panic, fatal, or OOM messages
Check if the ECR image exists and is pullable (image pull failures)
Verify the task execution role has required permissions
Manually start a new task if the service is not recovering automatically
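To make the effect of the missing-data setting concrete, here is a deliberately simplified model of how a Service-Count-Low alarm decides its state. It only looks at the most recent datapoints and supports two missing-data modes. Real CloudWatch evaluation is more involved, so treat this as an aid to intuition rather than a reimplementation.

```python
from typing import Optional, Sequence


def evaluate_count_low(datapoints: Sequence[Optional[float]],
                       threshold: float = 1.0,
                       evaluation_periods: int = 2,
                       treat_missing: str = "breaching") -> str:
    """Return 'ALARM', 'OK', or 'INSUFFICIENT_DATA' for a '< threshold' alarm.

    datapoints holds one value per period, oldest first; None marks a missing
    datapoint. treat_missing is 'breaching' or 'notBreaching'.
    """
    recent = list(datapoints)[-evaluation_periods:]
    if len(recent) < evaluation_periods:
        return "INSUFFICIENT_DATA"

    def breaches(value: Optional[float]) -> bool:
        if value is None:
            return treat_missing == "breaching"
        return value < threshold

    return "ALARM" if all(breaches(v) for v in recent) else "OK"


# A service whose tasks stop reporting: the last two periods have no data.
state_when_silent = evaluate_count_low([2, 2, None, None])
# The same silence with missing data treated as healthy.
state_if_ignored = evaluate_count_low([2, 2, None, None], treat_missing="notBreaching")
```

In this model the first call yields ALARM and the second yields OK, which is why a crashed service that stops emitting Container Insights data still pages under the breaching setting.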

Aurora CPU High (Warning)

Alarm: Trinity-Beast-Aurora-CPU-High

What it means: The Aurora Serverless v2 cluster is consuming more than 80% CPU.

Check for slow queries — use Performance Insights or pg_stat_statements
Verify ACU scaling — is the cluster at max ACU and still under pressure?
Check if the nightly sync job is running and creating batch write pressure
Look for missing indexes on frequently queried columns
Consider increasing the max ACU limit if load is legitimate

Aurora Connections High (Warning)

Alarm: Trinity-Beast-Aurora-Connections-High

What it means: The cluster has had more than 80 open database connections for two consecutive 5-minute periods, which signals that it is approaching the connection limit.

Check connection pool settings in the application — are pools sized correctly?
Look for connection leaks — connections opened but never returned to the pool
Verify that the sync job and Lambda are not opening excessive connections
Consider using RDS Proxy if connection pressure is persistent
Check if a recent deployment changed pool configuration
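When reviewing pool settings, it can help to add up the worst-case connections each client could open and compare the total with the alarm threshold of 80. The sketch below does that arithmetic. The task counts, pool sizes, and Lambda concurrency are hypothetical and should be replaced with the real configuration.

```python
ALARM_THRESHOLD = 80  # from the Trinity-Beast-Aurora-Connections-High alarm

# client -> (instances, max connections per instance); all values hypothetical.
POOLS = {
    "main-service": (2, 10),
    "mirror-service": (1, 10),
    "lrs-service": (1, 10),
    "sync-job": (1, 5),
    "receipt-lambda": (10, 1),  # concurrent executions x 1 connection each
}


def projected_connections(pools: dict) -> int:
    """Worst-case connection count if every pool is fully used."""
    return sum(instances * pool_size for instances, pool_size in pools.values())


total = projected_connections(POOLS)
headroom = ALARM_THRESHOLD - total
print(f"worst case: {total} connections, headroom before alarm: {headroom}")
```

A negative headroom would mean the configured pools alone can trip the alarm, without any leak being present.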

ElastiCache CPU / Memory / Connections (Warning)

Alarms: Trinity-Beast-ElastiCache-CPU-High, Trinity-Beast-ElastiCache-Memory-High, Trinity-Beast-ElastiCache-Connections-High

What it means: The ElastiCache cluster is under resource pressure — CPU, memory, or connection count is elevated.

Check for a cache stampede — many cache misses causing simultaneous upstream fetches
Review key eviction metrics — if memory is full, keys are being evicted prematurely
Check connection pool settings in the LPO service — are connections being reused properly?
Look for large keys or hot keys that may be causing uneven load
If memory is consistently high, consider scaling to a larger node type or adding shards
Review TTL settings — are cached items living too long and consuming memory?

S3 Unusual Size Growth (Low Priority)

Alarm: Trinity-Beast-S3-Size-Unusual-Growth

What it means: The S3 bucket has exceeded 10 GB, which may indicate unexpected data accumulation.

Check for unexpected uploads — review S3 access logs or CloudTrail for PutObject events
Look for log file accumulation — are old log exports or reports piling up?
Verify lifecycle policies are in place to expire or transition old objects
Check if the LRS report output is being stored without cleanup
Review bucket versioning — old versions may be consuming space
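If the review finds that no lifecycle policy exists, a rule along the following lines could expire old objects and old versions. This is a sketch in the shape accepted by the S3 `PutBucketLifecycleConfiguration` API. The `reports/` prefix and both day counts are hypothetical and should be chosen to match actual retention requirements.

```python
def lifecycle_configuration(prefix: str, expire_days: int, noncurrent_days: int) -> dict:
    """Build a lifecycle configuration with one expiry rule for a key prefix."""
    rule_id = f"expire-{prefix.strip('/') or 'all'}"
    return {
        "Rules": [
            {
                "ID": rule_id,
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},
                # Delete current objects after expire_days.
                "Expiration": {"Days": expire_days},
                # On versioned buckets, also delete superseded versions.
                "NoncurrentVersionExpiration": {"NoncurrentDays": noncurrent_days},
            }
        ]
    }


# Hypothetical values: expire report output after 90 days, old versions after 30.
config = lifecycle_configuration("reports/", expire_days=90, noncurrent_days=30)
```

With boto3, the configuration could be applied with `boto3.client("s3").put_bucket_lifecycle_configuration(Bucket=bucket_name, LifecycleConfiguration=config)`, where `bucket_name` is the bucket behind the alarm.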