The Trinity Beast Infrastructure (TBI) uses Amazon CloudWatch as its centralized monitoring and alerting platform. This guide documents every dashboard, alarm, log group, and notification channel deployed across the system.
TrinityBeast/LPO and TrinityBeast/LRS namespaces
Four CloudWatch dashboards provide layered visibility, from real-time application metrics to executive cost summaries.
| Dashboard | Purpose |
|---|---|
| Trinity-Beast-Application-Dashboard | Primary ops dashboard: LPO, LRS, AWS infra, Lambda, logs |
| Trinity-Beast-Master-Dashboard | Comprehensive view across all services |
| Trinity-Beast-Cost-Detailed-Dashboard | Detailed cost breakdown by service |
| Trinity-Beast-Cost-Executive-Dashboard | Executive cost summary |
The Trinity-Beast-Application-Dashboard is the primary operational dashboard. It contains widgets organized into six sections covering every layer of the stack.
| Widget | Type |
|---|---|
| LPO Requests (per minute) | Metric — line graph |
| Cache Hit Rate (%) | Metric — gauge / number |
| Avg Latency (ms) | Metric — line graph |
| Cache Hits vs Misses | Metric — stacked area |
| Requests by Asset | Metric — bar chart |
| Requests by Source (Exchange) | Metric — bar chart |
| Errors & Source Failovers | Metric — line graph |
| Widget | Type |
|---|---|
| LRS Total Requests | Metric — line graph |
| LRS Avg Latency (ms) | Metric — line graph |
| LRS Output Format Usage | Metric — bar chart |
| LRS Errors | Metric — line graph |
| Widget | Type |
|---|---|
| ECS CPU Utilization (%) | Metric — line graph |
| ECS Memory Utilization (%) | Metric — line graph |
| ALB Response Time & Errors | Metric — line graph |
| ElastiCache CPU & Cache Hit Rate | Metric — line graph |
| ElastiCache Memory Usage (%) | Metric — gauge / number |
| Aurora Serverless Capacity (ACU) | Metric — line graph |
| Widget | Type |
|---|---|
| LPO — Main Service Logs | Log query |
| LRS — Report Service Logs | Log query |
| Mirror Service Logs | Log query |
| Sync Job Logs | Log query |
| Widget | Type |
|---|---|
| Lambda Invocations | Metric — line graph |
| Lambda Errors | Metric — line graph |
| Lambda Duration (ms) | Metric — line graph |
| Throttles & Concurrency | Metric — line graph |
| Receipts by Handler Type | Log widget |
| Recent Receipts — Handler Detail | Log widget |
| Receipt Lambda Logs | Log query |
| Widget | Type |
|---|---|
| CloudTrail — Errors & Access Denied | Log query |
| CloudTrail — ECS & Infrastructure Changes | Log query |
| VPC Flow Logs — Rejected Traffic (Trinity VPC) | Log query |
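
The log widgets above are backed by CloudWatch Logs Insights queries. The dashboard's actual query text is not reproduced in this guide, so the following is only a sketch of what the rejected-traffic widget could run. It assumes the flow logs use the default record format, which exposes fields such as srcAddr, dstAddr, and action. The helper name build_start_query_args and the one-hour default window are illustrative; the returned dictionary mirrors the keyword arguments of the boto3 CloudWatch Logs start_query call.

```python
import time

# Log group name taken from the log group table in this guide.
FLOW_LOG_GROUP = "/aws/vpc/trinity-beast-flowlogs"

# Assumed query: the real widget query may select different fields.
REJECTED_TRAFFIC_QUERY = """
fields @timestamp, srcAddr, dstAddr, dstPort, protocol
| filter action = "REJECT"
| sort @timestamp desc
| limit 50
""".strip()


def build_start_query_args(log_group, query, window_seconds=3600, now=None):
    """Build keyword arguments for a Logs Insights query over a recent window.

    `now` is an epoch timestamp in seconds; it defaults to the current time.
    """
    end = int(now if now is not None else time.time())
    return {
        "logGroupName": log_group,
        "startTime": end - window_seconds,
        "endTime": end,
        "queryString": query,
    }


args = build_start_query_args(FLOW_LOG_GROUP, REJECTED_TRAFFIC_QUERY)
print(args["queryString"])
# With AWS credentials configured, the arguments could be passed as:
#   boto3.client("logs").start_query(**args)
```
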
Two dedicated cost dashboards provide financial visibility into the Trinity Beast Infrastructure spend.
The Trinity-Beast-Cost-Detailed-Dashboard shows a per-service cost breakdown covering ECS Fargate, Aurora Serverless, ElastiCache, Lambda, S3, CloudWatch, NAT Gateway, and data transfer. Each service is displayed with daily and monthly cost trends, making it easy to identify which component is driving spend.
The Trinity-Beast-Cost-Executive-Dashboard provides a high-level monthly cost summary with total spend, month-over-month trends, and projected costs. It is designed for stakeholders who need a quick financial snapshot without per-service granularity.
14 alarms monitor critical infrastructure metrics. All alarms publish to the Trinity-Beast-Critical-Alerts SNS topic, triggering both email and SMS notifications simultaneously.
| Alarm Name | Metric | Namespace | Threshold | Period | Eval Periods | State |
|---|---|---|---|---|---|---|
| Trinity-Beast-ALB-UnhealthyTargets | UnHealthyHostCount | AWS/ApplicationELB | >= 1 | 60s | 3 | OK |
| Trinity-Beast-NLB-UnhealthyTargets | UnHealthyHostCount | AWS/NetworkELB | >= 1 | 60s | 3 | OK |
| Alarm Name | Metric | Namespace | Threshold | Period | Eval Periods | State | Notes |
|---|---|---|---|---|---|---|---|
| Trinity-Beast-ECS-CPU-High | CPUUtilization | AWS/ECS (main-service) | > 80% | 300s | 2 | OK | - |
| Trinity-Beast-ECS-CPU-High-Mirror | CPUUtilization | AWS/ECS (mirror-service) | > 80% | 300s | 2 | OK | - |
| Trinity-Beast-ECS-CPU-High-LRS | CPUUtilization | AWS/ECS (lrs-service) | > 80% | 300s | 2 | OK | - |
| Trinity-Beast-Main-Service-Count-Low | RunningTaskCount | ECS/ContainerInsights (main) | < 1 | 300s | 2 | OK | TreatMissing: breaching |
| Trinity-Beast-Mirror-Service-Count-Low | RunningTaskCount | ECS/ContainerInsights (mirror) | < 1 | 300s | 2 | OK | TreatMissing: breaching |
| Trinity-Beast-LRS-Service-Count-Low | RunningTaskCount | ECS/ContainerInsights (lrs) | < 1 | 300s | 2 | OK | TreatMissing: breaching |
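
To make the interaction between the threshold, the evaluation periods, and TreatMissing: breaching concrete, here is a deliberately simplified model of how a Count-Low alarm reaches its state. It is not CloudWatch's actual evaluation algorithm, which handles missing data with a wider evaluation range; it only assumes that the alarm fires when every one of the last two 300-second periods is either below the threshold or missing. The function name evaluate_low_count_alarm is hypothetical.

```python
from typing import Optional, Sequence


def evaluate_low_count_alarm(
    datapoints: Sequence[Optional[float]],
    threshold: float = 1.0,
    evaluation_periods: int = 2,
) -> str:
    """Simplified model of a '< threshold' alarm with missing data = breaching.

    `datapoints` holds one value per 300s period, oldest first.
    None marks a period in which no data was reported.
    """
    if len(datapoints) < evaluation_periods:
        return "INSUFFICIENT_DATA"
    recent = datapoints[-evaluation_periods:]
    # A period breaches if it is missing or its value is below the threshold.
    breaching = [value is None or value < threshold for value in recent]
    return "ALARM" if all(breaching) else "OK"


print(evaluate_low_count_alarm([1, 1, 1]))        # OK
print(evaluate_low_count_alarm([1, 0, 0]))        # ALARM
print(evaluate_low_count_alarm([1, None, None]))  # ALARM
print(evaluate_low_count_alarm([1, 0, 1]))        # OK
```

The third case is the reason for the setting: a crashed service stops reporting RunningTaskCount altogether, so treating missing periods as breaching lets the alarm fire instead of stalling without data.
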
| Alarm Name | Metric | Namespace | Threshold | Period | Eval Periods | State |
|---|---|---|---|---|---|---|
| Trinity-Beast-Aurora-CPU-High | CPUUtilization | AWS/RDS (trinity-beast-aurora-cluster) | > 80% | 300s | 2 | OK |
| Trinity-Beast-Aurora-Connections-High | DatabaseConnections | AWS/RDS (trinity-beast-aurora-cluster) | > 80 | 300s | 2 | OK |
| Alarm Name | Metric | Namespace | Threshold | Period | Eval Periods | State |
|---|---|---|---|---|---|---|
| Trinity-Beast-ElastiCache-CPU-High | CPUUtilization | AWS/ElastiCache | > 80% | 300s | 2 | OK |
| Trinity-Beast-ElastiCache-Memory-High | DatabaseMemoryUsagePercentage | AWS/ElastiCache | > 85% | 300s | 2 | OK/ALARM |
| Trinity-Beast-ElastiCache-Connections-High | CurrConnections | AWS/ElastiCache | > 1000 | 300s | 2 | OK |
| Alarm Name | Metric | Namespace | Threshold | Period | Eval Periods | State |
|---|---|---|---|---|---|---|
| Trinity-Beast-S3-Size-Unusual-Growth | BucketSizeBytes | AWS/S3 | > 10 GB | 86400s | 1 | OK |
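
Because BucketSizeBytes is reported in bytes, the 10 GB threshold has to be expressed as a byte count when the alarm is defined. This guide does not say whether "GB" is meant as a binary or a decimal unit, so the small sketch below shows both readings; the helper name size_threshold_bytes is illustrative.

```python
GIB = 1024 ** 3  # binary gigabyte (gibibyte)
GB = 1000 ** 3   # decimal gigabyte


def size_threshold_bytes(gigabytes: float, binary: bool = True) -> int:
    """Convert a size in gigabytes to the byte value an alarm threshold needs."""
    return int(gigabytes * (GIB if binary else GB))


print(size_threshold_bytes(10))                # 10737418240
print(size_threshold_bytes(10, binary=False))  # 10000000000
```
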
All 14 CloudWatch alarms route to a single SNS topic that delivers alerts through two channels simultaneously.
| Protocol | Endpoint | Status |
|---|---|---|
| Email | Admin@CPMP-Site.org | Confirmed |
| SMS | +16156128200 | Confirmed |
Behavior: When any of the 14 alarms transitions to ALARM state, both email and SMS notifications fire simultaneously. There is no escalation chain — both channels receive every alert.
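
As an illustration of how one of these alarms is wired to the SNS topic, the sketch below builds the parameters for the Trinity-Beast-ECS-CPU-High alarm using the threshold, period, and evaluation periods from the alarm tables. Several values are assumptions: the Average statistic, the ClusterName and ServiceName dimension values (the cluster name is inferred from the Container Insights log group path), and the topic ARN, whose region and account ID are placeholders. The keys follow the boto3 CloudWatch put_metric_alarm parameters; build_cpu_alarm itself is a hypothetical helper.

```python
# Placeholder ARN: the region and account ID are not given in this guide.
TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:Trinity-Beast-Critical-Alerts"


def build_cpu_alarm(alarm_name, service_name, topic_arn,
                    cluster_name="trinity-beast-fargate-cluster"):
    """Build put_metric_alarm parameters for an ECS CPU alarm (> 80%, 2 x 300s)."""
    return {
        "AlarmName": alarm_name,
        "Namespace": "AWS/ECS",
        "MetricName": "CPUUtilization",
        "Dimensions": [
            {"Name": "ClusterName", "Value": cluster_name},
            {"Name": "ServiceName", "Value": service_name},
        ],
        "Statistic": "Average",  # assumed; the guide does not name the statistic
        "Period": 300,
        "EvaluationPeriods": 2,
        "Threshold": 80.0,
        "ComparisonOperator": "GreaterThanThreshold",
        # The SNS topic fans each alert out to its email and SMS subscribers.
        "AlarmActions": [topic_arn],
    }


alarm = build_cpu_alarm("Trinity-Beast-ECS-CPU-High", "main-service", TOPIC_ARN)
print(alarm["AlarmName"], alarm["AlarmActions"])
# With AWS credentials configured, the dictionary could be passed as:
#   boto3.client("cloudwatch").put_metric_alarm(**alarm)
```

Keeping a single topic in AlarmActions matches the behavior described above: the alarm knows nothing about email or SMS, and the topic's subscriptions decide who is notified.
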
10 log groups capture output from every service layer. All groups are configured with a 30-day retention policy.
| Log Group | Retention | Source |
|---|---|---|
| /aws/ecs/trinity-beast | 30 days | All 3 ECS services (LPO, Mirror, LRS) |
| /aws/ecs/trinity-beast-sync | 30 days | Nightly sync job |
| /ecs/trinity-beast-lpo | 30 days | Legacy LPO logs |
| /ecs/trinity-beast-main-task-container-def | 30 days | Legacy main task logs |
| /aws/lambda/trinity-beast-receipt | 30 days | Receipt Lambda |
| /aws/vpc/trinity-beast-flowlogs | 30 days | VPC Flow Logs |
| /aws/cloudtrail/trinity-beast | 30 days | CloudTrail audit logs |
| /aws/codebuild/trinity-beast-build | 30 days | CodeBuild logs |
| /aws/ecs/containerinsights/trinity-beast-fargate-cluster/performance | 30 days | Container Insights |
| RDSOSMetrics | 30 days | Aurora OS metrics |
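
A uniform retention policy is easy to verify mechanically. The sketch below checks log group records for drift from the 30-day policy. It assumes records shaped like the logGroups entries returned by the CloudWatch Logs describe_log_groups call, where retentionInDays is absent for groups that never expire. The function name find_retention_drift and the sample data are illustrative, not taken from the deployed system.

```python
EXPECTED_RETENTION_DAYS = 30


def find_retention_drift(log_groups, expected_days=EXPECTED_RETENTION_DAYS):
    """Return (name, actual_retention) for groups that deviate from the policy."""
    drift = []
    for group in log_groups:
        # A missing retentionInDays key means the group never expires.
        actual = group.get("retentionInDays")
        if actual != expected_days:
            drift.append((group["logGroupName"], actual))
    return drift


# Made-up sample records for illustration only.
sample = [
    {"logGroupName": "/aws/ecs/trinity-beast", "retentionInDays": 30},
    {"logGroupName": "/aws/lambda/trinity-beast-receipt"},
    {"logGroupName": "RDSOSMetrics", "retentionInDays": 7},
]
print(find_retention_drift(sample))
# [('/aws/lambda/trinity-beast-receipt', None), ('RDSOSMetrics', 7)]
```
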
The application publishes custom metrics to two CloudWatch namespaces, providing business-level observability beyond standard AWS metrics.
Metrics published by the Live Price Oracle service:
| Metric | Description |
|---|---|
| Requests | Total LPO requests received |
| CacheHits | Requests served from ElastiCache cache |
| CacheMisses | Requests requiring upstream source fetch |
| Errors | Failed requests (all error types) |
| SourceFailovers | Times a primary source failed and secondary was used |
| AvgLatency | Average response time in milliseconds |
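
The Cache Hit Rate (%) widget on the application dashboard can be understood in terms of two of these metrics. The guide does not state the exact expression the widget uses, so the following is a sketch that assumes the usual definition: hits divided by total lookups (hits plus misses), expressed as a percentage.

```python
def cache_hit_rate(hits: float, misses: float) -> float:
    """Return the cache hit rate as a percentage of all cache lookups."""
    total = hits + misses
    if total == 0:
        # No lookups in the period: report 0.0 rather than dividing by zero.
        return 0.0
    return 100.0 * hits / total


print(cache_hit_rate(hits=940, misses=60))  # 94.0
```
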
Metrics published by the Live Report Service:
| Metric | Description |
|---|---|
| Requests | Total LRS report requests |
| AvgLatency | Average report generation time in milliseconds |
| Errors | Failed report generations |
| MonthlyLimitExceeded | Requests rejected due to monthly quota |
| DailyLimitExceeded | Requests rejected due to daily quota |
| AddOnRequests | Requests using add-on quota beyond base plan |
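
For readers who want to see what publishing to one of these namespaces involves, here is a minimal sketch of a payload for the CloudWatch put_metric_data call, using three of the LRS metric names above. How the services actually batch, aggregate, or dimension their metrics is not documented here, so the helper build_lrs_metric_payload, the choice of units, and the absence of dimensions are all assumptions.

```python
def build_lrs_metric_payload(requests: int, errors: int, avg_latency_ms: float) -> dict:
    """Build put_metric_data parameters for one reporting interval of LRS metrics."""
    return {
        "Namespace": "TrinityBeast/LRS",
        "MetricData": [
            {"MetricName": "Requests", "Value": float(requests), "Unit": "Count"},
            {"MetricName": "Errors", "Value": float(errors), "Unit": "Count"},
            {
                "MetricName": "AvgLatency",
                "Value": float(avg_latency_ms),
                "Unit": "Milliseconds",
            },
        ],
    }


payload = build_lrs_metric_payload(requests=120, errors=2, avg_latency_ms=340.5)
print([m["MetricName"] for m in payload["MetricData"]])
# ['Requests', 'Errors', 'AvgLatency']
# With AWS credentials configured, the payload could be sent as:
#   boto3.client("cloudwatch").put_metric_data(**payload)
```
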
When an alarm fires, use the following runbooks to diagnose and resolve the issue. Each category includes the most common root causes and recommended actions.
What it means: One or more ECS tasks are failing health checks from the load balancer.
/aws/ecs/trinity-beast for startup errors or OOM kills
What it means: An ECS service is consuming more than 80% CPU over a sustained period.
What it means: A container has crashed and no tasks are running for the service. These alarms use TreatMissing: breaching, so missing data also triggers the alarm.
What it means: The Aurora Serverless v2 cluster is consuming more than 80% CPU.
pg_stat_statements
What it means: More than 80 active database connections, approaching the connection limit.
What it means: The ElastiCache cluster is under resource pressure — CPU, memory, or connection count is elevated.
What it means: The S3 bucket has exceeded 10 GB, which may indicate unexpected data accumulation.
PutObject events