Streamlined performance validation — 7 targeted tests covering Public and Partner access paths with real-time 24-counter telemetry and cluster-wide aggregation.
The v4.1 plan had 12 tests across 3 phases — LPO Only, LRS Only, and Combined — with each phase run across 4 topologies (TCP Direct, TCP ALB, UDP Direct, UDP NLB). That meant 4 infrastructure reconfigurations and roughly 4.5 hours of testing.
The v5.0 plan eliminates redundancy by recognizing three facts:
- Production runs LPO and LRS combined in a single APP_REPORT_SERVER container, so testing them in isolation duplicates work.
- There are only two real access paths — Public (rate limited) and Partner (unrestricted) — and each maps to a specific protocol and topology.
- All production-topology tests share one configuration, so only one infrastructure reconfiguration is needed.
Result: 7 targeted tests in 2 phases. ~2.5 hours instead of 4.5. Every test answers a specific question about a real access path. Zero redundancy.
| Characteristic | Public Subscriber | AWS Partner |
|---|---|---|
| Protocol | TCP (HTTPS via ALB) or UDP (via NLB) | TCP (PrivateLink) or UDP (VPC Peering / NLB) |
| Rate Limiting | Enforced — QPS + burst + monthly cap | Bypassed — zero rate limiting, zero caps |
| TLS | ALB terminates TLS (adds ~1-2ms) | PrivateLink or direct — no TLS overhead |
| Billing Checks | Monthly usage validated per request | Skipped entirely |
| API Key | Required — identifies account for billing | Required — identifies partner for tracking and analytics |
| Connection | Public internet → CloudFront → ALB | AWS backbone → PrivateLink / VPC Peering |
Partner API Keys: The exchanges do not require partners to authenticate — but we require an API key for every partner. This is not a restriction; it is visibility. You cannot manage what you cannot measure. A partner key with zero rate limits and zero billing still gives us per-partner usage tracking, analytics, and the ability to identify issues. That is good engineering.
| Component | Specification | Role |
|---|---|---|
| ECS Fargate | 3 × 8 vCPU / 32 GB RAM (APP_REPORT_SERVER) | Application containers — LPO + LRS combined |
| ElastiCache | cache.r7g.2xlarge, Valkey 7.2, TLS, 52 GB | Price cache, usage log indexes, cluster stats |
| Aurora Serverless v2 | PostgreSQL 17.7, Optimized I/O, 2–18 ACU | Source of truth — API keys, usage logs, parameters |
| ALB | Trinity-Beast-TCP-ALB | TCP load balancing (443 → 8080/9090) |
| NLB | Trinity-Beast-UDP-NLB | UDP load balancing (2679/2680) |
| Stress Client | c6i.metal or equivalent (96 vCPU) | Load generator — same region (us-east-2) |
Hot Path: Price requests are served from an in-process sync.Map populated by 6 persistent WebSocket feeds (Coinbase, Gemini, Kraken, Gate.io, Bybit, OKX). Zero network calls on the hot path. ElastiCache is the second layer (sub-millisecond). REST API fallback is the third layer (cache miss only). Under stress testing with 300s cache TTL, 99%+ of requests hit the sync.Map.
Each test follows a 13-level progressive load ramp. Both concurrency and request volume increase at every level, which reveals the exact concurrency threshold where performance degrades.
| Level | Requests | Concurrent | Purpose |
|---|---|---|---|
| 1 | 300 | 30 | Warm-up — connection pool initialization |
| 2 | 900 | 90 | Light load — verify cold-start fix |
| 3 | 3,000 | 300 | Moderate load — baseline throughput |
| 4 | 9,000 | 600 | Sustained load — batch pipeline under pressure |
| 5 | 30,000 | 900 | High load — entering sweet spot |
| 6 | 90,000 | 1,500 | Heavy load — peak throughput zone |
| 7 | 300,000 | 3,000 | Extreme load — maximum RPS target |
| 8 | 600,000 | 6,000 | Overload — testing graceful degradation |
| 9 | 900,000 | 9,000 | Severe overload — success rate threshold |
| 10 | 1,500,000 | 12,000 | Breaking point — where failures begin |
| 11 | 3,000,000 | 15,000 | Recovery test — can the system stabilize? |
| 12 | 6,000,000 | 18,000 | Endurance — sustained extreme load |
| 13 | 9,000,000 | 21,000 | Absolute ceiling — maximum concurrent connections |
Success Criteria: 100% success rate through level 9 (9,000 concurrent). Graceful degradation above that. p50 latency under 10ms for cache hits. p99 under 300ms through the sweet spot (levels 5–8). Zero 5xx errors through level 9.
Every container runs 24 atomic.Int64 counters that track the complete request lifecycle. The /admin/stress-stats endpoint returns a per-container snapshot. The /admin/cluster-stats endpoint reads all 3 snapshots from ElastiCache in a single pipeline call — one round-trip, sub-millisecond, and guaranteed to include all 3 containers.
- tcp_rps, udp_rps, total_rps — real-time requests per second by protocol
- syncmap_hit_pct, cache_hit_pct — three-layer visibility; sync.Map hits should be 99%+ at 300s TTL
- udp_drop_pct, packets_received vs packets_sent — packet-loss visibility
- bg_drop_pct, submitted vs completed — housekeeping saturation
- rows_queued — SQS messages sent (usage-log entries queued for the Lambda consumer)
- db_open, db_in_use, db_wait_count — connection pool utilization
```mermaid
graph LR
    subgraph Containers
        M["BeastMain<br/>24 counters"] -->|"every 3s"| EC
        Mi["BeastMirror<br/>24 counters"] -->|"every 3s"| EC
        L["BeastLRS<br/>24 counters"] -->|"every 3s"| EC
    end
    EC[("ElastiCache<br/>cluster:stats:*<br/>TTL 30s")]
    EC -->|"1 pipeline read"| CS[/"/admin/cluster-stats"/]
    CS --> TBCC["Command Center<br/>Cluster Health"]
    CS --> CW["CloudWatch<br/>6 Metrics"]
    style M fill:#334155,stroke:#64748b,color:#94a3b8
    style Mi fill:#334155,stroke:#64748b,color:#94a3b8
    style L fill:#334155,stroke:#64748b,color:#94a3b8
    style EC fill:#5a3a3a,stroke:#8a5a5a,color:#e2c8c8
    style CS fill:#2d5a4a,stroke:#4a9a7a,color:#cbd5e1
    style TBCC fill:#2d4a6f,stroke:#4a7ab5,color:#cbd5e1
    style CW fill:#3d3a5c,stroke:#6b6399,color:#cbd5e1
```
The Tuning Loop: Poll /admin/cluster-stats every 3 seconds during testing. If udp_drop_pct exceeds 1%, increase worker pool capacity. If bg_drop_pct exceeds 5%, relax flush intervals. If db_in_use approaches db_max_open, the connection pool is saturating. Each metric points to a specific application parameter — most adjustable without a restart via /admin/system-mode.
| Test | Protocol | Access Path | Topology | What It Measures |
|---|---|---|---|---|
| **Phase 1 — Single Container Ceiling (Direct, No Load Balancer)** | | | | |
| 1a | TCP | — | Direct to 1 node | Absolute single-container TCP ceiling |
| 1b | UDP | — | Direct to 1 node | Absolute single-container UDP ceiling (v6 engine) |
| **Phase 2 — Production Topology (3 Nodes, APP_REPORT_SERVER, Load Balanced)** | | | | |
| 2a | TCP via ALB | Public | 3 nodes, distributed AZs | Production TCP throughput (subscribers, rate limited) |
| 2b | UDP via NLB | Partner | 3 nodes, distributed AZs | Production UDP throughput (partners, no rate limit) |
| 2c | UDP via NLB | Public | 3 nodes, distributed AZs | Production UDP throughput (subscribers, rate limited) |
| 2d | TCP via ALB | Combined | 3 nodes, distributed AZs | LPO + LRS simultaneous load (resource contention) |
| 2e | TCP via ALB | Endurance | 3 nodes, distributed AZs | 30-minute sustained load at 80% of ceiling |
```mermaid
graph TD
    START["Launch Stress Client<br/>96 vCPU"] --> P1["Phase 1: Single Container"]
    P1 --> T1A["1a: TCP Direct<br/>Raw TCP ceiling"]
    P1 --> T1B["1b: UDP Direct<br/>v6 engine ceiling"]
    T1A --> RECONFIG["Reconfigure:<br/>Distribute AZs<br/>Enable Governor"]
    T1B --> RECONFIG
    RECONFIG --> P2["Phase 2: Production Topology"]
    P2 --> T2A["2a: TCP ALB<br/>Public subscribers"]
    P2 --> T2B["2b: UDP NLB<br/>Partner — no rate limit"]
    P2 --> T2C["2c: UDP NLB<br/>Public — rate limited"]
    T2A --> T2D["2d: Combined<br/>LPO + LRS simultaneous"]
    T2B --> T2D
    T2C --> T2D
    T2D --> T2E["2e: Endurance<br/>30 min at 80% ceiling"]
    T2E --> RESTORE["Restore Production<br/>fresh-price profile"]
    style START fill:#4a5568,stroke:#718096,color:#e2e8f0
    style P1 fill:#5a4a2d,stroke:#9a7a4a,color:#cbd5e1
    style P2 fill:#2d4a6f,stroke:#4a7ab5,color:#cbd5e1
    style RECONFIG fill:#3d3a5c,stroke:#6b6399,color:#cbd5e1
    style RESTORE fill:#2d5a4a,stroke:#4a9a7a,color:#cbd5e1
    style T1A fill:#334155,stroke:#64748b,color:#94a3b8
    style T1B fill:#334155,stroke:#64748b,color:#94a3b8
    style T2A fill:#334155,stroke:#64748b,color:#94a3b8
    style T2B fill:#334155,stroke:#64748b,color:#94a3b8
    style T2C fill:#334155,stroke:#64748b,color:#94a3b8
    style T2D fill:#334155,stroke:#64748b,color:#94a3b8
    style T2E fill:#334155,stroke:#64748b,color:#94a3b8
```
Isolates a single container to find the raw per-node ceiling for each protocol. No load balancer, no TLS termination, no cross-AZ latency. Governor disabled. Stress client connects directly to the container IP. This establishes the baseline that Phase 2 builds on.
Measures the absolute single-container TCP ceiling. Stress client connects directly to the container IP on port 8080, bypassing the ALB. This is the purest throughput test — no load balancer overhead, no TLS termination.
Previous result (v5): 132K RPS at 3,000 concurrent, 100% success through 9,000 concurrent, p50=4.9ms. This is the benchmark to beat with the v6 binary.
Key metrics to watch: tcp_rps, syncmap_hit_pct (should be 99%+), db_in_use (should stay well below db_max_open), batch_flush_errors (should be 0).
The most anticipated test. v5 peaked at 44K RPS — bottlenecked by json.Marshal and single-socket WriteToUDP serialization. v6 introduces zero-alloc response building and multi-socket architecture. Target: 80K+ RPS.
Key metrics to watch: udp_rps, udp_drop_pct (the critical number — above 1% means workers can't keep up), bg_drop_pct (housekeeping saturation), CPU% in Container Insights.
All 3 containers running APP_REPORT_SERVER (LPO + LRS), distributed across AZs (2a/2b/2c), governor enabled. This is the real-world configuration. Every test in this phase answers a question about how the system performs under production conditions.
Production TCP throughput for public subscribers. Traffic flows through the ALB with TLS termination, rate limiting active, monthly cap checks enforced. This is the number that goes on the marketing page — what a subscriber actually experiences.
Key metrics to watch: total_rps across all 3 nodes (via /admin/cluster-stats), rate_limit_hits (should be 0 with stress tier key), ALB TargetResponseTime p99.
Production UDP throughput for AWS Partners. Partner API keys bypass all rate limiting, monthly caps, and billing checks — both in the TCP and UDP handlers. NLB operates at Layer 4 with near-zero added latency. This measures the fastest possible path through The Trinity Beast.
Key metrics to watch: udp_rps cluster-wide, udp_drop_pct per node, rate_limit_hits (should be exactly 0 — partner keys skip the limiter).
Same infrastructure as 2b, but with a public-tier API key. Rate limiting and monthly cap checks are active on every packet. This measures the exact overhead of rate limiting on the UDP path — compare udp_rps against test 2b to quantify the cost.
Key metrics to watch: udp_rps (compare to 2b), rate_limit_hits (should be 0 with stress-tier QPS), the delta between 2b and 2c reveals the per-packet cost of rate limit checks.
The resource contention test. Two stress clients run simultaneously — one hammering /price (LPO) and one hammering /reports/usage (LRS). Both services compete for the same CPU, memory, DB connections, and cache pool. The gap between this number and test 2a reveals exactly how much LRS overhead costs under load.
Key metrics to watch: tcp_rps + lrs_requests (both should be tracked), db_in_use (combined workload may approach db_max_open=180), batch_flush_errors.
Sustained load at 80% of the ceiling found in test 2a. Runs for 30 continuous minutes. This tests for memory leaks, connection pool exhaustion, goroutine accumulation, cache TTL edge cases, and Aurora ACU scaling behavior over time. If the system is stable at 80% for 30 minutes, it's production-ready.
Key metrics to watch: Memory% trend in Container Insights (should be flat, not climbing), db_wait_count (should stay at 0), Aurora ACU (should stabilize, not keep climbing), errors_5xx (must remain 0 for the full 30 minutes).
Streamlined from 16 profiles to 8. Each test uses a profile specifically optimized for its protocol and topology. Profiles are applied instantly via /admin/system-mode?mode=<name>.
| Profile | Test | Batch | uBat | Flush ms | DBOpen | DBIdle | CPool | CIdle | CRdMs |
|---|---|---|---|---|---|---|---|---|---|
| stress-tcp-direct | 1a | 300 | 500 | 2,000 | 150 | 150 | 2,997 | 999 | 500 |
| stress-udp-direct | 1b | 100 | 100 | 3,000 | 150 | 150 | 2,997 | 999 | 500 |
| stress-tcp-alb | 2a, 2e | 300 | 500 | 500 | 150 | 150 | 1,998 | 666 | 500 |
| stress-udp-nlb | 2b, 2c | 100 | 100 | 1,500 | 150 | 150 | 1,998 | 666 | 500 |
| stress-combined-alb | 2d | 300 | 500 | 500 | 180 | 180 | 1,998 | 666 | 500 |
All stress profiles share: QPS=100K, Burst=100K, TTL=300s, log_level=error, config_poll=300, cache_max_retries=1, cache_dial_ms=500, cache_write_ms=500.
Five data sources monitored simultaneously during each test. No X-Ray (adds latency). All monitoring is passive — zero impact on the system under test.
| Source | What to Watch | Red Flag |
|---|---|---|
| Container Insights | CPU%, Memory%, NetworkTx/Rx per container | CPU > 90%, Memory climbing (leak), Rx >> Tx (dropping packets) |
| Aurora Performance Insights | ACU usage, active sessions, top SQL, wait events | ACU > 10, sessions = max_open, IO:DataFileRead |
| ElastiCache Metrics | EngineCPU, CurrConnections, CacheHits/Misses | EngineCPU > 60% (single-threaded), Misses spiking |
| ALB/NLB Metrics | TargetResponseTime, 5xx count, ActiveConnections | Any 5xx, ResponseTime p99 > 500ms, UnhealthyHostCount > 0 |
| /admin/cluster-stats | All 24 counters aggregated across 3 nodes | syncmap_hit < 99%, udp_drop > 1%, bg_drop > 5%, flush_errors > 0 |
New in v5.0: The /admin/cluster-stats endpoint reads all 3 container snapshots from ElastiCache in a single pipeline call. No more polling individual containers through the ALB. One call, sub-millisecond, and guaranteed to return all 3 nodes together.
| Step | Command / Action |
|---|---|
| Apply profile | /admin/system-mode?mode=<profile> |
| Reset metrics | /admin/stress-reset |
| Verify reset | /admin/cluster-stats — confirm all counters are zero across all 3 nodes |
| Governor setting | Disabled for Phase 1 (direct), enabled for Phase 2 (production) |
| Trim usage_logs | Keep under 100K rows to avoid sync interference |
| Open dashboards | Container Insights, Aurora PI, ElastiCache, ALB/NLB metrics |
| Step | Action |
|---|---|
| Restore AZs | Main=2a, Mirror=2b, LRS=2c |
| Apply production profile | /admin/system-mode?mode=fresh-price |
| Re-enable governor | adaptive_enabled=true |
| Trim usage_logs | Remove stress test rows |
| Terminate stress client | EC2 instance termination |
| Record results | Update Performance Report document |
The v6 engine addresses every bottleneck identified during v5 testing. These are code-level changes targeting the CPU-bound UDP hot path.
| Optimization | v5 | v6 | Expected Impact |
|---|---|---|---|
| Response serialization | json.Marshal — reflection, interface boxing | buildUDPResponse() — direct byte append, pooled buffers | ~70% faster, zero heap allocations |
| Socket architecture | Single shared net.UDPConn | One socket per reader goroutine | 3× write parallelism |
| Worker pools | Shared across all readers | Per-socket pool with dedicated channel | Zero cross-socket contention |
| Buffer management | sync.Pool for read buffers only | sync.Pool for both read and response buffers | Reduced GC pressure |
| Rate limiting | Not enforced on UDP | Full rate limiting + monthly limits on UDP | Production-ready UDP security |
Total estimated time: ~2.5 hours including one infrastructure reconfiguration between phases.
| Block | Activity | Duration |
|---|---|---|
| Setup | Launch stress client EC2 (96 vCPU), consolidate AZs to 2a, apply stress profile, open dashboards | 20 min |
| **Phase 1 — Single Container Ceiling** | | |
| 1a | TCP Direct — single-container TCP ceiling | 15 min |
| 1b | UDP Direct — v6 engine UDP ceiling | 15 min |
| — | Reconfigure: distribute AZs (a/b/c), enable governor, restart for ALB/NLB pool sizes | 10 min |
| **Phase 2 — Production Topology** | | |
| 2a | TCP via ALB — public subscriber throughput | 15 min |
| 2b | UDP via NLB — partner access (no rate limit) | 15 min |
| 2c | UDP via NLB — public subscriber (rate limited) | 15 min |
| 2d | Combined — LPO + LRS simultaneous load | 20 min |
| 2e | Endurance — 30 min sustained at 80% ceiling | 35 min |
| **Restore Production** | | |
| — | Restore AZs, fresh-price profile, terminate stress client, record results | 15 min |
| **Total Estimated Time** | | **~2.5 hours** |
Only 1 infrastructure reconfiguration — between Phase 1 (direct, single AZ) and Phase 2 (distributed, 3 AZs). All Phase 2 tests share the same topology. Compare this to v4.1 which required 4 reconfigurations across 12 tests.