The Trinity Beast — Stress Test Plan v5.0

Streamlined performance validation — 7 targeted tests covering Public and Partner access paths with real-time 24-counter telemetry and cluster-wide aggregation.

Region: us-east-2 (Ohio) | Protocols: TCP + UDP | Containers: 3 × 8 vCPU / 32 GB | Tests: 7 | Est. Time: ~2.5 hours | Updated: April 2026

1. Test Philosophy — What Changed

The v4.1 plan had 12 tests across 3 phases — LPO Only, LRS Only, and Combined — with each phase tested across 4 topologies (TCP Direct, TCP ALB, UDP Direct, UDP NLB). Those 12 tests required 4 infrastructure reconfigurations and roughly 4.5 hours.

The v5.0 plan eliminates redundancy by recognizing three facts:

1. Production always runs LPO and LRS combined on APP_REPORT_SERVER, so single-service phases test a configuration that never ships.
2. The distinction that matters is the access path — Public (rate limited, billed) vs Partner (no limits, no billing) — not the load-balancer topology.
3. The raw per-container ceiling only needs to be measured once per protocol, directly, with no load balancer in the way.

Result: 7 targeted tests in 2 phases. ~2.5 hours instead of 4.5. Every test answers a specific question about a real access path. Zero redundancy.

Public vs Partner Access Paths

| Characteristic | Public Subscriber | AWS Partner |
|---|---|---|
| Protocol | TCP (HTTPS via ALB) or UDP (via NLB) | TCP (PrivateLink) or UDP (VPC Peering / NLB) |
| Rate Limiting | Enforced — QPS + burst + monthly cap | Bypassed — zero rate limiting, zero caps |
| TLS | ALB terminates TLS (adds ~1–2 ms) | PrivateLink or direct — no TLS overhead |
| Billing Checks | Monthly usage validated per request | Skipped entirely |
| API Key | Required — identifies account for billing | Required — identifies partner for tracking and analytics |
| Connection | Public internet → CloudFront → ALB | AWS backbone → PrivateLink / VPC Peering |

Partner API Keys: The exchanges do not require partners to authenticate — but we require an API key for every partner. This is not a restriction; it is visibility. You cannot manage what you cannot measure. A partner key with zero rate limits and zero billing still gives us per-partner usage tracking, analytics, and the ability to identify issues. That is good engineering.

2. Infrastructure Under Test

| Component | Specification | Role |
|---|---|---|
| ECS Fargate | 3 × 8 vCPU / 32 GB RAM (APP_REPORT_SERVER) | Application containers — LPO + LRS combined |
| ElastiCache | cache.r7g.2xlarge, Valkey 7.2, TLS, 52 GB | Price cache, usage log indexes, cluster stats |
| Aurora Serverless v2 | PostgreSQL 17.7, Optimized I/O, 2–18 ACU | Source of truth — API keys, usage logs, parameters |
| ALB | Trinity-Beast-TCP-ALB | TCP load balancing (443 → 8080/9090) |
| NLB | Trinity-Beast-UDP-NLB | UDP load balancing (2679/2680) |
| Stress Client | c6i.metal or equivalent (96 vCPU) | Load generator — same region (us-east-2) |

Hot Path: Price requests are served from an in-process sync.Map populated by 6 persistent WebSocket feeds (Coinbase, Gemini, Kraken, Gate.io, Bybit, OKX). Zero network calls on the hot path. ElastiCache is the second layer (sub-millisecond). REST API fallback is the third layer (cache miss only). Under stress testing with 300s cache TTL, 99%+ of requests hit the sync.Map.
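
A minimal sketch of that three-layer read path. The identifiers here (PriceEntry, fetchFromElastiCache, fetchFromREST) are illustrative assumptions, not the application's actual names:

```go
package main

import (
	"context"
	"errors"
	"sync"
)

// PriceEntry is a stand-in for the cached price record.
type PriceEntry struct {
	Symbol string
	Price  float64
}

// priceCache is the in-process layer, kept current by the 6 persistent
// WebSocket feeds. Reads here involve zero network calls.
var priceCache sync.Map // symbol -> *PriceEntry

var errMiss = errors.New("cache miss")

// Layer 2 and 3 stubs: ElastiCache and the REST fallback.
func fetchFromElastiCache(ctx context.Context, symbol string) (*PriceEntry, error) {
	return nil, errMiss
}
func fetchFromREST(ctx context.Context, symbol string) (*PriceEntry, error) {
	return nil, errMiss
}

// getPrice walks the three layers in order: sync.Map (99%+ of hits at
// 300s TTL), then ElastiCache (sub-millisecond), then REST (miss only).
func getPrice(ctx context.Context, symbol string) (*PriceEntry, error) {
	if v, ok := priceCache.Load(symbol); ok {
		return v.(*PriceEntry), nil
	}
	if p, err := fetchFromElastiCache(ctx, symbol); err == nil {
		priceCache.Store(symbol, p) // repopulate layer 1
		return p, nil
	}
	return fetchFromREST(ctx, symbol)
}
```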

3. Progressive Load Ramp (13 Levels)

Each test follows a 13-level progressive load ramp. Concurrency climbs at every level, from 30 to 21,000, while request volume rises with it. This reveals the exact concurrency threshold where performance degrades.

| Level | Requests | Concurrent | Purpose |
|---|---|---|---|
| 1 | 300 | 30 | Warm-up — connection pool initialization |
| 2 | 900 | 90 | Light load — verify cold-start fix |
| 3 | 3,000 | 300 | Moderate load — baseline throughput |
| 4 | 9,000 | 600 | Sustained load — batch pipeline under pressure |
| 5 | 30,000 | 900 | High load — entering sweet spot |
| 6 | 90,000 | 1,500 | Heavy load — peak throughput zone |
| 7 | 300,000 | 3,000 | Extreme load — maximum RPS target |
| 8 | 600,000 | 6,000 | Overload — testing graceful degradation |
| 9 | 900,000 | 9,000 | Severe overload — success rate threshold |
| 10 | 1,500,000 | 12,000 | Breaking point — where failures begin |
| 11 | 3,000,000 | 15,000 | Recovery test — can the system stabilize? |
| 12 | 6,000,000 | 18,000 | Endurance — sustained extreme load |
| 13 | 9,000,000 | 21,000 | Absolute ceiling — maximum concurrent connections |

Success Criteria: 100% success rate through level 9 (9,000 concurrent). Graceful degradation above that. p50 latency under 10ms for cache hits. p99 under 300ms through the sweet spot (levels 5–8). Zero 5xx errors through level 9.
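
For reference, the ramp as data: a sketch of how a load driver might encode the 13 levels and the level-9 success gate. The actual stress client is not specified in this plan, so types and logic here are assumptions:

```go
package main

// rampLevel pairs a request volume with the concurrency to run it at.
type rampLevel struct {
	requests   int
	concurrent int
	purpose    string
}

// The 13 levels from the table above.
var ramp = []rampLevel{
	{300, 30, "warm-up"},
	{900, 90, "light load"},
	{3_000, 300, "moderate load"},
	{9_000, 600, "sustained load"},
	{30_000, 900, "high load"},
	{90_000, 1_500, "heavy load"},
	{300_000, 3_000, "extreme load"},
	{600_000, 6_000, "overload"},
	{900_000, 9_000, "severe overload"},
	{1_500_000, 12_000, "breaking point"},
	{3_000_000, 15_000, "recovery"},
	{6_000_000, 18_000, "endurance"},
	{9_000_000, 21_000, "absolute ceiling"},
}

// runRamp executes each level in order. Per the success criteria, 100%
// success is required through level 9 (9,000 concurrent); graceful
// degradation is acceptable above that, so the run continues.
func runRamp(run func(rampLevel) (successPct float64)) {
	for _, lvl := range ramp {
		if run(lvl) < 100 && lvl.concurrent <= 9_000 {
			break // hard failure inside the required range
		}
	}
}
```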

4. Real-Time Telemetry & Cluster Aggregation

Every container runs 24 atomic.Int64 counters that track the complete request lifecycle. The /admin/stress-stats endpoint returns a per-container snapshot. The /admin/cluster-stats endpoint reads all 3 snapshots from ElastiCache in a single pipeline call: one round-trip, sub-millisecond, covering all 3 containers.

- Throughput: tcp_rps, udp_rps, total_rps — real-time requests per second by protocol
- Cache Performance: syncmap_hit_pct, cache_hit_pct — three-layer visibility; sync.Map should be 99%+ at 300s TTL
- UDP Health: udp_drop_pct, packets_received vs packets_sent — packet loss visibility
- Background Pool: bg_drop_pct, submitted vs completed — housekeeping saturation
- SQS Pipeline: rows_queued — SQS messages sent (usage log entries queued for the Lambda consumer)
- DB Connections: db_open, db_in_use, db_wait_count — connection pool utilization
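
A sketch of both halves of this pipeline: lock-free counters on the container side and the single pipelined read on the aggregation side. It uses go-redis v9; the key suffixes are assumptions based on the cluster:stats:* pattern in Diagram 4.1 below:

```go
package main

import (
	"context"
	"sync/atomic"

	"github.com/redis/go-redis/v9"
)

// A few of the 24 lifecycle counters; each is safe to increment from
// any request goroutine without locking.
type stressCounters struct {
	tcpRequests atomic.Int64
	udpRequests atomic.Int64
	syncMapHits atomic.Int64
	udpDrops    atomic.Int64
}

// clusterSnapshots fetches every container snapshot in one pipeline
// round-trip, as /admin/cluster-stats does. Key names are assumptions.
func clusterSnapshots(ctx context.Context, rdb *redis.Client) (map[string]string, error) {
	keys := []string{"cluster:stats:main", "cluster:stats:mirror", "cluster:stats:lrs"}
	pipe := rdb.Pipeline()
	cmds := make(map[string]*redis.StringCmd, len(keys))
	for _, k := range keys {
		cmds[k] = pipe.Get(ctx, k)
	}
	if _, err := pipe.Exec(ctx); err != nil && err != redis.Nil {
		return nil, err
	}
	out := make(map[string]string, len(cmds))
	for k, c := range cmds {
		if s, err := c.Result(); err == nil {
			out[k] = s // JSON snapshot published every 3s, TTL 30s
		}
	}
	return out, nil
}
```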

Diagram 4.1 — Telemetry Pipeline
```mermaid
graph LR
    subgraph Containers
        M["BeastMain<br/>24 counters"] -->|"every 3s"| EC
        Mi["BeastMirror<br/>24 counters"] -->|"every 3s"| EC
        L["BeastLRS<br/>24 counters"] -->|"every 3s"| EC
    end
    EC[("ElastiCache<br/>cluster:stats:*<br/>TTL 30s")]
    EC -->|"1 pipeline read"| CS["/admin/cluster-stats"]
    CS --> TBCC["Command Center<br/>Cluster Health"]
    CS --> CW["CloudWatch<br/>6 Metrics"]
    style M fill:#334155,stroke:#64748b,color:#94a3b8
    style Mi fill:#334155,stroke:#64748b,color:#94a3b8
    style L fill:#334155,stroke:#64748b,color:#94a3b8
    style EC fill:#5a3a3a,stroke:#8a5a5a,color:#e2c8c8
    style CS fill:#2d5a4a,stroke:#4a9a7a,color:#cbd5e1
    style TBCC fill:#2d4a6f,stroke:#4a7ab5,color:#cbd5e1
    style CW fill:#3d3a5c,stroke:#6b6399,color:#cbd5e1
```

The Tuning Loop: Poll /admin/cluster-stats every 3 seconds during testing. If udp_drop_pct exceeds 1%, increase worker pool capacity. If bg_drop_pct exceeds 5%, relax flush intervals. If db_in_use approaches db_max_open, the connection pool is saturating. Each metric points to a specific application parameter — most adjustable without a restart via /admin/system-mode.
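
A sketch of that polling loop. The JSON field names mirror the counters above, but the exact response shape of /admin/cluster-stats is an assumption:

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
	"time"
)

// watchCluster polls /admin/cluster-stats every 3s and logs the
// threshold breaches called out in the tuning loop.
func watchCluster(baseURL string) {
	for range time.Tick(3 * time.Second) {
		resp, err := http.Get(baseURL + "/admin/cluster-stats")
		if err != nil {
			log.Printf("poll failed: %v", err)
			continue
		}
		var s map[string]float64
		err = json.NewDecoder(resp.Body).Decode(&s)
		resp.Body.Close()
		if err != nil {
			log.Printf("decode failed: %v", err)
			continue
		}
		if s["udp_drop_pct"] > 1 {
			log.Print("udp_drop_pct > 1%: increase worker pool capacity")
		}
		if s["bg_drop_pct"] > 5 {
			log.Print("bg_drop_pct > 5%: relax flush intervals")
		}
		if s["db_in_use"] >= 0.9*s["db_max_open"] {
			log.Print("connection pool nearing saturation")
		}
	}
}
```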

5. Test Matrix — 7 Tests, 2 Phases

| Test | Protocol | Access Path | Topology | What It Measures |
|---|---|---|---|---|
| **Phase 1 — Single Container Ceiling (Direct, No Load Balancer)** | | | | |
| 1a | TCP | n/a | Direct to 1 node | Absolute single-container TCP ceiling |
| 1b | UDP | n/a | Direct to 1 node | Absolute single-container UDP ceiling (v6 engine) |
| **Phase 2 — Production Topology (3 Nodes, APP_REPORT_SERVER, Load Balanced)** | | | | |
| 2a | TCP via ALB | Public | 3 nodes, distributed AZs | Production TCP throughput (subscribers, rate limited) |
| 2b | UDP via NLB | Partner | 3 nodes, distributed AZs | Production UDP throughput (partners, no rate limit) |
| 2c | UDP via NLB | Public | 3 nodes, distributed AZs | Production UDP throughput (subscribers, rate limited) |
| 2d | TCP via ALB | Combined | 3 nodes, distributed AZs | LPO + LRS simultaneous load (resource contention) |
| 2e | TCP via ALB | Endurance | 3 nodes, distributed AZs | 30-minute sustained load at 80% of ceiling |
Diagram 5.1 — Test Flow
```mermaid
graph TD
    START["Launch Stress Client<br/>96 vCPU"] --> P1["Phase 1: Single Container"]
    P1 --> T1A["1a: TCP Direct<br/>Raw TCP ceiling"]
    P1 --> T1B["1b: UDP Direct<br/>v6 engine ceiling"]
    T1A --> RECONFIG["Reconfigure:<br/>Distribute AZs<br/>Enable Governor"]
    T1B --> RECONFIG
    RECONFIG --> P2["Phase 2: Production Topology"]
    P2 --> T2A["2a: TCP ALB<br/>Public subscribers"]
    P2 --> T2B["2b: UDP NLB<br/>Partner — no rate limit"]
    P2 --> T2C["2c: UDP NLB<br/>Public — rate limited"]
    T2A --> T2D["2d: Combined<br/>LPO + LRS simultaneous"]
    T2B --> T2D
    T2C --> T2D
    T2D --> T2E["2e: Endurance<br/>30 min at 80% ceiling"]
    T2E --> RESTORE["Restore Production<br/>fresh-price profile"]
    style START fill:#4a5568,stroke:#718096,color:#e2e8f0
    style P1 fill:#5a4a2d,stroke:#9a7a4a,color:#cbd5e1
    style P2 fill:#2d4a6f,stroke:#4a7ab5,color:#cbd5e1
    style RECONFIG fill:#3d3a5c,stroke:#6b6399,color:#cbd5e1
    style RESTORE fill:#2d5a4a,stroke:#4a9a7a,color:#cbd5e1
    style T1A fill:#334155,stroke:#64748b,color:#94a3b8
    style T1B fill:#334155,stroke:#64748b,color:#94a3b8
    style T2A fill:#334155,stroke:#64748b,color:#94a3b8
    style T2B fill:#334155,stroke:#64748b,color:#94a3b8
    style T2C fill:#334155,stroke:#64748b,color:#94a3b8
    style T2D fill:#334155,stroke:#64748b,color:#94a3b8
    style T2E fill:#334155,stroke:#64748b,color:#94a3b8
```

6. Phase 1 — Single Container Ceiling

Isolates a single container to find the raw per-node ceiling for each protocol. No load balancer, no TLS termination, no cross-AZ latency. Governor disabled. Stress client connects directly to the container IP. This establishes the baseline that Phase 2 builds on.

Test 1a: TCP Direct — Single Node
Profile: stress-tcp-direct | AZ: all in 2a | Governor: DISABLED | Est: 15 min

Measures the absolute single-container TCP ceiling. Stress client connects directly to the container IP on port 8080, bypassing the ALB. This is the purest throughput test — no load balancer overhead, no TLS termination.

Previous result (v5): 132K RPS at 3,000 concurrent, 100% success through 9,000 concurrent, p50=4.9ms. This is the benchmark to beat with the v6 binary.

Key metrics to watch: tcp_rps, syncmap_hit_pct (should be 99%+), db_in_use (should stay well below db_max_open), batch_flush_errors (should be 0).

Test 1b: UDP Direct — Single Node (v6 Engine)
Profile: stress-udp-direct | AZ: all in 2a | Governor: DISABLED | Est: 15 min

The most anticipated test. v5 peaked at 44K RPS — bottlenecked by json.Marshal and single-socket WriteToUDP serialization. v6 introduces zero-alloc response building and multi-socket architecture. Target: 80K+ RPS.

Key metrics to watch: udp_rps, udp_drop_pct (the critical number — above 1% means workers can't keep up), bg_drop_pct (housekeeping saturation), CPU% in Container Insights.

7. Phase 2 — Production Topology

All 3 containers running APP_REPORT_SERVER (LPO + LRS), distributed across AZs (2a/2b/2c), governor enabled. This is the real-world configuration. Every test in this phase answers a question about how the system performs under production conditions.

Test 2a: TCP via ALB — Public Subscribers
Profile: stress-tcp-alb | AZs: a/b/c | Governor: ENABLED | API Key: public tier | Est: 15 min

Production TCP throughput for public subscribers. Traffic flows through the ALB with TLS termination, rate limiting active, monthly cap checks enforced. This is the number that goes on the marketing page — what a subscriber actually experiences.

Key metrics to watch: total_rps across all 3 nodes (via /admin/cluster-stats), rate_limit_hits (should be 0 with stress tier key), ALB TargetResponseTime p99.

Test 2b: UDP via NLB — Partner Access (No Rate Limit)
Profile: stress-udp-nlb | AZs: a/b/c | Governor: ENABLED | API Key: partner tier | Est: 15 min

Production UDP throughput for AWS Partners. Partner API keys bypass all rate limiting, monthly caps, and billing checks — both in the TCP and UDP handlers. NLB operates at Layer 4 with near-zero added latency. This measures the fastest possible path through The Trinity Beast.

Key metrics to watch: udp_rps cluster-wide, udp_drop_pct per node, rate_limit_hits (should be exactly 0 — partner keys skip the limiter).
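
A minimal sketch of the tier branch this test exercises. All types and names here (TierPartner, checkMonthlyCap, the limiter wiring) are hypothetical stand-ins for the application's actual auth path:

```go
package main

import (
	"errors"

	"golang.org/x/time/rate"
)

type keyTier int

const (
	TierPublic keyTier = iota
	TierPartner
)

type apiKey struct {
	ID   string
	Tier keyTier
}

var errRateLimited = errors.New("rate limited")

func checkMonthlyCap(k *apiKey) error { return nil } // stub

// authorize: partner keys skip the limiter, caps, and billing checks
// entirely (usage is still counted for analytics); public keys pass
// through QPS + burst limiting and the monthly cap check.
func authorize(k *apiKey, lim *rate.Limiter) error {
	if k.Tier == TierPartner {
		return nil
	}
	if !lim.Allow() {
		return errRateLimited
	}
	return checkMonthlyCap(k)
}
```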

Test 2c: UDP via NLB — Public Subscribers (Rate Limited)
Profile: stress-udp-nlb | AZs: a/b/c | Governor: ENABLED | API Key: public tier | Est: 15 min

Same infrastructure as 2b, but with a public-tier API key. Rate limiting and monthly cap checks are active on every packet. This measures the exact overhead of rate limiting on the UDP path — compare udp_rps against test 2b to quantify the cost.

Key metrics to watch: udp_rps (compare to 2b) and rate_limit_hits (should be 0 with stress-tier QPS). The delta between 2b and 2c reveals the per-packet cost of rate-limit checks.

Test 2d: Combined Load — LPO + LRS Simultaneous
Profile: stress-combined-alb | AZs: a/b/c | Governor: ENABLED | Est: 20 min

The resource contention test. Two stress clients run simultaneously — one hammering /price (LPO) and one hammering /reports/usage (LRS). Both services compete for the same CPU, memory, DB connections, and cache pool. The gap between this number and test 2a reveals exactly how much LRS overhead costs under load.

Key metrics to watch: tcp_rps + lrs_requests (both should be tracked), db_in_use (combined workload may approach db_max_open=180), batch_flush_errors.

Test 2e: Endurance — 30 Minutes at 80% Ceiling
Profile: stress-tcp-alb | AZs: a/b/c | Governor: ENABLED | Est: 35 min

Sustained load at 80% of the ceiling found in test 2a. Runs for 30 continuous minutes. This tests for memory leaks, connection pool exhaustion, goroutine accumulation, cache TTL edge cases, and Aurora ACU scaling behavior over time. If the system is stable at 80% for 30 minutes, it's production-ready.

Key metrics to watch: Memory% trend in Container Insights (should be flat, not climbing), db_wait_count (should stay at 0), Aurora ACU (should stabilize, not keep climbing), errors_5xx (must remain 0 for the full 30 minutes).
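
One way to watch for the leak signatures this test targets is to sample goroutine count and heap usage from inside the process, as a passive companion to the Container Insights memory trend. A sketch:

```go
package main

import (
	"log"
	"runtime"
	"time"
)

// sampleRuntime logs goroutine count and heap bytes at a fixed interval.
// Over a 30-minute run both should plateau; a steady climb suggests
// goroutine accumulation or a memory leak.
func sampleRuntime(interval time.Duration) {
	var m runtime.MemStats
	for range time.Tick(interval) {
		runtime.ReadMemStats(&m)
		log.Printf("goroutines=%d heap_alloc_mb=%d",
			runtime.NumGoroutine(), m.HeapAlloc/1024/1024)
	}
}
```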

8. Application Parameter Profiles

Streamlined from 16 profiles to 8. Each test uses a profile specifically optimized for its protocol and topology. Profiles are applied instantly via /admin/system-mode?mode=<name>.

Profile Matrix

| Profile | Test | Batch | uBat | Flush ms | DBOpen | DBIdle | CPool | CIdle | CRdMs |
|---|---|---|---|---|---|---|---|---|---|
| stress-tcp-direct | 1a | 300 | 500 | 2,000 | 150 | 150 | 2,997 | 999 | 500 |
| stress-udp-direct | 1b | 100 | 100 | 3,000 | 150 | 150 | 2,997 | 999 | 500 |
| stress-tcp-alb | 2a, 2e | 300 | 500 | 500 | 150 | 150 | 1,998 | 666 | 500 |
| stress-udp-nlb | 2b, 2c | 100 | 100 | 1,500 | 150 | 150 | 1,998 | 666 | 500 |
| stress-combined-alb | 2d | 300 | 500 | 500 | 180 | 180 | 1,998 | 666 | 500 |

All stress profiles share: QPS=100K, Burst=100K, TTL=300s, log_level=error, config_poll=300, cache_max_retries=1, cache_dial_ms=500, cache_write_ms=500.

9. Race Day Monitoring

Five data sources monitored simultaneously during each test. No X-Ray (adds latency). All monitoring is passive — zero impact on the system under test.

| Source | What to Watch | Red Flag |
|---|---|---|
| Container Insights | CPU%, Memory%, NetworkTx/Rx per container | CPU > 90%, Memory climbing (leak), Rx >> Tx (dropping packets) |
| Aurora Performance Insights | ACU usage, active sessions, top SQL, wait events | ACU > 10, sessions = max_open, IO:DataFileRead |
| ElastiCache Metrics | EngineCPU, CurrConnections, CacheHits/Misses | EngineCPU > 60% (single-threaded), Misses spiking |
| ALB/NLB Metrics | TargetResponseTime, 5xx count, ActiveConnections | Any 5xx, ResponseTime p99 > 500ms, UnhealthyHostCount > 0 |
| /admin/cluster-stats | All 24 counters aggregated across 3 nodes | syncmap_hit < 99%, udp_drop > 1%, bg_drop > 5%, flush_errors > 0 |

New in v5.0: The /admin/cluster-stats endpoint reads all 3 container snapshots from ElastiCache in a single pipeline call. No more polling individual containers through the ALB. One call, sub-millisecond, covering all 3 nodes.

10. Pre-Flight & Post-Test Checklists

Pre-Flight (Before Each Test)

| Step | Command / Action |
|---|---|
| Apply profile | /admin/system-mode?mode=<profile> |
| Reset metrics | /admin/stress-reset |
| Verify reset | /admin/cluster-stats — confirm all counters are zero across all 3 nodes |
| Governor setting | Disabled for Phase 1 (direct), enabled for Phase 2 (production) |
| Trim usage_logs | Keep under 100K rows to avoid sync interference |
| Open dashboards | Container Insights, Aurora PI, ElastiCache, ALB/NLB metrics |
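
The first three steps lend themselves to a tiny script. A sketch, assuming the admin endpoints accept plain GETs and authentication is handled elsewhere:

```go
package main

import (
	"fmt"
	"net/http"
)

// preflight applies the test's profile and resets the counters; the
// operator then verifies zeroed counters via the cluster-stats response.
func preflight(baseURL, profile string) error {
	for _, path := range []string{
		"/admin/system-mode?mode=" + profile,
		"/admin/stress-reset",
		"/admin/cluster-stats", // eyeball this response: all counters zero
	} {
		resp, err := http.Get(baseURL + path)
		if err != nil {
			return err
		}
		resp.Body.Close()
		if resp.StatusCode != http.StatusOK {
			return fmt.Errorf("%s returned %s", path, resp.Status)
		}
	}
	return nil
}
```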

Post-Test (After All Tests Complete)

| Step | Action |
|---|---|
| Restore AZs | Main=2a, Mirror=2b, LRS=2c |
| Apply production profile | /admin/system-mode?mode=fresh-price |
| Re-enable governor | adaptive_enabled=true |
| Trim usage_logs | Remove stress test rows |
| Terminate stress client | EC2 instance termination |
| Record results | Update Performance Report document |

11. Engine Optimizations (v5 → v6)

The v6 engine addresses every bottleneck identified during v5 testing. These are code-level changes targeting the CPU-bound UDP hot path.

| Optimization | v5 | v6 | Expected Impact |
|---|---|---|---|
| Response serialization | json.Marshal — reflection, interface boxing | buildUDPResponse() — direct byte append, pooled buffers | ~70% faster, zero heap allocations |
| Socket architecture | Single shared net.UDPConn | One socket per reader goroutine | 3× write parallelism |
| Worker pools | Shared across all readers | Per-socket pool with dedicated channel | Zero cross-socket contention |
| Buffer management | sync.Pool for read buffers only | sync.Pool for both read and response buffers | Reduced GC pressure |
| Rate limiting | Not enforced on UDP | Full rate limiting + monthly limits on UDP | Production-ready UDP security |
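
A sketch of the zero-alloc pattern named in the first row: appending a fixed-shape response into a pooled buffer instead of calling json.Marshal. The field names are illustrative; the real wire format belongs to the application:

```go
package main

import "sync"

// bufPool recycles response buffers so the UDP hot path stays free of
// per-request heap allocations.
var bufPool = sync.Pool{
	New: func() any { return make([]byte, 0, 256) },
}

// buildUDPResponse assembles the response with direct byte appends: no
// reflection, no interface boxing, no json.Marshal.
func buildUDPResponse(symbol, price string) []byte {
	b := bufPool.Get().([]byte)[:0]
	b = append(b, `{"symbol":"`...)
	b = append(b, symbol...)
	b = append(b, `","price":`...)
	b = append(b, price...)
	b = append(b, '}')
	return b // the writer returns it via bufPool.Put after WriteToUDP
}
```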

12. Estimated Timeline

Total estimated time: ~2.5 hours including one infrastructure reconfiguration between phases.

| Block | Activity | Duration |
|---|---|---|
| Setup | Launch stress client EC2 (96 vCPU), consolidate AZs to 2a, apply stress profile, open dashboards | 20 min |
| **Phase 1 — Single Container Ceiling** | | |
| 1a | TCP Direct — single-container TCP ceiling | 15 min |
| 1b | UDP Direct — v6 engine UDP ceiling | 15 min |
| Reconfigure | Distribute AZs (a/b/c), enable governor, restart for ALB/NLB pool sizes | 10 min |
| **Phase 2 — Production Topology** | | |
| 2a | TCP via ALB — public subscriber throughput | 15 min |
| 2b | UDP via NLB — partner access (no rate limit) | 15 min |
| 2c | UDP via NLB — public subscriber (rate limited) | 15 min |
| 2d | Combined — LPO + LRS simultaneous load | 20 min |
| 2e | Endurance — 30 min sustained at 80% ceiling | 35 min |
| Restore Production | Restore AZs, fresh-price profile, terminate stress client, record results | 15 min |
| **Total** | | **~2.5 hours** |

Only one infrastructure reconfiguration is required — between Phase 1 (direct, single AZ) and Phase 2 (distributed, 3 AZs). All Phase 2 tests share the same topology. Compare this to v4.1, which required 4 reconfigurations across 12 tests.