The Trinity Beast — Performance Optimization Guide

Connection pooling, three-tier cache strategies, GC tuning, rate limiter configuration, ALB settings, and cost optimization.

Version: v15 Region: us-east-2 Updated: April 2026

Current Status Overview

✅ Already Optimized

Network & Load Balancing

  • v3.9.3 ALB Connection Tuning: 60s idle timeout, 120s keep-alive, 10s deregistration delay, LOR routing on both target groups, invalid header rejection
  • v3.9.3 NLB Connection Tuning: Cross-zone load balancing enabled, 10s deregistration delay on both UDP target groups, healthy threshold reduced to 2 (1 min recovery vs 2.5 min)
  • CORS: Enabled with minimal overhead

Price Feed Architecture

  • 6x WebSocket Price Feeds: Coinbase, Gemini, Kraken, Gate.io, Bybit, OKX — persistent push-based connections; prices for 150 prewarmed assets arrive before requests do
  • Per-Container WebSocket Independence: Each container runs its own 6 WS connections, local-only sync.Map writes (no ElastiCache hammering)
  • 3x REST Fallbacks: Gemini → Coinbase → Kraken with health tracking (only used if all WebSocket feeds are stale)
  • AWS Backbone Priority: WebSocket feeds checked first (0ms, in-memory), then AWS-hosted REST (Gemini/Coinbase), then internet (Kraken) as a last resort (see the lookup sketch after this list)
  • Response-First Architecture: Background logging and metrics — response sent before any write operations
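
How the WebSocket-first lookup fits together, as a minimal sketch. The names (priceCache, onTrade, getPrice) and the 30-second staleness window are illustrative assumptions, not the production identifiers:

package pricefeed

import (
    "context"
    "errors"
    "sync"
    "time"
)

type cachedPrice struct {
    Price     float64
    UpdatedAt time.Time
}

// Written only by this container's local WebSocket reader goroutines.
var priceCache sync.Map // symbol -> cachedPrice

// onTrade is called by each exchange's WebSocket reader on every trade message.
func onTrade(symbol string, price float64) {
    priceCache.Store(symbol, cachedPrice{Price: price, UpdatedAt: time.Now()})
}

// getPrice is the request-path lookup: in-memory first, REST only if stale.
func getPrice(ctx context.Context, symbol string) (float64, error) {
    if v, ok := priceCache.Load(symbol); ok {
        p := v.(cachedPrice)
        if time.Since(p.UpdatedAt) < 30*time.Second { // staleness window is an assumption
            return p.Price, nil
        }
    }
    // The REST fallback chain (Gemini → Coinbase → Kraken) would run here.
    return 0, errors.New("no fresh price; REST fallback omitted in this sketch")
}

Because each container's six WebSocket readers write only to their own sync.Map, the hot path never touches ElastiCache for prices.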

Compute & Runtime

  • Fargate Tasks: 8 vCPU / 32 GB each — all 3 running APP_REPORT_SERVER across 3 AZs
  • Go Runtime: Using all CPUs with runtime.GOMAXPROCS(runtime.NumCPU())
  • Garbage Collection: GOGC=300 (configurable via env var, up from 200)
  • v3.3 Background Worker Pool: 999 slots (up from 500)
  • v3.3 System Mode Toggle: Demo/Performance/Debug profiles via /admin/system-mode

Cache & Data Layer

  • ElastiCache cache.r7g.2xlarge: 52.8 GB cache memory, 400K+ ops/sec capacity, single node, no replica
  • v3.9.3 ElastiCache Pipelining: All 6 sequential HGetAll loops (4 LRS + 2 UDP) replaced with single-round-trip pipelines via PipelineHGetAll()
  • ElastiCache API Key Cache: 3-layer lookup (sync.Map → ElastiCache → Aurora) with write-through
  • ElastiCache App Config: Application parameters read from ElastiCache first, Aurora fallback
  • Shared Rate Limiting: Atomic Lua script in ElastiCache — all 3 containers share rate limit counters (see the sketch after this list)
  • Real-time Usage Counters: HINCRBY in ElastiCache on every request for instant LRS stats
  • ElastiCache Connection Pooling: 300 pool size, 60 min idle connections
  • Aurora I/O-Optimized storage: unlimited IOPS, no per-I/O charges, 40% cost savings, 2–18 ACU
  • Database Connection Pooling: Configurable via app params (150 open / 75 idle per container)
  • v3.3 Micro-Batch Aurora Write Smoothing: 300 rows / 100ms (configurable via app params)
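
A minimal sketch of the shared rate-limit counter, assuming a fixed-window INCR + EXPIRE Lua script; the production script, key names, and window shape may differ:

package ratelimit

import (
    "context"

    "github.com/redis/go-redis/v9"
)

// One Lua script makes the increment-and-expire atomic, so all three
// containers can share a single counter per API key in ElastiCache.
var rateLimitScript = redis.NewScript(`
local current = redis.call("INCR", KEYS[1])
if current == 1 then
  redis.call("EXPIRE", KEYS[1], ARGV[2])
end
if current > tonumber(ARGV[1]) then
  return 0
end
return 1
`)

// Allow reports whether apiKey is still under `limit` requests per `windowSec` seconds.
func Allow(ctx context.Context, rdb *redis.Client, apiKey string, limit, windowSec int) (bool, error) {
    res, err := rateLimitScript.Run(ctx, rdb, []string{"ratelimit:" + apiKey}, limit, windowSec).Int()
    if err != nil {
        return false, err
    }
    return res == 1, nil
}

Each request calls Allow with the key's limit; because the script executes atomically inside ElastiCache, every container sees the same counter.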

UDP Protocol (v8 Engine)

  • v8 SO_REUSEPORT: 8 sockets per protocol — per-socket kernel receive queues eliminate the single-buffer bottleneck (see the socket setup sketch after this list)
  • v8 recvmmsg Batch Reads: 32 datagrams per syscall (~32× reduction in read syscalls)
  • v8 Pre-Serialized Response Cache: sync.Map of pre-built byte slices (~2× faster for cache hits)
  • v8 32 MB Socket Buffers: Per socket (up from 8 MB in v3.3)
  • v8 1,024 Concurrent Handlers: 8 SO_REUSEPORT sockets × 128 workers per socket
  • UDP 3-Tier Cache: sync.Map → ElastiCache → REST (matches TCP handler)
  • v3.3 Compiled Go Stress Test Client: cmd/stress/ in mono repo
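
For reference, a minimal sketch of binding one of the SO_REUSEPORT sockets with 32 MB kernel buffers, assuming net.ListenConfig and golang.org/x/sys/unix on Linux. The recvmmsg batch-read loop and the 128-worker fan-out per socket are not shown:

package udpserver

import (
    "context"
    "net"
    "syscall"

    "golang.org/x/sys/unix"
)

// openReusePortSocket binds one of N UDP sockets to the same address.
// SO_REUSEPORT gives each socket its own kernel receive queue, which is the
// mechanism the v8 engine relies on; the address and buffer sizes here are
// illustrative.
func openReusePortSocket(ctx context.Context, addr string) (*net.UDPConn, error) {
    lc := net.ListenConfig{
        Control: func(network, address string, c syscall.RawConn) error {
            var sockErr error
            if err := c.Control(func(fd uintptr) {
                sockErr = unix.SetsockoptInt(int(fd), unix.SOL_SOCKET, unix.SO_REUSEPORT, 1)
            }); err != nil {
                return err
            }
            return sockErr
        },
    }
    pc, err := lc.ListenPacket(ctx, "udp", addr)
    if err != nil {
        return nil, err
    }
    conn := pc.(*net.UDPConn)
    // 32 MB kernel buffers per socket, matching the v8 configuration above.
    if err := conn.SetReadBuffer(32 << 20); err != nil {
        return nil, err
    }
    if err := conn.SetWriteBuffer(32 << 20); err != nil {
        return nil, err
    }
    return conn, nil
}

Calling this N times on the same address yields N independently queued sockets, each typically owned by its own pool of reader workers.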

Current Performance Metrics

Metric | Value
TCP Peak (Direct) | 369,600 req/sec
Combined Sustained | 746,374 req/sec
TCP Avg Latency | 0.3ms
UDP Avg Latency | 0.2ms
Cache Hit Rate | 99%+
WebSocket Feeds | 6 Active
ElastiCache Pool | 300 conn
Aurora ACU | 2–18

Implemented in v3.3 ✅ Shipped

The following optimizations were implemented and validated during the v3.3 stress test session. Each change was tested under sustained load with the compiled Go stress client.

Optimization | Before (v3.0) | After (v3.3) | Impact
Container CPU | 2 vCPU / 8 GB | 8 vCPU / 32 GB | 4x throughput, no CPU saturation
Aurora ACU ceiling | 6 | 18 | Supports 193K req/sec
GC tuning | GOGC=200 | GOGC=300 | Fewer GC pauses under load
Worker pool | 500 slots | 999 slots | More background work capacity
ElastiCache pool | 50 connections | 300 connections, 60 min idle | No pool exhaustion under load
UDP readers | 1 per socket | 3 per socket | Parallel packet intake
UDP buffers | OS default (~200KB) | 8MB read + 8MB write | No packet loss at high throughput
UDP cache | No ElastiCache tier | Full 3-tier (sync.Map → ElastiCache → REST) | Matches TCP cache architecture
Batch writes | 500 rows / 10s (bursty) | 300 rows / 100ms micro-batch (smooth) | Aurora ACU spikes eliminated
Test client | Python (GIL-bound, ~200 req/sec UDP) | Compiled Go (487K+ req/sec UDP) | Accurate server benchmarking
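
The batch-writes row above follows the usual buffered-writer pattern: flush when the buffer hits the row cap or when the interval ticks, whichever comes first. A minimal sketch with assumed names (runMicroBatcher, Row, flush); the production version reads its 300-row / 100ms settings from app params:

package batcher

import (
    "context"
    "time"
)

// Row stands in for one usage-log record; the real schema differs.
type Row struct{ /* ... */ }

// runMicroBatcher drains rows from ch and flushes to Aurora whenever the
// buffer reaches maxRows or every flushEvery, keeping writes small and
// steady instead of bursty.
func runMicroBatcher(ctx context.Context, ch <-chan Row, maxRows int, flushEvery time.Duration, flush func([]Row)) {
    buf := make([]Row, 0, maxRows)
    ticker := time.NewTicker(flushEvery)
    defer ticker.Stop()

    emit := func() {
        if len(buf) == 0 {
            return
        }
        flush(buf)
        buf = make([]Row, 0, maxRows)
    }

    for {
        select {
        case <-ctx.Done():
            emit() // final drain on shutdown
            return
        case r := <-ch:
            buf = append(buf, r)
            if len(buf) >= maxRows {
                emit()
            }
        case <-ticker.C:
            emit()
        }
    }
}

Started once per container, e.g. go runMicroBatcher(ctx, rowCh, 300, 100*time.Millisecond, insertBatch), where rowCh and insertBatch are hypothetical names for the caller's channel and Aurora insert.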

Remaining Optimization Opportunities 🚀 Potential Improvements

1. ALB Connection Settings ✅ DEPLOYED

ALB optimized for connection reuse, faster deregistration, and security hardening. Deployed April 26, 2026.

Setting | Before | After | Impact
Idle timeout | 300s (5 min) | 60s | Frees connection slots 5x faster
Client keep-alive | 3600s (1 hr) | 120s | Clients reconnect every 2 min instead of hoarding connections
Deregistration delay (both TGs) | 30s | 10s | Deploys drain 20s faster per service
LRS routing algorithm | round_robin | least_outstanding_requests | Smarter load distribution, matches LPO
Drop invalid headers | disabled | enabled | Security hardening — malformed headers rejected at ALB

2. ElastiCache Pipelining ✅ DEPLOYED

All sequential HGetAll loops replaced with single-round-trip pipelines across 6 handler locations. Deployed April 26, 2026.

// PipelineHGetAll — one round trip instead of N sequential calls
pipe := client.Pipeline()
cmds := make([]*redis.MapStringStringCmd, len(ids))
for i, id := range ids {
    cmds[i] = pipe.HGetAll(ctx, fmt.Sprintf("usage_log:%s", id))
}
// Exec sends every queued HGetAll in a single round trip
if _, err := pipe.Exec(ctx); err != nil {
    return nil, err
}
// cmds[i].Val() now holds the hash for ids[i]

Locations pipelined: LRS Usage Report, LRS Summary Report, LRS Report Usage Detail, LRS Report Usage Summary, UDP Summary, UDP Usage — all 6 sequential loops converted.

Impact: A report returning 50 rows now makes 1 ElastiCache round trip instead of 50. 30-40% latency reduction on LRS reports.

3. Prewarm Optimization SUPERSEDED

This optimization was designed for the REST polling era. It no longer applies — all 150 assets are now served by 6 persistent WebSocket feeds that push prices in real-time.

Exchange | Assets | Feed Type | Latency
Coinbase | BTC, ETH, SOL, DOGE, XRP, LINK, DOT, LTC, AVAX, UNI, PEPE, XLM | WebSocket (push) | 0ms (in-memory)
Gemini | AAVE, ADA, MATIC, ATOM, NEAR, ARB, MKR, CRV, GRT, FIL, SHIB, BAT | WebSocket (push) | 0ms (in-memory)
Kraken | NANO, SC, LSK, KAVA, BICO, RARI, OCEAN, CFG, CQT, ALGO, FET, FLOW | WebSocket (push) | 0ms (in-memory)
Gate.io | BNB, TRX, APT, SEI, INJ, OP, SUI, VET, HBAR, SAND, MANA, FTM | WebSocket (push) | 0ms (in-memory)
Bybit | TON, WLD, APE, BLUR, IMX, ENS, LDO, SNX, COMP, 1INCH, SUSHI, GALA | WebSocket (push) | 0ms (in-memory)
OKX | KAS, TIA, JUP, STRK, PYTH, W, ZRO, PENDLE, ONDO, RENDER, WIF, FLOKI | WebSocket (push) | 0ms (in-memory)

Why it's obsolete: The original proposal called for tiered REST polling intervals (top assets every 5 min, mid every 15 min, low every 30 min) and staggered timing across containers. With 6 WebSocket feeds pushing every trade in real-time, prices arrive before requests — there's nothing to poll and nothing to stagger. PrewarmCache() runs once at startup as a bootstrap, then WebSocket feeds take over permanently. Natural staggering already occurs because each container's 6 WebSocket connections establish at slightly different times during startup.

4. Aurora Scaling Headroom FUTURE

Monitor Aurora ACU usage and adjust max capacity if needed. Current range is 2–18 ACU.

Current Load | ACU Range | Action
Consistently under 5 ACU | 2–18 ACU | ✅ Current — right-sized
Spiking to 18 ACU | 2–32 ACU | ⚠️ Increase max to 32
Sustained at 18 ACU | 2–48 ACU | 🚨 Increase max to 48

Monitor: CloudWatch metric ServerlessDatabaseCapacity
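
A quick way to pull that metric programmatically, assuming aws-sdk-go-v2 and a placeholder cluster identifier (the console or AWS CLI works just as well):

package main

import (
    "context"
    "fmt"
    "time"

    "github.com/aws/aws-sdk-go-v2/aws"
    "github.com/aws/aws-sdk-go-v2/config"
    "github.com/aws/aws-sdk-go-v2/service/cloudwatch"
    "github.com/aws/aws-sdk-go-v2/service/cloudwatch/types"
)

func main() {
    ctx := context.Background()
    cfg, err := config.LoadDefaultConfig(ctx, config.WithRegion("us-east-2"))
    if err != nil {
        panic(err)
    }
    cw := cloudwatch.NewFromConfig(cfg)

    // Average and peak ACU over the last hour in 5-minute buckets.
    out, err := cw.GetMetricStatistics(ctx, &cloudwatch.GetMetricStatisticsInput{
        Namespace:  aws.String("AWS/RDS"),
        MetricName: aws.String("ServerlessDatabaseCapacity"),
        Dimensions: []types.Dimension{{
            Name:  aws.String("DBClusterIdentifier"),
            Value: aws.String("my-aurora-cluster"), // placeholder: use the real cluster ID
        }},
        StartTime:  aws.Time(time.Now().Add(-1 * time.Hour)),
        EndTime:    aws.Time(time.Now()),
        Period:     aws.Int32(300),
        Statistics: []types.Statistic{types.StatisticAverage, types.StatisticMaximum},
    })
    if err != nil {
        panic(err)
    }
    for _, dp := range out.Datapoints {
        fmt.Printf("%s avg=%.1f max=%.1f ACU\n", dp.Timestamp.Format(time.RFC3339), *dp.Average, *dp.Maximum)
    }
}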

5. Task Count Scaling FUTURE

Scale ECS tasks horizontally when traffic increases. Costs reflect 8 vCPU / 32 GB containers.

Traffic Level | Main Tasks | Mirror Tasks | LRS Tasks | Monthly Cost
Current (Low) | 1 | 1 | 1 | $430
Medium (50K QPS) | 2 | 2 | 1 | $670
High (100K QPS) | 3 | 2 | 2 | $970
Very High (200K QPS) | 5 | 3 | 2 | $1,390

Trigger: When CPU > 70% or latency > 100ms consistently

6. ElastiCache Scaling FUTURE

Current node is cache.r7g.2xlarge (52.8 GB). ElastiCache is a pure cache layer — Aurora is the source of truth.

Node Type | Memory | Throughput | Monthly Cost
cache.r7g.2xlarge (current) | 52.8 GB | 400K ops/sec | $637
cache.r7g.4xlarge | 105 GB | 800K ops/sec | ~$1,274
cache.r7g.2xlarge + replica | 52.8 GB × 2 | 400K ops/sec + read replica | ~$1,274

Trigger: When memory > 80% or CPU > 70% consistently

🎯 Recommended Priority

Immediate

All done ✅ — DB pooling, ElastiCache pooling, batch writes, GC tuning, UDP optimizations, worker pool, and system mode toggle all shipped in v3.3.

Short Term (Next 1-2 Weeks)

  1. Monitor v3.3 Metrics - CloudWatch dashboards for Aurora ACU, ElastiCache CPU/memory, ALB latency under real traffic
  2. Tune SQS Pipeline Params - Adjust sqs_flush_ms and sqs_buffer_size via app params if queue depth patterns change
  3. ALB Connection Settings — ✅ Deployed April 26, 2026
  4. ElastiCache Pipelining — ✅ Deployed April 26, 2026

Long Term (Based on Metrics)

  1. Prewarm Strategy — Superseded by 6 real-time WebSocket feeds (150 assets, 0ms latency)
  2. Horizontal Scaling - Add tasks when traffic increases
  3. ElastiCache Upgrade - Move to cache.r7g.4xlarge (or add a replica) when sustained ops/sec approaches 100K

Monitoring & Metrics

Key CloudWatch Metrics to Watch

Aurora Serverless v2

  • ServerlessDatabaseCapacity - Current ACU usage (target: 2-10 ACU normal, up to 18 under stress)
  • DatabaseConnections - Active connections (target: < 450)
  • ReadLatency / WriteLatency - Query performance (target: < 5ms)

ElastiCache

  • CPUUtilization - CPU usage (target: < 70%)
  • DatabaseMemoryUsagePercentage - Memory usage (target: < 80%)
  • CacheHitRate - Cache effectiveness (target: > 85%)
  • NetworkBytesIn / NetworkBytesOut - Throughput

ECS Fargate

  • CPUUtilization - Task CPU usage (target: < 70%)
  • MemoryUtilization - Task memory usage (target: < 85%)

Application Load Balancer

  • TargetResponseTime - Backend latency (target: < 50ms)
  • RequestCount - Traffic volume
  • HealthyHostCount - Available targets (target: = desired count)
  • HTTPCode_Target_5XX_Count - Backend errors (target: 0)

Performance Bottleneck Analysis

Symptom | Likely Cause | Solution
High latency (> 100ms) | All WebSocket feeds down, REST fallback active | Check WS connections in logs, verify Gemini/Coinbase WS endpoints
Low cache hit rate (< 95%) | WebSocket feeds disconnected or stale | Check GEMINI-WS/COINBASE-WS logs, verify network connectivity
High CPU on ECS tasks | Too many concurrent requests | Scale horizontally (add more tasks)
High memory on ECS tasks | Memory leak or large response caching | Review code for leaks, increase task memory
Aurora ACU spiking to max | Heavy database queries or connections | Optimize queries, add connection pooling, increase max ACU
Aurora ACU spiking | SQS consumer Lambda batch size too large or too frequent | Adjust Lambda batch size or batching window in the SQS event source mapping
ElastiCache CPU high | Too many cache operations | Pipelining deployed ✅ — upgrade node type if still high
ElastiCache memory high | Too much cached data | Reduce cache TTL or upgrade node type
ALB 5xx errors | Backend tasks unhealthy or overloaded | Check task logs, scale horizontally

Conclusion

Current Assessment

The Trinity Beast Infrastructure v4.7 is battle-tested at scale. Run 17 validated:

  • 746,374 combined RPS sustained for 30 minutes — 1.34 billion requests with zero degradation
  • 369,600 TCP req/sec and 487,900 UDP req/sec (direct) — 100% success through all 13 concurrency levels
  • 0.3ms TCP avg latency, 0.2ms UDP avg latency
  • 943× improvement from v1.0 baseline across 17 test runs in 19 days
  • 8 vCPU / 32 GB containers — scales from 3 (production) to 9 (proven at scale)
  • 2–18 ACU Aurora range — right-sized with micro-batch write smoothing
  • 6 persistent WebSocket price feeds (Coinbase, Gemini, Kraken, Gate.io, Bybit, OKX) — 150 prewarmed assets
  • 99%+ cache hit rate — virtually every request served from memory
  • ElastiCache-backed API key validation, shared rate limiting, and real-time usage counters
  • v8 UDP engine: SO_REUSEPORT, recvmmsg batch reads, pre-serialized response cache

Recommendation: The system is production-ready and stress-tested well beyond expected traffic. A 3-year Compute Savings Plan is recommended to lock in cost savings on the 8 vCPU / 32 GB Fargate tasks. The remaining optimization opportunities (prewarm strategy, horizontal scaling) are for future scaling — not critical for current operations.

Run 17 eliminated every bottleneck found during stress testing. v4.7 added the v8 UDP engine (SO_REUSEPORT, recvmmsg), dedicated health servers, and 6-exchange WebSocket feeds — the remaining items are future-proofing for horizontal scale.