Operational reference for the Valkey cache layer — key inventory, dependency classification, graceful degradation behavior, disaster recovery procedures, and the Valkey State Reconciler (sync job). This document replaces the former ElastiCache Key Definitions reference.
Core Principle: Valkey is an operational backbone, not just a cache. It hosts the LRS reporting layer, search indexes, translation state, security intelligence, and cluster coordination. Losing Valkey degrades half the platform's features — but never takes the price API offline.
- The price API is served from sync.Map and live exchange WebSocket feeds. Valkey adds speed, never correctness.
- Application code checks for a nil client, uses timeouts on every call, and falls through gracefully when Valkey is unavailable. When Valkey comes back, the system resumes using it immediately; no restart is required.
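The fall-through behavior described above can be sketched as follows. This is a minimal illustration, not the production code: the `Cache` interface, `mapCache`, and `readThrough` are hypothetical names, and the in-memory map stands in for a real Valkey client.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// Cache is a hypothetical stand-in for the Valkey client; only Get is modeled.
type Cache interface {
	Get(ctx context.Context, key string) (string, error)
}

// mapCache is an in-memory Cache used only for this demonstration.
type mapCache map[string]string

func (m mapCache) Get(ctx context.Context, key string) (string, error) {
	if err := ctx.Err(); err != nil {
		return "", err // deadline already exceeded or context cancelled
	}
	v, ok := m[key]
	if !ok {
		return "", errors.New("cache miss")
	}
	return v, nil
}

// readThrough tries the cache under a bounded timeout and falls through to
// the authoritative source when the client is nil, errors, or misses.
// It returns the value and the tier ("cache" or "source") that served it.
func readThrough(c Cache, key string, timeout time.Duration,
	source func(key string) (string, error)) (string, string, error) {
	if c != nil {
		ctx, cancel := context.WithTimeout(context.Background(), timeout)
		defer cancel()
		if v, cerr := c.Get(ctx, key); cerr == nil {
			return v, "cache", nil
		}
	}
	v, err := source(key)
	return v, "source", err
}

func main() {
	source := func(key string) (string, error) { return "value-from-aurora", nil }

	// No client configured: the read goes straight to the source.
	v, tier, _ := readThrough(nil, "app:config", 3*time.Second, source)
	fmt.Println(v, tier)

	// Client available and key present: the cache serves the read.
	c := mapCache{"app:config": "value-from-valkey"}
	v, tier, _ = readThrough(c, "app:config", 3*time.Second, source)
	fmt.Println(v, tier)
}
```

Because availability is checked on every call, nothing needs to be restarted when the cache returns: the next call simply succeeds on the cache tier.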
```mermaid
graph TB
    subgraph "Revenue Path (Zero Valkey Dependency)"
        WS["WebSocket Feeds\n6 Exchanges"] --> SM["sync.Map\nLocal Cache"]
        SM --> API["/price API"]
        API --> Customer
    end
    subgraph "Operational Layer (Valkey-Powered)"
        SM -.->|"flush every 30s"| VK[("Valkey\n52 GB")]
        VK --> LRS[LRS Reports]
        VK --> Search[Full-Text Search]
        VK --> TX[Translation State]
        VK --> HP[Honeypot Queue]
        VK --> AG[Adaptive Governor]
        VK --> CS[Cluster Stats]
    end
    subgraph "Authoritative Sources"
        Aurora[("Aurora PostgreSQL")] -.->|"nightly sync"| VK
        S3[("S3 Bucket")] -.->|"search rebuild"| VK
        Traffic[Live Traffic] -.->|accumulates| VK
    end
    style SM fill:#064e3b,stroke:#10b981,color:#e2e8f0
    style VK fill:#1e3a5f,stroke:#60a5fa,color:#e2e8f0
    style Aurora fill:#451a03,stroke:#f59e0b,color:#e2e8f0
    style API fill:#064e3b,stroke:#10b981,color:#e2e8f0
```
| Property | Value |
|---|---|
| Endpoint | master.trinity-beast-cache.ptsbmm.use2.cache.amazonaws.com:6379 |
| VPC | Data VPC (vpc-0876ee7be3a677f26, 172.31.0.0/16) |
| Client Library | go-redis/v9 UniversalClient |
| Pool Size | 300 per container (5 containers × 300 = 1,500 total) |
| Read Timeout | 3 seconds |
| Write Timeout | 3 seconds |
| Max Retries | 3 |
| Persistence | None (non-persistent ElastiCache — all data is rebuildable) |
| Replication | Single node (no replicas — cost optimization) |
| Encryption | In-transit (TLS 1.2+), at-rest (AWS-managed) |
Why not MemoryDB? We previously ran MemoryDB ($348/mo reserved). Migrated to standard ElastiCache because persistence is unnecessary when every key can be rebuilt from Aurora/S3 sources. The recovery procedures in this document make persistence redundant — the sync job IS the durability layer.
Every Valkey key family is classified by its role in the system and what happens when it disappears.
```mermaid
graph LR
    subgraph "Class A: Performance Acceleration"
        A1["price:*"]
        A2["apikey:*"]
        A3["app:config"]
        A4["report:config"]
        A5["report_count:*"]
        A6["errmsg:*"]
        A7["public:site-assets"]
        A8["tx:params"]
    end
    subgraph "Class B: Operational State"
        B1["usage_logs:* indexes"]
        B2["usage_log:* hashes"]
        B3["search:index:*"]
        B4["report_usage_logs:*"]
        B5["docs:registry:*"]
        B6["report:text:*"]
    end
    subgraph "Class C: Coordination"
        C1["cluster:stats:*"]
        C2["adaptive:*"]
        C3["newsletter:lock:*"]
        C4["digest:lock:*"]
        C5["receipt:session:*"]
    end
    subgraph "Class D: Intelligence"
        D1["honeypot:*"]
        D2["autoops:threats:daily"]
        D3["autoops:support:*"]
        D4["autoops:bedrock:spend:daily"]
    end
    style A1 fill:#064e3b,stroke:#10b981,color:#e2e8f0
    style A2 fill:#064e3b,stroke:#10b981,color:#e2e8f0
    style B1 fill:#1e3a5f,stroke:#60a5fa,color:#e2e8f0
    style B3 fill:#1e3a5f,stroke:#60a5fa,color:#e2e8f0
    style C1 fill:#2e1065,stroke:#a78bfa,color:#e2e8f0
    style D1 fill:#451a03,stroke:#f59e0b,color:#e2e8f0
```
| Class | Role | On Loss | Recovery |
|---|---|---|---|
| A | Performance Acceleration | Slower (falls through to Aurora/S3/live source) | Automatic (next read populates via write-through) |
| B | Operational State | Features unavailable (LRS reports, search, doc registry) | Sync Job (full rebuild in one run) |
| C | Coordination | Local-only operation (per-container, not cluster-wide) | Self-healing (TTL expiry or next write) |
| D | Intelligence | Accumulated data lost (honeypot history, support gaps) | Accumulates from live traffic — no rebuild needed |
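As a rough illustration of how the classification could be applied in tooling, the sketch below maps a key to its class by prefix. The prefix list is copied from the classification diagram only; the `classify` function, the `classInfo` type, and the handling of the leading `{` hash-tag brace on adaptive keys are assumptions made for this example.

```go
package main

import (
	"fmt"
	"strings"
)

// classInfo summarizes one row of the classification table.
type classInfo struct {
	Class    string
	Role     string
	Recovery string
}

var classes = map[string]classInfo{
	"A": {"A", "Performance Acceleration", "Automatic on next read"},
	"B": {"B", "Operational State", "Sync Job (full rebuild in one run)"},
	"C": {"C", "Coordination", "Self-healing (TTL expiry or next write)"},
	"D": {"D", "Intelligence", "Accumulates from live traffic"},
}

// prefixes lists the key families from the classification diagram.
var prefixes = []struct{ prefix, class string }{
	{"price:", "A"}, {"apikey:", "A"}, {"app:config", "A"}, {"report:config", "A"},
	{"report_count:", "A"}, {"errmsg:", "A"}, {"public:site-assets", "A"}, {"tx:params", "A"},
	{"usage_logs:", "B"}, {"usage_log:", "B"}, {"search:index:", "B"},
	{"report_usage_logs:", "B"}, {"docs:registry:", "B"}, {"report:text:", "B"},
	{"cluster:stats:", "C"}, {"adaptive:", "C"}, {"newsletter:lock:", "C"},
	{"digest:lock:", "C"}, {"receipt:session:", "C"},
	{"honeypot:", "D"}, {"autoops:threats:daily", "D"}, {"autoops:support:", "D"},
	{"autoops:bedrock:spend:daily", "D"},
}

// classify returns the class for a key, or false when no family matches.
// A leading "{" (hash-tag form, e.g. {adaptive:name}:throttle) is ignored.
func classify(key string) (classInfo, bool) {
	k := strings.TrimPrefix(key, "{")
	for _, p := range prefixes {
		if strings.HasPrefix(k, p.prefix) {
			return classes[p.class], true
		}
	}
	return classInfo{}, false
}

func main() {
	for _, key := range []string{"price:BTC", "usage_logs:index", "{adaptive:main}:throttle", "honeypot:log"} {
		if c, ok := classify(key); ok {
			fmt.Printf("%s -> Class %s (%s); recovery: %s\n", key, c.Class, c.Role, c.Recovery)
		}
	}
}
```

Keys that appear only in the inventory tables and not in the diagram come back unclassified here, which makes gaps between the two easy to spot.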
| Key Pattern | Type | TTL | Writer | Reader | Authoritative Source |
|---|---|---|---|---|---|
| price:{ASSET} | STRING (JSON) | 93 days | PriceEngine flush (30s cycle), Kraken prewarm | PriceHandler, BatchHandler | sync.Map (WebSocket feeds) |
| apikey:{api_key} | HASH | None (refreshed by sync) | Sync Job, LPO write-through | API key middleware | Aurora api_keys table |
| app:config | HASH | None | Sync Job, LPO write-through | ParamLoader (5-min poll) | Aurora application_parameters |
| report:config | HASH | None | Sync Job | LRS LoadReportParameters | Aurora report_parameters |
| report_count:{uuid} | HASH | 24 hours | LRS counter (write-through) | LRS CheckLimits | Aurora report_count |
| errmsg:{lang}:{key} | STRING | None | Sync Job | GetErrorMessage | Aurora error_messages |
| public:site-assets | STRING (JSON) | 1 hour | Site assets handler | Same | S3 bucket listing |
| tx:params | HASH | 27 hours | Sync Job | Translation service | Aurora translation_parameters |
| translation:cost_per_chunk* | STRING | 48 hours | Sync Job | Translation quote endpoint | Computed from Aurora actuals |
| Key Pattern | Type | TTL | Writer | Reader | Count |
|---|---|---|---|---|---|
| usage_logs:index | SORTED SET | None (pruned to 93 days) | Sync Job | LRS handlers | ~39,000 members |
| usage_logs:api_key:{id} | SORTED SET | None | Sync Job | LRS handlers | Per API key |
| usage_logs:asset:{ASSET} | SORTED SET | None | Sync Job | LRS handlers | Per asset |
| usage_log:{uuid} | HASH | None | Sync Job | LRS PipelineHGetAll | ~39,000 keys |
| report_usage_logs:index | SORTED SET | None | Sync Job + LRS logger | LRS detail/summary handlers | Variable |
| report_usage_logs:api_key:{id} | SORTED SET | None | Sync Job + LRS logger | LRS handlers | Per API key |
| report_usage_log:{uuid} | HASH | None | Sync Job + LRS logger | LRS PipelineHGetAll | Variable |
| search:index:{lang} | STRING (JSON) | None | BuildSearchIndex handler | Search handler | 12 keys (~500 KB each) |
| docs:registry:{file} | STRING (JSON) | None | Doc registry handler | Doc registry endpoints | ~37 keys |
| docs:registry:index | SET | None | Doc registry handler | Doc registry list | 1 key |
| docs:pending:translation | SORTED SET | None | Doc publish handler | translate-pending command | 1 key |
| docs:session:log | LIST | None | session-close handler | Admin endpoint | 1 key |
| report:text:{YYYY-MM-DD} | STRING | 30 days | Sync Job (from S3) | Digest Lambda (newsletter) | ~30 keys |
| Key Pattern | Type | TTL | Writer | Reader |
|---|---|---|---|---|
| cluster:stats:{NodeName} | STRING (JSON) | 30 seconds | Metrics publisher (every 3s) | ClusterStatsHandler |
| {adaptive:{name}}:successes | STRING (counter) | 60 seconds | Adaptive governor syncLoop | Governor readThrottleState |
| {adaptive:{name}}:total | STRING (counter) | 60 seconds | Adaptive governor syncLoop | Governor readThrottleState |
| {adaptive:{name}}:throttle | STRING | 30 seconds | Adaptive governor | All containers (coordinated throttle) |
| newsletter:lock:{year}-W{week} | STRING | 30 minutes | Digest Lambda (SET NX) | Same (dedup check) |
| digest:lock:{type}:{date} | STRING | 30 minutes | Digest Lambda (SET NX) | Same (dedup check) |
| receipt:session:{sessionID} | STRING | 1 hour | Receipt Lambda | Same (dedup check) |
| usage:counter:{apikey} | HASH | 48 hours | Price handler | LRS real-time stats |
| sync:last_timestamp | STRING | None | Sync Job | Sync Job (high-water mark) |
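The dedup locks in the table rely on SET NX with a TTL: the first writer creates the key, and later attempts within the TTL see it and skip the work. The sketch below models that behavior with an in-memory store rather than a real Valkey call. `lockStore`, `setNX`, and `newsletterLockKey` are hypothetical names, and zero-padding the ISO week number is an assumption about the key format.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// lockTTL matches the 30-minute TTL listed for newsletter:lock:* keys.
const lockTTL = 30 * time.Minute

// lockStore is an in-memory stand-in for SET key value NX EX ttl.
type lockStore struct {
	mu      sync.Mutex
	expires map[string]time.Time
}

func newLockStore() *lockStore {
	return &lockStore{expires: map[string]time.Time{}}
}

// setNX stores the key only if it is absent or expired and reports
// whether this caller acquired the lock.
func (s *lockStore) setNX(key string, ttl time.Duration, now time.Time) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if exp, ok := s.expires[key]; ok && now.Before(exp) {
		return false // lock still held
	}
	s.expires[key] = now.Add(ttl)
	return true
}

// newsletterLockKey builds newsletter:lock:{year}-W{week} from the ISO week.
// Zero-padding the week is an assumption, not confirmed by the source.
func newsletterLockKey(t time.Time) string {
	year, week := t.ISOWeek()
	return fmt.Sprintf("newsletter:lock:%d-W%02d", year, week)
}

// exampleNow is an arbitrary fixed timestamp so the demo is deterministic.
func exampleNow() time.Time {
	return time.Date(2026, time.June, 3, 1, 0, 0, 0, time.UTC)
}

func main() {
	store := newLockStore()
	now := exampleNow()
	key := newsletterLockKey(now)

	fmt.Println(key, "first attempt acquired:", store.setNX(key, lockTTL, now))
	fmt.Println(key, "second attempt acquired:", store.setNX(key, lockTTL, now.Add(time.Minute)))
}
```

When the store itself is unreachable the lock check cannot answer at all, so the caller has to choose between skipping the send and proceeding without dedup.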
| Key Pattern | Type | TTL | Writer | Reader |
|---|---|---|---|---|
| honeypot:ip:{ip} | HASH | None | Honeypot handler | Stats, Bedrock analyzer |
| honeypot:log | SORTED SET | Trimmed to 7 days | Honeypot handler | Bedrock analyzer, stats |
| honeypot:autoblock_queue | LIST | None (consumed by processor) | Honeypot handler (LPUSH) | Honeypot processor Lambda (RPOP) |
| honeypot:blocked_ips | SET | None | Honeypot handler | Stats endpoint |
| autoops:threats:daily | STRING (JSON) | Overwritten every 5 min | Bedrock analyze Lambda | KCC threat-status |
| autoops:support:knowledge:b64 | STRING | None | Admin push / Sync Job | Rhema Lambda |
| autoops:support:gaps | SORTED SET | 30 days (EXPIRE) | Rhema Lambda, Rhema API | Digest Lambda (weekly) |
| autoops:support:weekly | LIST | 8 days (EXPIRE) | Rhema Lambda | Digest Lambda |
| autoops:bedrock:spend:daily | STRING (counter) | 24 hours | Translation engine (INCRBY) | Translation submit (cost cap) |
| kcc:daily | STRING (JSON) | 24 hours | KCC daily-collect | KCC daily-render |
| Key Pattern | Type | TTL | Writer | Reader |
|---|---|---|---|---|
| tx:job:{id} | STRING (JSON) | 24 hours | Translation handlers | Status poll endpoint |
| tx:active | SET | None | Translation submit/cancel | Queue check, status |
| tx:history | LIST | None (pruned) | Translation finalize | History endpoint |
| tx:idempotency:{key} | STRING | 24 hours | Translation submit | Same (dedup check) |
What happens to each feature when Valkey is unavailable:
| Feature | Without Valkey | User Impact | Risk |
|---|---|---|---|
| Price API (/price, /prices) | Bypasses L2 cache, serves from sync.Map (L1) or live exchange (L3) | None (same data, +50ms latency on cache miss) | None |
| API Key Validation | Falls through to Aurora query | +5ms per request until local cache warms | None |
| Application Parameters | Falls through to Aurora | None — same data | None |
| LRS Usage Reports | Returns 503 with clear message | Reports unavailable — restored on next sync | Critical |
| LRS Report Usage Detail | Returns 503 | Report history unavailable | Critical |
| Full-Text Search | Returns stale in-memory cache (5-min window), then empty | Search broken until rebuild | High |
| Error Messages (i18n) | Falls through to English → hardcoded fallback | Non-English users see English errors | Low |
| Cluster Stats | Nodes missing from response | Admin dashboard shows partial data | Low |
| Adaptive Governor | Falls back to per-container local counters | No cluster-wide coordination — each container throttles independently | Medium |
| Honeypot System | Stops accumulating hit data; auto-block queue stalls | Scanners not blocked until Valkey returns (WAF existing blocks persist) | Medium |
| Translation Real-Time Progress | Status polls return 404; fall back to Aurora for completed jobs | No live progress bar — check back when done | Low |
| Document Registry | Admin doc management endpoints return empty | Doc workflow broken, no data loss (S3 is source) | Medium |
| Newsletter Dedup Locks | Lock check fails → proceeds without dedup | Possible duplicate email send (unlikely) | Low |
| Rhema Knowledge Base | Rhema operates without context (hardcoded fallback) | Support responses less accurate | Medium |
| Bedrock Spend Cap | Counter resets to 0 — cap loses memory | Translation could temporarily exceed $600/day soft cap | Medium |
| Webhook Delivery | No Valkey dependency — reads from local sync.Map | None | None |
Key Insight: The price API, webhook delivery, API key validation, and application parameters — the four pillars of the revenue path — are ALL resilient to Valkey loss. The platform continues to serve customers. Only internal tooling and reporting degrade.
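For features that cannot fall through, such as the LRS reports, the table describes a 503 with a clear message. A minimal sketch of that pattern follows, using only the standard library. The `/lrs/reports` path, the `pinger` type, the handler name, and the response body wording are all hypothetical.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// pinger is a hypothetical health probe for Valkey (for example, a PING call).
type pinger func() error

// errDown simulates an unreachable Valkey node in the demo.
var errDown = errors.New("valkey unreachable")

// reportsHandler returns 503 with an explanatory JSON body when the cache
// is unreachable, and a placeholder 200 response otherwise.
func reportsHandler(ping pinger) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		if err := ping(); err != nil {
			w.WriteHeader(http.StatusServiceUnavailable)
			json.NewEncoder(w).Encode(map[string]string{
				"error": "reports temporarily unavailable; they return after the next sync",
			})
			return
		}
		json.NewEncoder(w).Encode(map[string]string{"status": "ok"})
	}
}

// statusFor runs the handler once in-process and returns the HTTP status code.
func statusFor(ping pinger) int {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, "/lrs/reports", nil)
	reportsHandler(ping)(rec, req)
	return rec.Code
}

func main() {
	fmt.Println("valkey up:  ", statusFor(func() error { return nil }))
	fmt.Println("valkey down:", statusFor(func() error { return errDown }))
}
```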
BeastReconciler (trinity-beast-sync-job, 1 AM EST nightly) is the single recovery mechanism for Valkey. It is a first-class member of the ECS cluster alongside BeastMain, BeastMirror, BeastLRS, BeastWebhook, and BeastTranslate. After a cold start (new node, failover, or recovery), running BeastReconciler once restores 100% of operational state.
```mermaid
flowchart TD
    Start([BeastReconciler Start]) --> CheckFirst{"First Run?<br/>usage_logs:index missing?"}
    CheckFirst -->|"Yes - Cold Start"| Full["Full Historical Load<br/>93 days from Aurora"]
    CheckFirst -->|"No - Incremental"| Inc["Incremental Load<br/>since last_timestamp"]
    Full --> Prune["Prune Old Data<br/>remove entries > 93 days"]
    Inc --> Prune
    Prune --> RUL["Sync Report Usage Logs<br/>Aurora to Valkey"]
    RUL --> Keys["Sync API Keys<br/>Aurora to Valkey hashes"]
    Keys --> Params["Sync App Params<br/>Aurora to app:config hash"]
    Params --> ErrMsg["Sync Error Messages<br/>Aurora to errmsg:* strings"]
    ErrMsg --> TxCost["Sync Translation Costs<br/>Calculate averages"]
    TxCost --> TxParams["Sync Translation Params<br/>Aurora to tx:params hash"]
    TxParams --> Rhema["Sync Rhema Knowledge<br/>S3 to autoops:support:knowledge:b64"]
    Rhema --> DocReg["Sync Doc Registry<br/>S3 listing to docs:registry:*"]
    DocReg --> Reports["Generate Report Text<br/>S3 HTML to report:text:*"]
    Reports --> Search["Rebuild Search Index<br/>CloudFront to search:index:*"]
    Search --> Health["Verify Valkey Health<br/>PING + DBSIZE + baseline check"]
    Health --> End([Complete])
    style Full fill:#7f1d1d,stroke:#fca5a5,color:#e2e8f0
    style Inc fill:#064e3b,stroke:#10b981,color:#e2e8f0
    style Rhema fill:#2e1065,stroke:#a78bfa,color:#e2e8f0
    style DocReg fill:#2e1065,stroke:#a78bfa,color:#e2e8f0
    style Health fill:#1e3a5f,stroke:#60a5fa,color:#e2e8f0
```
| # | Function | Source | Target Keys |
|---|---|---|---|
| 1 | syncHistorical / syncIncremental | Aurora usage_logs | usage_logs:index, usage_logs:api_key:*, usage_logs:asset:*, usage_log:* |
| 2 | pruneOldData | Valkey (time-based removal) | All usage_logs:* keys |
| 3 | syncReportUsageLogs | Aurora report_usage_logs | report_usage_logs:*, report_usage_log:* |
| 4 | syncAPIKeys | Aurora api_keys | apikey:* hashes |
| 5 | syncAppParams | Aurora application_parameters | app:config hash |
| 6 | syncErrorMessages | Aurora error_messages | errmsg:{lang}:{key} |
| 7 | syncTranslationCostPerChunk | Aurora (computed averages) | translation:cost_per_chunk* |
| 8 | syncTranslationParams | Aurora translation_parameters | tx:params |
| 9 | resetMonthlyCounts (1st of month) | Aurora (zero counters) | Flush report_count:* |
| 10 | generateDailyReportText | S3 daily-reports/ | report:text:{YYYY-MM-DD} |
| 11 | rebuildSearchIndex | CloudFront docs (all languages) | search:index:{lang} |
| 12 | pruneReportManifest | S3 lifecycle | S3 objects (not Valkey) |
| # | Function | Source | Target Keys | Purpose |
|---|---|---|---|---|
| 13 | syncRhemaKnowledge | S3 rhema/knowledge-base.txt | autoops:support:knowledge:b64 | Ensures Rhema always has her context, even after cold start |
| 14 | syncDocRegistry | S3 docs/ listing + file metadata | docs:registry:*, docs:registry:index | Rebuilds the document lifecycle registry from the S3 source of truth |
| 15 | verifyValkeyHealth | Valkey (PING, DBSIZE, INFO) | Logs only | Post-sync validation — confirms key count meets baseline, flags anomalies |
Recovery Guarantee: After a full BeastReconciler run (FORCE_FULL_SYNC=true), every Class A and Class B key is populated. Class C keys self-heal within 60 seconds of containers resuming operation. Class D keys accumulate from live traffic — no rebuild needed or possible.
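The mode decision and the 93-day retention window can be sketched as below. This is an illustration under stated assumptions, not the reconciler's actual code: `chooseMode`, `pruneCutoff`, and `exampleNow` are hypothetical names, and the sketch assumes a full load is chosen either when FORCE_FULL_SYNC is set or when usage_logs:index is absent.

```go
package main

import (
	"fmt"
	"time"
)

// retentionDays is the usage-log retention window stated in the document.
const retentionDays = 93

// chooseMode picks the load strategy: a full historical load when forced
// or when usage_logs:index is missing, otherwise an incremental load.
func chooseMode(indexExists, forceFull bool) string {
	if forceFull || !indexExists {
		return "full"
	}
	return "incremental"
}

// pruneCutoff returns the oldest timestamp that stays inside the window;
// entries scored before it would be removed from the sorted-set indexes.
func pruneCutoff(now time.Time) time.Time {
	return now.AddDate(0, 0, -retentionDays)
}

// exampleNow is an arbitrary fixed timestamp so the demo is deterministic.
func exampleNow() time.Time {
	return time.Date(2026, time.June, 3, 1, 0, 0, 0, time.UTC)
}

func main() {
	fmt.Println("cold start:       ", chooseMode(false, false))
	fmt.Println("nightly run:      ", chooseMode(true, false))
	fmt.Println("forced full sync: ", chooseMode(true, true))
	fmt.Println("prune before:     ", pruneCutoff(exampleNow()).Format("2006-01-02"))
}
```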
For situations where BeastReconciler hasn't run yet, or you need immediate recovery of specific subsystems:
```bash
# Trigger BeastReconciler immediately (doesn't wait for 1 AM)
aws ecs run-task --cluster trinity-beast-fargate-cluster \
  --task-definition trinity-beast-sync-job \
  --overrides '{"containerOverrides":[{"name":"sync-container","environment":[{"name":"FORCE_FULL_SYNC","value":"true"}]}]}' \
  --network-configuration '...' --region us-east-2

# Rebuild search index only (fast, ~30 seconds)
bash scripts/kcc.sh build-search

# Rebuild doc registry from S3 (future command)
bash scripts/kcc.sh rebuild-doc-registry

# Verify Valkey health and key count
bash scripts/kcc.sh valkey-health
```
```mermaid
sequenceDiagram
    participant Op as Operator
    participant ECS as ECS Containers
    participant BR as BeastReconciler
    participant VK as Valkey - New Node
    participant Aurora as Aurora
    participant S3 as S3
    Note over VK: Valkey node replaced - cold start
    Op->>BR: Trigger FORCE_FULL_SYNC=true
    BR->>Aurora: SELECT * FROM usage_logs (93 days)
    BR->>VK: Batch HSET + ZADD (39K+ entries)
    BR->>Aurora: SELECT * FROM api_keys
    BR->>VK: HSET apikey:* (all active keys)
    BR->>Aurora: SELECT * FROM application_parameters
    BR->>VK: HSET app:config
    BR->>Aurora: SELECT * FROM error_messages
    BR->>VK: SET errmsg:*
    BR->>S3: Read rhema/knowledge-base.txt
    BR->>VK: SET autoops:support:knowledge:b64
    BR->>S3: ListObjects docs/
    BR->>VK: SET docs:registry:* + SADD index
    BR->>S3: Read daily-reports/ (30 days)
    BR->>VK: SET report:text:*
    BR->>ECS: POST /admin/build-search-index
    ECS->>VK: SET search:index:* (12 languages)
    BR->>VK: PING + DBSIZE (verify)
    Note over VK: Full operational state restored
    ECS->>VK: Containers auto-resume writes
```
When ElastiCache replaces the node (maintenance, failure, or manual action):
1. RedisClient calls time out (3s) and return errors. The application falls through to Aurora/sync.Map for all Class A keys.
2. go-redis has MaxRetries: 3 and auto-reconnects. Once the new node is available, existing connections fail but new ones succeed. No container restart is needed.
3. Run BeastReconciler with FORCE_FULL_SYNC=true. Duration: ~4 minutes for 39K usage logs plus all supporting data.
4. Run bash scripts/kcc.sh valkey-health, which confirms DBSIZE meets baseline (~42,000 keys).

- No FORCE_FULL_SYNC needed when the high-water mark still exists; the job detects it.
- DBSIZE below baseline, or a specific subsystem returning unexpected errors.
- EXISTS usage_logs:index, EXISTS app:config, DBSIZE.

| Scenario | Time to Full Recovery | Customer Impact During |
|---|---|---|
| Node replacement (planned) | ~5 minutes (new node + sync) | None (price API unaffected) |
| Network blip (< 60s) | Instant (auto-reconnect) | None |
| Extended outage (> 5 min) | Immediate on return + sync for gap fill | LRS reports unavailable |
| Full data loss (cold start) | ~4 minutes (BeastReconciler full run) | LRS + search unavailable until reconciliation completes |
| Alarm | Metric | Threshold | Action |
|---|---|---|---|
| Trinity-Beast-ElastiCache-CPU-High | CPUUtilization | > 80% for 5 min | SNS → AutoOps notify |
| Trinity-Beast-ElastiCache-Memory-High | DatabaseMemoryUsagePercentage | > 80% for 5 min | SNS → AutoOps notify |
- /lrs/health: calls PING on Valkey. Returns 503 if unreachable.
- POST /admin/valkey with {"command":"PING"}: direct connectivity test.

| Metric | Expected Range | Concern If |
|---|---|---|
| DBSIZE (total keys) | 40,000 – 45,000 | < 35,000 (data loss) or > 60,000 (leak) |
| Memory Usage | < 1% of 52 GB | > 5% (unexpected growth) |
| CPU | 2–5% | > 20% (hot key or pipeline issue) |
| Hit Rate | 90–95% | < 80% (cold cache or miss pattern) |
| Connected Clients | ~1,500 (5 containers × 300) | < 500 (containers down) or > 2,000 (leak) |
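The "Concern If" thresholds above translate directly into a small check. The sketch below encodes them as written in the table; the `snapshot` type and `concerns` function are hypothetical, and the way the real valkey-health command gathers or reports these numbers is not specified in this document.

```go
package main

import "fmt"

// snapshot holds one reading of the metrics listed in the table.
type snapshot struct {
	DBSize     int     // total keys (DBSIZE)
	MemoryPct  float64 // memory used as a percentage of 52 GB
	CPUPct     float64
	HitRatePct float64
	Clients    int
}

// concerns returns one message per threshold from the table that is crossed.
// An empty result means every metric is outside its "Concern If" range.
func concerns(s snapshot) []string {
	var out []string
	if s.DBSize < 35000 {
		out = append(out, "DBSIZE below 35,000: possible data loss")
	}
	if s.DBSize > 60000 {
		out = append(out, "DBSIZE above 60,000: possible key leak")
	}
	if s.MemoryPct > 5 {
		out = append(out, "memory above 5%: unexpected growth")
	}
	if s.CPUPct > 20 {
		out = append(out, "CPU above 20%: hot key or pipeline issue")
	}
	if s.HitRatePct < 80 {
		out = append(out, "hit rate below 80%: cold cache or miss pattern")
	}
	if s.Clients < 500 {
		out = append(out, "fewer than 500 clients: containers down")
	}
	if s.Clients > 2000 {
		out = append(out, "more than 2,000 clients: connection leak")
	}
	return out
}

func main() {
	healthy := snapshot{DBSize: 43065, MemoryPct: 0.3, CPUPct: 3, HitRatePct: 92, Clients: 1500}
	coldStart := snapshot{DBSize: 120, MemoryPct: 0.01, CPUPct: 2, HitRatePct: 10, Clients: 1500}

	fmt.Println("healthy:", concerns(healthy))
	fmt.Println("cold start:", concerns(coldStart))
}
```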
June 3, 2026: BeastReconciler validated in production. Full state reconciliation of 43,065 keys completed in 2.36 seconds with zero errors. All 15 responsibilities executed successfully. Once the reconciler container is running, reconciliation after a complete Valkey cold start takes under 3 seconds.
- FORCE_FULL_SYNC=true (simulates cold-start recovery by rebuilding everything from scratch)
- ecs run-task with environment override

| Function | Result | Details |
|---|---|---|
| syncHistorical | ✅ | 40,571 usage logs loaded (93 days from Aurora) |
| pruneOldData | ✅ | 0 pruned (all within retention window) |
| syncReportUsageLogs | ✅ | 0 new (high-water mark current) |
| syncAPIKeys | ✅ | 7 active keys written to Valkey hashes |
| syncAppParams | ✅ | 86 parameters → app:config hash |
| syncErrorMessages | ✅ | 252 messages → errmsg:{lang}:{key} |
| syncTranslationCostPerChunk | ✅ | 40 params → tx:params. Cost averages: Haiku $0.0327/chunk (1 pair, clamped to floor), Sonnet 4.6 $0.0570/chunk (230 pairs). Blended: $0.0569/chunk. |
| syncTranslationLogs | ✅ | 243 translation jobs synced with full indexing (global, per-key, per-doc, per-lang, per-model) |
| generateDailyReportText | ✅ | 25 reports already in Valkey (idempotent, no rework) |
| pruneReportManifest | ✅ | 166 entries, all within 30-day window |
| rebuildSearchIndex | ✅ | Accepted (HTTP 202; builds asynchronously in background) |
| syncRhemaKnowledge | ✅ | 37,260 bytes read from S3, written to autoops:support:knowledge:b64 |
| syncDocRegistry | ✅ | 42 documents scanned from S3 docs/ prefix. All 42 already had registry entries (idempotent). |
| verifyValkeyHealth | ✅ | HEALTHY. DBSIZE: 43,065 keys; Memory: 158.44 MB (0.3% of 52 GB capacity) |
With the execution_mode column added to translation_jobs, the per-model cost averages compute correctly from 7-day actuals with floor/ceiling clamping and 9% infrastructure markup.

Conclusion: BeastReconciler is production-ready. A complete Valkey node loss and replacement can be recovered in under 5 minutes (container startup + 2.4s reconciliation). The price API is unaffected throughout, so there is zero customer impact during cache-layer disaster recovery.
Key families that are being phased out or have been removed:
| Key Pattern | Status | Reason | Retirement Date |
|---|---|---|---|
| lang:{code} | Deprecated | Being replaced by pre-rendered translated pages (same folder pattern as doc library: /{lang}/page.html). The i18n JSON API becomes unnecessary when every page exists as a static translated file. Language selection becomes a simple path prefix redirect based on localStorage('cpmp-lang'). | Pending (web page translation batches 3-5) |
| i18n:job:{uuid} | Deprecated | Part of the JSON i18n system being retired alongside lang:{code}. | Pending |
| search:index (legacy, no lang suffix) | Removed | Replaced by per-language indexes search:index:{lang}. | May 2026 |
Architecture Decision (June 3, 2026): The lang:* keys and the /public/lang/{code} API endpoint will be retired entirely once all 33 web pages are translated via the translation engine and deployed to language subfolders on S3. At that point, language selection routes users directly to pre-rendered static HTML — no API calls, no Valkey reads, no runtime text swapping. The JSON i18n system was a bridge; the translated pages are the destination.