High-Level Design: System Design
This document outlines the Unified Event-Driven Infrastructure Platform (referred to as System Design), synthesizing high-scale patterns for rapid notifications, nanosecond-precision timing, distributed counting, and risk-managed traffic migration as practiced at organizations like Netflix and Meta.
1. Architecture
1.1 Goals and Constraints
- High Throughput: Handle 150k+ events/sec at peak.
- Near Real-Time Latency: Deliver notifications and state updates across heterogeneous devices (Mobile, Smart TVs, Web) with minimal delay.
- Linearizable Consistency: Provide nanosecond-precision time synchronization for distributed transaction ordering.
- Zero-Downtime Migration: Enable validation of new architectures via production traffic shadowing.
1.2 Mermaid Diagram: Unified Data Flow
```mermaid
graph TD
    subgraph "Ingestion & Trigger Layer"
        UT[User Triggers] --> EM[Event Management: Manhattan/Mantis]
        ST[System Triggers] --> EM
    end
    subgraph "Time Synchronization (PTP)"
        GNSS[GNSS Antenna] --> TC[Time Card: FPGA/Atomic Clock]
        TC --> PTP4U[ptp4u Unicast Server]
        PTP4U --> Clients[System Nodes: Ordinary Clocks]
    end
    subgraph "Processing & Routing (RENO)"
        EM --> PQ[Priority Queues: SQS]
        PQ --> PC[Priority Clusters]
        PC --> PS[Persistent Store: Cassandra]
        PC --> OBM[Outbound Messaging: Zuul Push/APNS/FCM]
    end
    subgraph "State & Counting (Distributed Counters)"
        PC --> TS[TimeSeries Abstraction]
        TS --> RP[Rollup Pipeline]
        RP --> RC[Rollup Cache: EVCache]
        RC --> PS
    end
    subgraph "Traffic Validation (Replay Testing)"
        PC -- Clone Traffic --> DRS[Dedicated Replay Service]
        DRS -- Replay --> NewSys[New System/Cluster]
        NewSys -- Diff --> Iceberg[(Apache Iceberg)]
    end
    OBM --> Devices[End-User Devices]
    Devices -- Pull/Poll --> PS
```
2. Core Components
2.1 Rapid Event Notification (RENO)
- Responsibility: Orchestrates server-initiated communication.
- Interfaces: Hybrid Push (immediate) and Pull (lifecycle-based) models.
- Scaling: Sharded by event priority and member ID to prevent “thundering herd” issues.
- Failure Behavior: Implements Bulkheaded Delivery. If APNS (Apple) is latent, it does not block FCM (Android) or Zuul Push (Web/TV) delivery.
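The sketch below illustrates the bulkhead pattern described above: one bounded thread pool per outbound channel, so a latent APNS endpoint cannot starve FCM or Zuul Push. The `PushSender` interface, channel names, and pool sizes are illustrative assumptions, not RENO's actual API.

```java
import java.util.Map;
import java.util.concurrent.*;

/**
 * Minimal sketch of bulkheaded delivery: each outbound channel gets its own
 * bounded thread pool, so a slow channel only exhausts its own threads.
 * PushSender and the channel names are illustrative placeholders.
 */
public class BulkheadedDelivery {

    /** Hypothetical per-channel sender (APNS, FCM, Zuul Push). */
    interface PushSender {
        void send(String memberId, String payload) throws Exception;
    }

    // One small, bounded pool per channel = one bulkhead per channel.
    private final Map<String, ExecutorService> bulkheads = Map.of(
            "APNS", newBoundedPool(8),
            "FCM", newBoundedPool(8),
            "ZUUL_PUSH", newBoundedPool(8));

    private static ExecutorService newBoundedPool(int threads) {
        // Bounded queue with a discard-oldest policy: when a channel saturates,
        // its own backlog is trimmed instead of backing up the other channels.
        return new ThreadPoolExecutor(threads, threads, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(1_000),
                new ThreadPoolExecutor.DiscardOldestPolicy());
    }

    /** Fan out one event to every channel; failures stay isolated per channel. */
    public void deliver(String memberId, String payload, Map<String, PushSender> senders) {
        senders.forEach((channel, sender) -> {
            ExecutorService pool = bulkheads.get(channel);
            if (pool == null) return; // unknown channel: skip
            pool.submit(() -> {
                try {
                    sender.send(memberId, payload);
                } catch (Exception e) {
                    // A failure or timeout here only consumes this channel's bulkhead.
                    System.err.printf("delivery via %s failed: %s%n", channel, e.getMessage());
                }
            });
        });
    }
}
```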
2.2 Precision Time Protocol (PTP) Infrastructure
- Responsibility: Replaces NTP to provide nanosecond-level clock synchronization across the fleet.
- Interfaces: `ptp4u` (unicast PTP server) and the `fbclock` API.
- State Model: Leverages a Window of Uncertainty (WOU). Transactions are blocked until `read_timestamp + WOU` has elapsed, guaranteeing linearizability without expensive quorums (see the commit-wait sketch below).
- Recovery: Nodes enter “Holdover Mode” using local atomic clocks if the GNSS signal is lost.
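A minimal sketch of the commit-wait idea behind the WOU, assuming an fbclock-style API that exposes a timestamp together with its uncertainty bound; the `WindowedClock` interface below is a stand-in, not the real `fbclock` signature.

```java
/**
 * Minimal commit-wait sketch using a Window of Uncertainty (WOU).
 * WindowedClock is a hypothetical stand-in for an fbclock-style API.
 */
public class CommitWait {

    /** Hypothetical clock API: a nanosecond timestamp and its uncertainty. */
    interface WindowedClock {
        long nowNanos();          // best-estimate wall-clock time in ns
        long uncertaintyNanos();  // current WOU half-width in ns
    }

    /**
     * Assign a commit timestamp, then wait out the uncertainty window so no
     * other node can later observe a smaller "real" time for this commit.
     * The wait is tiny when the WOU is in the nanosecond-to-microsecond range
     * under PTP, which is why this avoids a quorum round-trip.
     */
    public static long commitTimestamp(WindowedClock clock) throws InterruptedException {
        long commitTs = clock.nowNanos() + clock.uncertaintyNanos();
        // Block until the local "earliest possible now" has passed commitTs.
        while (clock.nowNanos() - clock.uncertaintyNanos() < commitTs) {
            Thread.sleep(0, 1_000); // sleep ~1 microsecond; coarse but illustrative
        }
        return commitTs;
    }
}
```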
2.3 Distributed Counter Abstraction
- Responsibility: Scalable counting (e.g., A/B test facets, view counts).
- Scaling Model:
- Best-Effort: Regional EVCache (Memcached-based) for low-latency, non-critical counts.
- Eventually Consistent: Global event log with a Rollup Pipeline that aggregates events in background batches to avoid Cassandra wide partitions.
- Interfaces: `AddCount(delta, idempotency_token)` and `GetCount()`.
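The sketch below shows the eventually consistent path under the stated interfaces: `AddCount` appends an idempotent delta event, and a background rollup folds pending events into a cached total instead of mutating a single hot row. The in-memory structures stand in for the event log, EVCache, and Cassandra; class and method names are illustrative.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.*;
import java.util.concurrent.atomic.LongAdder;

/**
 * In-memory sketch of the eventually consistent counter path. The queue,
 * set, and map below stand in for the event log, deduplication store,
 * and rollup cache respectively.
 */
public class CounterSketch {

    record CountEvent(String counterId, long delta, String idempotencyToken) {}

    private final BlockingQueue<CountEvent> eventLog = new LinkedBlockingQueue<>();
    private final Set<String> seenTokens = ConcurrentHashMap.newKeySet();
    private final Map<String, LongAdder> rollupCache = new ConcurrentHashMap<>();

    /** Append-only write path: duplicates (same token) are dropped, so retries are safe. */
    public void addCount(String counterId, long delta, String idempotencyToken) {
        if (seenTokens.add(idempotencyToken)) {
            eventLog.add(new CountEvent(counterId, delta, idempotencyToken));
        }
    }

    /** Read path: returns the last rolled-up value (eventually consistent). */
    public long getCount(String counterId) {
        return rollupCache.getOrDefault(counterId, new LongAdder()).sum();
    }

    /** Background rollup: drain the log in batches instead of contending on one hot row per write. */
    public void rollupOnce() {
        CountEvent e;
        while ((e = eventLog.poll()) != null) {
            rollupCache.computeIfAbsent(e.counterId(), k -> new LongAdder()).add(e.delta());
        }
    }
}
```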
2.4 Dedicated Replay Service
- Responsibility: Validates functional correctness of new services using production traffic.
- Interfaces: Asynchronous capture via Mantis stream processing.
- Recovery: Decoupled from the critical path. Failure in the replay service has zero impact on the member experience.
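A minimal sketch of the replay-and-diff step, assuming a simple string-based comparison: known-noisy fields (timestamps here) are normalized before diffing, echoing the timestamp-normalization point in the key takeaways. The `candidateSystem` function stands in for a real HTTP client, and persisting mismatches to Iceberg is only noted in a comment.

```java
import java.util.function.Function;
import java.util.regex.Pattern;

/**
 * Minimal replay-and-diff sketch: compare a captured production response
 * against the new system's response after normalizing expected noise.
 */
public class ReplayDiff {

    // Matches ISO-8601-style timestamps so known-noisy fields don't flag false diffs.
    private static final Pattern TIMESTAMP =
            Pattern.compile("\\d{4}-\\d{2}-\\d{2}T\\d{2}:\\d{2}:\\d{2}(\\.\\d+)?Z?");

    static String normalize(String body) {
        return TIMESTAMP.matcher(body).replaceAll("<ts>");
    }

    /**
     * Replays one captured request against the candidate system and reports
     * whether the normalized responses match. A real pipeline would write
     * mismatches to a table (e.g. Iceberg) for offline comparative analysis.
     */
    static boolean replayAndCompare(String capturedRequest,
                                    String capturedResponse,
                                    Function<String, String> candidateSystem) {
        String candidateResponse = candidateSystem.apply(capturedRequest);
        boolean match = normalize(capturedResponse).equals(normalize(candidateResponse));
        if (!match) {
            System.err.println("diff detected for request: " + capturedRequest);
        }
        return match;
    }

    public static void main(String[] args) {
        // Toy usage: the "new system" echoes with a fresh timestamp; still considered a match.
        boolean ok = replayAndCompare(
                "GET /profile/42",
                "{\"id\":42,\"servedAt\":\"2024-01-01T00:00:00Z\"}",
                req -> "{\"id\":42,\"servedAt\":\"2025-06-30T12:34:56Z\"}");
        System.out.println("match=" + ok);
    }
}
```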
3. Scalability, Reliability, Durability
3.1 Scaling Strategy
- Aggressive Scale-Up: Processing clusters (RENO/Rollup) use scaling policies that favor rapid expansion over gradual contraction to handle sudden traffic spikes.
- Indirection via Manhattan: A single source of events allows sharding traffic into priority-specific SQS queues, isolating critical “Maturity Level” changes from “Diagnostic Signals.”
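As a concrete illustration of this indirection, the sketch below routes each deduplicated, non-stale event to a priority-specific queue. The `Priority` enum, queue names, and `QueuePublisher` interface are placeholders rather than the real SQS integration.

```java
import java.util.Map;

/**
 * Minimal sketch of priority sharding behind the event-management layer:
 * each priority maps to its own queue, so a flood of low-priority diagnostic
 * signals cannot delay critical events.
 */
public class PriorityRouter {

    enum Priority { CRITICAL, DEFAULT, DIAGNOSTIC }

    /** Hypothetical thin wrapper over the actual queue client (e.g. SQS). */
    interface QueuePublisher {
        void publish(String queueName, String eventPayload);
    }

    private static final Map<Priority, String> QUEUE_BY_PRIORITY = Map.of(
            Priority.CRITICAL, "events-critical",
            Priority.DEFAULT, "events-default",
            Priority.DIAGNOSTIC, "events-diagnostic");

    /** Route one event to its priority-specific queue. */
    public static void route(Priority priority, String eventPayload, QueuePublisher publisher) {
        publisher.publish(QUEUE_BY_PRIORITY.get(priority), eventPayload);
    }
}
```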
3.2 Reliability & Fault Tolerance
- Event Age Filter: Stale events (older than a configurable threshold) are filtered out early so that backed-up queues do not flood downstream services (see the sketch after this list).
- Online Registry: Zuul maintains a real-time registry of “Always-on” persistent connections, ensuring RENO only attempts push notifications to reachable devices.
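A minimal sketch of the event age filter referenced above, with the 10-minute threshold chosen purely for illustration:

```java
import java.time.Duration;
import java.time.Instant;

/**
 * Minimal sketch of the event age filter: events older than a configurable
 * threshold are dropped before they reach the processing clusters.
 */
public final class EventAgeFilter {

    private final Duration maxAge;

    public EventAgeFilter(Duration maxAge) {
        this.maxAge = maxAge;
    }

    /** Returns true if the event is still fresh enough to process. */
    public boolean accept(Instant eventCreatedAt, Instant now) {
        return Duration.between(eventCreatedAt, now).compareTo(maxAge) <= 0;
    }

    public static void main(String[] args) {
        EventAgeFilter filter = new EventAgeFilter(Duration.ofMinutes(10));
        Instant now = Instant.now();
        System.out.println(filter.accept(now.minusSeconds(30), now));    // true: fresh
        System.out.println(filter.accept(now.minusSeconds(3_600), now)); // false: stale, dropped
    }
}
```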
3.3 Durability Trade-offs
- Event Sourcing: Every state change is stored as an immutable event. The current state is an Aggregate of these events, allowing for point-in-time “Time Travel” debugging and auditability.
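The sketch below shows the event-sourcing aggregate in miniature: state is folded from immutable change events, and replaying only the events up to a cut-off instant yields a point-in-time view. The `SettingChanged` event shape and the maturity-level example are illustrative assumptions.

```java
import java.time.Instant;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/**
 * Minimal event-sourcing sketch: no mutable "current row" is stored; state is
 * an aggregate folded from immutable change events, and a cut-off instant
 * gives a point-in-time ("time travel") view.
 */
public class EventSourcedProfile {

    record SettingChanged(Instant at, String key, String newValue) {}

    /** Fold events (in order) into the state as of the given instant. */
    static Map<String, String> stateAt(List<SettingChanged> events, Instant asOf) {
        Map<String, String> state = new LinkedHashMap<>();
        for (SettingChanged e : events) {
            if (!e.at().isAfter(asOf)) {
                state.put(e.key(), e.newValue()); // later events overwrite earlier ones
            }
        }
        return state;
    }

    public static void main(String[] args) {
        Instant t0 = Instant.parse("2024-01-01T00:00:00Z");
        List<SettingChanged> log = List.of(
                new SettingChanged(t0, "maturityLevel", "PG"),
                new SettingChanged(t0.plusSeconds(3_600), "maturityLevel", "R"));

        System.out.println(stateAt(log, Instant.now()));       // current state: {maturityLevel=R}
        System.out.println(stateAt(log, t0.plusSeconds(60)));  // time-travel view: {maturityLevel=PG}
    }
}
```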
4. Trade-offs
| Feature | Chosen Approach | Alternative Considered | Trade-off Rationale |
|---|---|---|---|
| Time Sync | PTP (Unicast) | NTP or Boundary Clocks | Boundary clocks increased network complexity; Unicast PTP with Transparent Clocks (TC) reduced asymmetry errors. |
| Counting | Event Log + Rollup | Single-Row CAS Updates | Single-row updates cause heavy contention (hot keys). Rollups provide eventual consistency with high write-throughput. |
| Validation | Dedicated Replay Service | Device-Driven Replay | Device-driven replay wastes mobile resources and delays release cycles; dedicated service is decoupled and safer. |
| Communication | Hybrid Push/Pull | Push-Only | Push-only fails for Smart TVs (offline when not in use). Hybrid ensures consistency across all app lifecycles. |
5. Key Takeaways
- Linearizability at Scale: In modern distributed systems, millisecond precision (NTP) is insufficient. PTP enables high-performance commit-wait protocols that save compute power by avoiding quorums.
- Isolation via Indirection: Use an event management engine (Manhattan/Mantis) as a buffer. This allows for deduplication, staleness filtering, and priority sharding before hitting critical processing clusters.
- Shadowing for Confidence: Moving traffic between microservices requires Comparative Analysis. Normalizing timestamps and using “Lineage” (dependency version tracking) are essential to filter noise in replay testing diffs.
- Idempotency is Non-Negotiable: For both event sourcing and distributed counters, an `idempotency_token` is required to allow safe retries/hedging in unreliable networks.