
System Design (HLD)

Upendra · Jan 17, 2026 · 4 min read

High-Level Design: System Design

This document outlines the Unified Event-Driven Infrastructure Platform (referred to as System Design), synthesizing high-scale patterns for rapid notifications, nanosecond-precision timing, distributed counting, and risk-managed traffic migration as practiced at organizations like Netflix and Meta.


1. Architecture

1.1 Goals and Constraints

  • High Throughput: Handle 150k+ events/sec at peak.
  • Near Real-Time Latency: Deliver notifications and state updates across heterogeneous devices (Mobile, Smart TVs, Web) with minimal delay.
  • Linearizable Consistency: Provide nanosecond-precision time synchronization for distributed transaction ordering.
  • Zero-Downtime Migration: Enable validation of new architectures via production traffic shadowing.

1.2 Mermaid Diagram: Unified Data Flow

graph TD
    subgraph "Ingestion & Trigger Layer"
        UT[User Triggers] --> EM[Event Management: Manhattan/Mantis]
        ST[System Triggers] --> EM
    end

    subgraph "Time Synchronization (PTP)"
        GNSS[GNSS Antenna] --> TC[Time Card: FPGA/Atomic Clock]
        TC --> PTP4U[ptp4u Unicast Server]
        PTP4U --> Clients[System Nodes: Ordinary Clocks]
    end

    subgraph "Processing & Routing (RENO)"
        EM --> PQ[Priority Queues: SQS]
        PQ --> PC[Priority Clusters]
        PC --> PS[Persistent Store: Cassandra]
        PC --> OBM[Outbound Messaging: Zuul Push/APNS/FCM]
    end

    subgraph "State & Counting (Distributed Counters)"
        PC --> TS[TimeSeries Abstraction]
        TS --> RP[Rollup Pipeline]
        RP --> RC[Rollup Cache: EVCache]
        RC --> PS
    end

    subgraph "Traffic Validation (Replay Testing)"
        PC -- Clone Traffic --> DRS[Dedicated Replay Service]
        DRS -- Replay --> NewSys[New System/Cluster]
        NewSys -- Diff --> Iceberg[(Apache Iceberg)]
    end

    OBM --> Devices[End-User Devices]
    Devices -- Pull/Poll --> PS

2. Core Components

2.1 Rapid Event Notification (RENO)

  • Responsibility: Orchestrates server-initiated communication.
  • Interfaces: Hybrid Push (immediate) and Pull (lifecycle-based) models.
  • Scaling: Sharded by event priority and member ID to prevent “thundering herd” issues.
  • Failure Behavior: Implements Bulkheaded Delivery. If APNS (Apple) is latent, it does not block FCM (Android) or Zuul Push (Web/TV) delivery; a minimal sketch follows this list.
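
The sketch below illustrates one way such bulkheading can be implemented: each channel gets its own small bounded thread pool, so a latent provider exhausts only its own pool and never delays the others. The send_apns, send_fcm, and send_zuul_push helpers and the pool sizes are assumptions for illustration, not the production implementation.

# Minimal sketch of bulkheaded delivery, assuming hypothetical per-channel senders.
from concurrent.futures import ThreadPoolExecutor

CHANNEL_POOLS = {
    "apns": ThreadPoolExecutor(max_workers=32),
    "fcm": ThreadPoolExecutor(max_workers=32),
    "zuul_push": ThreadPoolExecutor(max_workers=32),
}

def send_apns(event): ...       # hypothetical vendor client call (Apple)
def send_fcm(event): ...        # hypothetical vendor client call (Android)
def send_zuul_push(event): ...  # hypothetical push over persistent Web/TV connections

SENDERS = {"apns": send_apns, "fcm": send_fcm, "zuul_push": send_zuul_push}

def deliver(event, channels):
    """Fan one notification out; a slow channel only backs up its own pool."""
    futures = {}
    for channel in channels:
        futures[channel] = CHANNEL_POOLS[channel].submit(SENDERS[channel], event)
    return futures  # callers can apply per-channel timeouts instead of joining all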

2.2 Precision Time Protocol (PTP) Infrastructure

  • Responsibility: Replaces NTP to provide nanosecond-level clock synchronization across the fleet.
  • Interfaces: ptp4u (Unicast PTP server) and fbclock API.
  • State Model: Leverages a Window of Uncertainty (WOU). A transaction waits until read_timestamp + WOU has elapsed before committing, which guarantees linearizability without expensive quorums (a commit-wait sketch follows this list).
  • Recovery: Nodes enter “Holdover Mode” using local atomic clocks if GNSS signal is lost.
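
As a rough illustration of commit-wait over a WOU, the sketch below blocks until even the earliest plausible current time has passed read_timestamp + WOU. The fb_clock_now() helper and the 50 µs uncertainty value are placeholders, not the actual fbclock API.

# Minimal commit-wait sketch, assuming a clock source that returns its error bound.
import time

def fb_clock_now():
    # Stand-in: a real implementation would query the fbclock API for the current
    # estimate and its Window of Uncertainty.
    return time.time_ns(), 50_000  # (timestamp_ns, assumed 50 µs WOU)

def commit_wait(read_timestamp_ns, wou_ns):
    # Block until even the earliest possible "true" time is past read_timestamp + WOU,
    # so no other node can later produce an earlier timestamp for this transaction.
    target = read_timestamp_ns + wou_ns
    while True:
        now_ns, now_wou_ns = fb_clock_now()
        if now_ns - now_wou_ns > target:
            return
        time.sleep(max(target - (now_ns - now_wou_ns), 1_000) / 1e9)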

2.3 Distributed Counter Abstraction

  • Responsibility: Scalable counting (e.g., A/B test facets, view counts).
  • Scaling Model:
    • Best-Effort: Regional EVCache (Memcached-based) for low-latency, non-critical counts.
    • Eventually Consistent: Global event log with a Rollup Pipeline that aggregates events in background batches to avoid Cassandra wide partitions.
  • Interfaces: AddCount(delta, idempotency_token) and GetCount(); a minimal sketch follows this list.
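
A minimal sketch of this interface is shown below, with in-memory dictionaries standing in for the Cassandra event log and the EVCache rollup cache; names such as add_count, rollup, and get_count are illustrative, not the production API.

# Minimal sketch of AddCount/GetCount with an idempotency token and a rollup step.
import time

event_log = {}      # (counter, token) -> delta  (stand-in for the global event log)
rollup_cache = {}   # counter -> (aggregated_count, last_rollup_ts)  (stand-in for EVCache)

def add_count(counter, delta, idempotency_token):
    # Duplicate tokens are ignored, so callers can retry or hedge writes safely.
    event_log.setdefault((counter, idempotency_token), delta)

def rollup(counter):
    # Background job: fold all events for this counter into a cached aggregate,
    # keeping individual rows out of the hot read path.
    total = sum(d for (c, _), d in event_log.items() if c == counter)
    rollup_cache[counter] = (total, time.time())

def get_count(counter):
    # Eventually consistent read served from the rollup cache.
    total, _ = rollup_cache.get(counter, (0, 0.0))
    return total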

2.4 Dedicated Replay Service

  • Responsibility: Validates functional correctness of new services using production traffic.
  • Interfaces: Asynchronous capture via Mantis stream processing.
  • Recovery: Decoupled from the critical path. A failure in the replay service has zero impact on the member experience; a capture-and-compare sketch follows this list.
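
The sketch below shows the general shape of off-path capture and compare: the serving path hands a (request, response) pair to a small executor and returns immediately, and any mismatch from the candidate system is recorded for later analysis. call_candidate() and record_diff() are hypothetical helpers, with record_diff standing in for the Iceberg sink in the diagram.

# Minimal sketch of off-path traffic replay; failures here never reach production.
from concurrent.futures import ThreadPoolExecutor

replay_pool = ThreadPoolExecutor(max_workers=8)

def call_candidate(request): ...          # hypothetical: new system/cluster under test
def record_diff(request, old, new): ...   # hypothetical: persist mismatch for analysis

def replay(request, live_response):
    try:
        candidate_response = call_candidate(request)
        if candidate_response != live_response:
            record_diff(request, live_response, candidate_response)
    except Exception:
        pass  # replay problems must never propagate back to the member path

def capture(request, live_response):
    # Called from the serving path; hands the pair off and returns immediately.
    replay_pool.submit(replay, request, live_response)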

3. Scalability, Reliability, Durability

3.1 Scaling Strategy

  • Aggressive Scale-Up: Processing clusters (RENO/Rollup) use scaling policies that scale up rapidly and scale down only gradually, so sudden traffic spikes are absorbed without waiting on slow capacity ramps.
  • Indirection via Manhattan: A single source of events allows sharding traffic into priority-specific SQS queues, isolating critical “Maturity Level” changes from “Diagnostic Signals” (a routing sketch follows this list).
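
One possible routing layer, assuming boto3 and illustrative queue URLs, is sketched below: the event's priority selects a dedicated SQS queue, so diagnostic noise cannot crowd out critical events.

# Minimal sketch of priority sharding into separate SQS queues (URLs are hypothetical).
import json
import boto3

sqs = boto3.client("sqs", region_name="us-east-1")

PRIORITY_QUEUES = {
    "critical": "https://sqs.us-east-1.amazonaws.com/123456789012/events-critical",
    "default": "https://sqs.us-east-1.amazonaws.com/123456789012/events-default",
    "diagnostic": "https://sqs.us-east-1.amazonaws.com/123456789012/events-diagnostic",
}

def route_event(event: dict):
    # Unknown priorities fall back to the default queue rather than being dropped.
    queue_url = PRIORITY_QUEUES.get(event.get("priority"), PRIORITY_QUEUES["default"])
    sqs.send_message(QueueUrl=queue_url, MessageBody=json.dumps(event))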

3.2 Reliability & Fault Tolerance

  • Event Age Filter: Stale events (older than a configurable threshold) are dropped early so backed-up queues cannot flood downstream services.
  • Online Registry: Zuul maintains a real-time registry of “always-on” persistent connections, so RENO only attempts push notifications to devices that are actually reachable. A minimal sketch of both checks follows this list.
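
A minimal sketch of both checks, assuming a hypothetical is_device_connected() lookup backed by the Zuul registry and an illustrative 300-second staleness threshold:

# Minimal sketch of the event-age filter and online-registry check.
import time

MAX_EVENT_AGE_SECONDS = 300  # assumed configurable threshold

def is_device_connected(device_id):
    ...  # hypothetical: look the device up in the Zuul "always-on" connection registry

def should_push(event, device_id):
    # Weed out stale events early so flooded queues cannot overwhelm downstream services.
    if time.time() - event["created_at"] > MAX_EVENT_AGE_SECONDS:
        return False
    # Only push to devices the registry reports as reachable; others pick the state
    # change up via the pull path on their next lifecycle event.
    return is_device_connected(device_id)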

3.3 Durability Trade-offs

  • Event Sourcing: Every state change is stored as an immutable event, and the current state is derived by aggregating those events. This enables point-in-time “Time Travel” debugging and full auditability; a minimal replay sketch follows.
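
The sketch below shows the idea in miniature: state is a fold over the event log, and passing a cutoff timestamp reconstructs any historical state. The event shape and reducer rules are illustrative only.

# Minimal sketch of event sourcing with "time travel" via a cutoff timestamp.
def apply_event(state: dict, event: dict) -> dict:
    # Example reducer: events either set or increment a field; real systems have richer types.
    if event["type"] == "set":
        return {**state, event["field"]: event["value"]}
    if event["type"] == "increment":
        return {**state, event["field"]: state.get(event["field"], 0) + event["value"]}
    return state

def state_at(events, as_of_ts=None):
    """Fold the immutable event log; passing as_of_ts reconstructs any past state."""
    state = {}
    for event in sorted(events, key=lambda e: e["ts"]):
        if as_of_ts is not None and event["ts"] > as_of_ts:
            break
        state = apply_event(state, event)
    return state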

4. Trade-offs

Feature | Chosen Approach | Alternative Considered | Trade-off Rationale
--- | --- | --- | ---
Time Sync | PTP (Unicast) | NTP or Boundary Clocks | Boundary clocks increased network complexity; Unicast PTP with Transparent Clocks (TC) reduced asymmetry errors.
Counting | Event Log + Rollup | Single-Row CAS Updates | Single-row updates cause heavy contention (hot keys); rollups provide eventual consistency with high write throughput.
Validation | Dedicated Replay Service | Device-Driven Replay | Device-driven replay wastes mobile resources and delays release cycles; a dedicated service is decoupled and safer.
Communication | Hybrid Push/Pull | Push-Only | Push-only fails for Smart TVs (offline when not in use); hybrid ensures consistency across all app lifecycles.

5. Key Takeaways

  • Linearizability at Scale: In modern distributed systems, millisecond precision (NTP) is insufficient. PTP enables high-performance commit-wait protocols that save compute power by avoiding quorums.
  • Isolation via Indirection: Use an event management engine (Manhattan/Mantis) as a buffer. This allows for deduplication, staleness filtering, and priority sharding before hitting critical processing clusters.
  • Shadowing for Confidence: Moving traffic between microservices requires Comparative Analysis. Normalizing timestamps and using “Lineage” (dependency version tracking) are essential to filter noise in replay-testing diffs; a normalization sketch follows this list.
  • Idempotency is Non-Negotiable: For both event sourcing and distributed counters, an idempotency_token is required to allow safe retries/hedging in unreliable networks.
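
As a rough illustration of that normalization step (the field names and the lineage format are assumptions), the sketch below masks volatile fields and only flags a diff when both sides ran against the same dependency versions.

# Minimal sketch of diff normalization for replay testing.
VOLATILE_FIELDS = {"timestamp", "request_id", "trace_id"}  # assumed volatile field names

def normalize(response: dict, lineage: dict) -> dict:
    # Drop fields that legitimately differ between runs and attach lineage metadata.
    cleaned = {k: v for k, v in response.items() if k not in VOLATILE_FIELDS}
    cleaned["_lineage"] = lineage  # e.g. {"pricing-svc": "v42"}
    return cleaned

def is_real_mismatch(old, new, old_lineage, new_lineage):
    # Diffs produced under different dependency versions are treated as noise.
    if old_lineage != new_lineage:
        return False
    return normalize(old, old_lineage) != normalize(new, new_lineage)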