Stock Exchange
We’ll design an electronic stock exchange in this chapter.
Its basic function is to efficiently match buyers and sellers.
Major stock exchanges include NYSE and NASDAQ, among others.
Step 1 - Understand the Problem and Establish Design Scope
- C: Which securities are we going to trade? Stocks, options or futures?
- I: Only stocks for simplicity
- C: Which order types are supported - place, cancel, replace? What about limit, market, conditional orders?
- I: We need to support placing and canceling orders. For order types, we only need to consider limit orders.
- C: Does the system need to support after hours trading?
- I: No, just normal trading hours
- C: Could you describe the exchange’s basic functions?
- I: Clients can place or cancel limit orders and receive matched trades in real-time. They should be able to see the order book in real time.
- C: What’s the scale of the exchange?
- I: Tens of thousands of users trading at the same time and ~100 symbols. Billions of orders per day. We need to also support risk checks for compliance.
- C: What kind of risk checks?
- I: Let’s do simple risk checks - eg limiting a user to trading at most 1 million Apple shares per day
- C: How about user wallet management?
- I: We need to ensure clients have sufficient funds before placing orders. Funds meant for pending orders need to be withheld until order is finalized.
Non-functional requirements
The scale mentioned by the interviewer hints that we are to design a small to medium scale exchange. We need to also ensure flexibility to support more symbols and users in the future.
Other non-functional requirements:
- Availability - At least 99.99%. Downtime can harm reputation
- Fault tolerance - fault tolerance and a fast recovery mechanism are needed to limit the impact of a production incident
- Latency - Round-trip latency should be at the millisecond level, with a focus on the 99th percentile. A persistently high p99 latency causes a bad experience for a handful of users.
- Security - We should have an account management system. For legal compliance, we need to support KYC to verify user identity. We should also protect against DDoS for public resources.
Back-of-the-envelope estimation
- 100 symbols, 1bil orders per day
- Normal trading hours are from 09:30 to 16:00 (6.5h)
- QPS = 1bil / 6.5 / 3600 = 43000
- Peak QPS = 5*QPS = 215000
- Trading volume is significantly higher when the market opens
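As a quick sanity check of these numbers, here is the arithmetic in plain Java (the 5x peak factor is the assumption stated above):
class QpsEstimate {
    public static void main(String[] args) {
        long ordersPerDay = 1_000_000_000L;               // 1 billion orders per day
        double tradingSeconds = 6.5 * 3600;               // 09:30 to 16:00
        long qps = Math.round(ordersPerDay / tradingSeconds);  // roughly 43,000
        long peakQps = qps * 5;                           // roughly 5x the average, around market open
        System.out.println("QPS: " + qps + ", peak QPS: " + peakQps);
    }
}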
Step 2 - Propose High-Level Design and Get Buy-In
Business Knowledge 101
Let’s discuss some basic concepts, related to an exchange.
A broker mediates interactions between an exchange and end users - Robinhood, Fidelity, etc.
Institutional clients trade in large quantities using specialized trading software. They need specialized treatment. Eg order splitting when trading in large volumes to avoid impacting the market.
Types of orders:
- Limit - buy or sell at a fixed price. It might not find a match immediately or it might be partially matched.
- Market - doesn’t specify a price. Executed at the current market price immediately.
Prices:
- Bid - the highest price a buyer is willing to pay for a stock
- Ask - the lowest price a seller is willing to accept for a stock
The US market has three tiers of price quotes - L1, L2, L3.
L1 market data contains best bid/ask prices and quantities:
L2 includes more price levels:
L3 shows levels and queued quantity at each level:
A candlestick shows the market open and close price, as well as the highest and lowest prices in the given interval:
FIX is a protocol for exchanging securities transaction information, used by most vendors. Example securities transaction:
8=FIX.4.2 | 9=176 | 35=8 | 49=PHLX | 56=PERS | 52=20071123-05:30:00.000 | 11=ATOMNOCCC9990900 | 20=3 | 150=E | 39=E | 55=MSFT | 167=CS | 54=1 | 38=15 | 40=2 | 44=15 | 58=PHLX EQUITY TESTING | 59=0 | 47=C | 32=0 | 31=0 | 151=15 | 14=0 | 6=0 | 10=128 |
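To make the tag=value structure concrete, here is a minimal sketch that parses such a message into a tag-to-value map (real FIX uses the SOH character 0x01 as the field delimiter; the '|' here mirrors the human-readable form above):
import java.util.LinkedHashMap;
import java.util.Map;

class FixParser {
    static Map<String, String> parse(String msg) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String pair : msg.split("\\|")) {
            String[] kv = pair.trim().split("=", 2);
            if (kv.length == 2) {
                fields.put(kv[0], kv[1]);      // eg "55" -> "MSFT" (tag 55 is the symbol)
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        Map<String, String> fields = parse("8=FIX.4.2 | 35=8 | 55=MSFT | 54=1 | 38=15 | 44=15");
        System.out.println(fields.get("55"));  // prints MSFT
    }
}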
High-level design
Trade flow:
- Client places order via trading interface
- Broker sends the order to the exchange
- Order enters exchange through client gateway, which validates, rate limits, authenticates, etc. Order is forwarded to order manager.
- Order manager performs risk checks based on rules set by the risk manager
- After passing risk checks, order manager verifies there are sufficient funds in the wallet for the order
- Order is sent to matching engine. When match is found, matching engine emits two executions (called fills) for buy and sell. Both orders are sequenced so that they’re deterministic.
- Executions are returned to the client.
Market data flow (M1-M3):
- matching engine generates a stream of executions, sent to the market data publisher
- Market data publisher constructs the candlestick charts and sends them to the data service
- Market data is stored in specialized storage for real-time analytics. Brokers connect to the data service for timely market data.
Reporter flow (R1-R2):
- reporter collects all necessary reporting fields from orders and executions and writes them to DB
- reporting fields - client_id, price, quantity, order_type, filled_quantity, remaining_quantity
The trading flow is on the critical path, whereas the other flows are not; hence, their latency requirements differ.
Trading flow
The trading flow is on the critical path, hence, it should be highly optimized for low latency.
At its heart is the matching engine, also called the cross engine. Primary responsibilities:
- Maintain the order book for each symbol - a list of buy/sell orders for a symbol.
- Match buy and sell orders - a match results in two executions (fills), with one each for the buy and sell sides. This function must be fast and accurate
- Distribute the execution stream as market data
- Matches must be produced in a deterministic order. This determinism is foundational for high availability
Next is the sequencer - it is the key component making the matching engine deterministic by stamping each inbound order and outbound fill with a sequence ID.
We stamp inbound orders and outbound fills for several reasons:
- timeliness and fairness
- fast recovery/replay
- exactly-once guarantee
Conceptually, we could use Kafka as our sequencer since it’s effectively an inbound and outbound message queue. However, we’re going to implement it ourselves in order to achieve lower latency.
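A minimal sketch of the stamping idea, assuming a single-threaded sequencer that assigns separate sequence IDs to inbound orders and outbound fills (class and field names are illustrative):
class SequencedMessage {
    final long sequenceId;       // monotonically increasing, gap-free
    final long ingestTimeNanos;  // when the sequencer saw the message
    final Object payload;        // the order or the fill being stamped

    SequencedMessage(long sequenceId, long ingestTimeNanos, Object payload) {
        this.sequenceId = sequenceId;
        this.ingestTimeNanos = ingestTimeNanos;
        this.payload = payload;
    }
}

class Sequencer {
    private long nextInbound = 0;   // sequence for orders going into the matching engine
    private long nextOutbound = 0;  // sequence for fills coming out of it

    SequencedMessage stampInbound(Object order) {
        return new SequencedMessage(nextInbound++, System.nanoTime(), order);
    }

    SequencedMessage stampOutbound(Object fill) {
        return new SequencedMessage(nextOutbound++, System.nanoTime(), fill);
    }
}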
The order manager manages the state of orders. It also interacts with the matching engine - sending orders and receiving fills.
The order manager’s responsibilities:
- Sends orders for risk checks - eg verifying user’s trade volume is less than 1mil
- Checks the order against the user wallet and verifies there are sufficient funds to execute it
- It sends the order to the sequencer and on to the matching engine. To reduce bandwidth, only necessary order information is passed to the matching engine
- Executions (fills) are received back from the sequencer and are then sent to the brokers via the client gateway
The main challenge with implementing the order manager is the state transition management. Event sourcing is one viable solution (discussed in deep dive).
Finally, the client gateway receives orders from users and sends them to the order manager. Its responsibilities include authentication, validation, and rate limiting, as mentioned in the trade flow.
Since the client gateway is on the critical path, it should stay lightweight.
There can be multiple client gateways for different clients. Eg a colo engine is a trading engine server, rented by the broker in the exchange’s data center:
Market data flow
The market data publisher receives executions from the matching engine and builds the order book/candlestick charts from the execution stream.
That data is sent to the data service, which is responsible for showing the aggregated data to subscribers:
Reporting flow
The reporter is not on the critical path, but it is an important component nevertheless.
It is responsible for trading history, tax reporting, compliance reporting, settlements, etc. Latency is not a critical requirement for the reporting flow. Accuracy and compliance are more important.
API Design
Clients interact with the stock exchange via the brokers to place orders, view executions, market data, download historical data for analysis, etc.
We use a RESTful API for communication between the client gateway and the brokers.
For institutional clients, a proprietary protocol is used to satisfy their low-latency requirements.
Create order:
POST /v1/order
Parameters:
- symbol - the stock symbol. String
- side - buy or sell. String
- price - the price of the limit order. Long
- orderType - limit or market (we only support limit orders in our design). String
- quantity - the quantity of the order. Long
Response:
- id - the ID of the order. Long
- creationTime - the system creation time of the order. Long
- filledQuantity - the quantity that has been successfully executed. Long
- remainingQuantity - the quantity still to be executed. Long
- status - new/canceled/filled. String
- rest of the attributes are the same as the input parameters
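For illustration, placing an order from a broker’s side might look like the following; the host is a placeholder and the JSON body simply mirrors the parameters above (price is assumed to be in minor units):
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

class PlaceOrderExample {
    public static void main(String[] args) throws Exception {
        String body = """
            {"symbol": "AAPL", "side": "buy", "price": 15000,
             "orderType": "limit", "quantity": 100}""";
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://exchange.example.com/v1/order"))  // placeholder host
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}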
Get execution:
GET /execution?symbol={:symbol}&orderId={:orderId}&startTime={:startTime}&endTime={:endTime}
Parameters:
- symbol - the stock symbol. String
- orderId - the ID of the order. Optional. String
- startTime - query start time in epoch [11]. Long
- endTime - query end time in epoch. Long
Response:
- executions - array with each execution in scope (see attributes below). Array
- id - the ID of the execution. Long
- orderId - the ID of the order. Long
- symbol - the stock symbol. String
- side - buy or sell. String
- price - the price of the execution. Long
- orderType - limit or market. String
- quantity - the filled quantity. Long
Get order book:
GET /marketdata/orderBook/L2?symbol={:symbol}&depth={:depth}
Parameters:
- symbol - the stock symbol. String
- depth - order book depth per side. Int
Response:
- bids - array with price and size. Array
- asks - array with price and size. Array
Get candlesticks:
GET /marketdata/candles?symbol={:symbol}&resolution={:resolution}&startTime={:startTime}&endTime={:endTime}
Parameters:
- symbol - the stock symbol. String
- resolution - window length of the candlestick chart in seconds. Long
- startTime - start time of the window in epoch. Long
- endTime - end time of the window in epoch. Long
Response:
- candles - array with each candlestick data (attributes listed below). Array
- open - open price of each candlestick. Double
- close - close price of each candlestick. Double
- high - high price of each candlestick. Double
- low - low price of each candlestick. Double
Data models
There are three main types of data in our exchange:
- Product, order, execution
- order book
- candlestick chart
Product, order, execution
Products describe the attributes of a traded symbol - product type, trading symbol, UI display symbol, etc.
This data doesn’t change frequently; it is primarily used for rendering in a UI.
An order is an inbound instruction to buy or sell. An execution is the outbound matched result (a fill).
Here’s the data model:
We encounter orders and executions in all of our three flows:
- In the critical path, they are processed in memory for high performance; they are stored in and recovered from the sequencer
- The reporter writes orders and executions to the database for reporting use-cases
- Executions are forwarded to market data to reconstruct the order book and candlestick chart
Order book
The order book is a list of buy/sell orders for an instrument, organized by price level.
An efficient data structure for this model needs to satisfy:
- constant lookup time - getting volume at price level or between price levels
- fast add/execute/cancel operations
- query best bid/ask price
- iterate through price levels
Example order book execution:
After fulfilling this large order, the price increases as the bid/ask spread widens.
Example order book implementation in pseudo code:
class PriceLevel {
    private Price limitPrice;        // the price of this level
    private long totalVolume;        // total quantity resting at this level
    private List<Order> orders;      // orders at this level in FIFO (time-priority) order
}
class Book<Side> {
    private Side side;                        // buy or sell
    private Map<Price, PriceLevel> limitMap;  // price -> price level
}
class OrderBook {
    private Book<Buy> buyBook;
    private Book<Sell> sellBook;
    private PriceLevel bestBid;               // cached best bid level
    private PriceLevel bestOffer;             // cached best ask level
    private Map<OrderID, Order> orderMap;     // orderId -> order, for O(1) lookup on cancel
}
For a more efficient implementation, we can use a doubly-linked list instead of a standard list:
- Placing a new order is O(1), because we’re adding an order to the tail of the list.
- Matching an order is O(1), because we are deleting an order from the head
- Canceling an order means deleting it from the order book. We utilize orderMap for an O(1) lookup, and the delete is O(1) because the Order holds a reference to its previous element in the list.
This data structure is also used in the market data services to reconstruct the order book.
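As a sketch of how add and cancel stay O(1) per price level, one idiomatic Java alternative to an intrusive doubly-linked list is a LinkedHashMap per level, which preserves FIFO order and supports O(1) removal by order ID (types are simplified and this is not the chapter’s exact implementation; locating the level itself in the TreeMap is O(log levels)):
import java.util.Comparator;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

class SimpleOrder {
    long orderId;
    long price;
    long quantity;
}

class SimpleBuyBook {
    // Price levels sorted from best (highest) bid to worst; each level keeps FIFO order.
    private final TreeMap<Long, LinkedHashMap<Long, SimpleOrder>> levels =
            new TreeMap<>(Comparator.reverseOrder());
    // Global index for O(1) cancel lookups.
    private final Map<Long, SimpleOrder> orderMap = new HashMap<>();

    void add(SimpleOrder order) {
        levels.computeIfAbsent(order.price, p -> new LinkedHashMap<>())
              .put(order.orderId, order);       // appended at the tail of the level
        orderMap.put(order.orderId, order);
    }

    boolean cancel(long orderId) {
        SimpleOrder order = orderMap.remove(orderId);
        if (order == null) return false;        // already matched or unknown
        LinkedHashMap<Long, SimpleOrder> level = levels.get(order.price);
        level.remove(orderId);                  // O(1) removal from the level
        if (level.isEmpty()) levels.remove(order.price);
        return true;
    }

    Long bestBid() {
        return levels.isEmpty() ? null : levels.firstKey();
    }
}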
Candlestick chart
The candlestick data is calculated within the market data services by processing executions within a time interval:
class Candlestick {
    private long openPrice;     // first trade price in the interval
    private long closePrice;    // last trade price in the interval
    private long highPrice;     // highest trade price in the interval
    private long lowPrice;      // lowest trade price in the interval
    private long volume;        // total traded quantity in the interval
    private long timestamp;     // start time of the interval
    private int interval;       // interval length, eg in seconds
}
class CandlestickChart {
    private LinkedList<Candlestick> sticks;   // candlesticks ordered by time
}
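As an illustration, the candlestick for the in-progress interval could be updated from each execution roughly like this (a sketch based on the model above; the aggregator keeps its own running fields and the execution attributes are assumptions):
class CandlestickAggregator {
    // Running values for the in-progress interval.
    private long bucketStart = -1;
    private long open, close, high, low, volume;
    private final int intervalSeconds;

    CandlestickAggregator(int intervalSeconds) { this.intervalSeconds = intervalSeconds; }

    void onExecution(long price, long quantity, long timestamp) {
        long start = timestamp - (timestamp % intervalSeconds);
        if (start != bucketStart) {             // a new interval begins
            bucketStart = start;
            open = high = low = price;
            volume = 0;
        }
        high = Math.max(high, price);
        low = Math.min(low, price);
        close = price;                          // the last trade so far is the close
        volume += quantity;
    }
}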
Some optimizations to avoid consuming too much memory:
- Use pre-allocated ring buffers to hold sticks, to reduce the number of allocations
- Limit the number of sticks in memory and persist the rest to disk
We’ll use an in-memory columnar database (eg KDB) for real-time analytics. After market close, data is persisted to a historical database.
Step 3 - Design Deep Dive
One interesting thing to be aware of about modern exchanges is that unlike most other software, they typically run everything on one gigantic server.
Let’s explore the details.
Performance
For an exchange, it is very important to have good overall latency for all percentiles.
How can we reduce latency?
- Reduce the number of tasks on the critical path
- Shorten the time spent on each task by reducing network/disk usage and/or reducing task execution time
To achieve the first goal, we strip the critical path of all extraneous responsibilities; even logging is removed to achieve optimal latency.
If we follow the original design, there are several bottlenecks - network latency between services and disk usage of the sequencer.
With such a design, end-to-end latency is in the tens of milliseconds; we want tens of microseconds instead.
Hence, we’ll put everything on one server and processes are going to communicate via mmap as an event store:
Another optimization is using an application loop (while loop executing mission-critical tasks), pinned to the same CPU to avoid context switching:
Another side effect of using an application loop is that there is no lock contention - multiple threads fighting for the same resource.
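A rough sketch of such an application loop (busy-polling on one dedicated thread; the actual CPU pinning is OS-specific, eg via taskset or a native call, and is not shown):
import java.util.concurrent.ConcurrentLinkedQueue;

class ApplicationLoop implements Runnable {
    private final ConcurrentLinkedQueue<Runnable> tasks = new ConcurrentLinkedQueue<>();
    private volatile boolean running = true;

    void submit(Runnable task) { tasks.offer(task); }

    @Override
    public void run() {
        // A single thread executes all mission-critical tasks, so there is no lock contention.
        while (running) {
            Runnable task = tasks.poll();
            if (task != null) {
                task.run();
            }
            // Busy-spin instead of blocking to keep latency low (at the cost of a CPU core).
        }
    }

    void shutdown() { running = false; }
}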
Let’s now explore how mmap works - it is a UNIX syscall, which maps a file on disk to an application’s memory.
One trick we can use is creating the file in /dev/shm, which stands for “shared memory”. That way, there is no disk access at all.
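A minimal sketch of the idea in Java using a MappedByteBuffer (the /dev/shm path is an example; message framing, reader coordination, and recovery are omitted):
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class MmapEventStore {
    private final MappedByteBuffer buffer;

    MmapEventStore(long sizeBytes) throws Exception {
        // Backing the file with /dev/shm keeps the mapping in shared memory, so no disk I/O.
        FileChannel channel = FileChannel.open(
                Path.of("/dev/shm/exchange_events"),
                StandardOpenOption.CREATE, StandardOpenOption.READ, StandardOpenOption.WRITE);
        buffer = channel.map(FileChannel.MapMode.READ_WRITE, 0, sizeBytes);
        channel.close();   // the mapping stays valid after the channel is closed
    }

    void append(byte[] event) {
        buffer.putInt(event.length);   // naive length-prefixed framing
        buffer.put(event);
    }
}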
Event sourcing
Event sourcing is discussed in-depth in the digital wallet chapter. Reference it for all the details.
In a nutshell, instead of storing current states, we store immutable state transitions:
- On the left - traditional schema
- On the right - event source schema
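To make the state transitions concrete, the events could be modeled as immutable records along these lines (names and fields are illustrative, not taken from the chapter):
// Every state change in an order's lifecycle is captured as an immutable event.
record NewOrderEvent(long orderId, String symbol, String side, long price, long quantity) { }

record OrderFilledEvent(long orderId, long executionId, long filledQuantity, long remainingQuantity) { }

record OrderCanceledEvent(long orderId) { }
Replaying these events in sequence-ID order reconstructs the current order state, which is what makes recovery and deterministic behavior possible.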
Here’s how our design looks thus far:
- external domain interacts with our client gateway using the FIX protocol
- Order manager receives the new order event, validates it and adds it to its internal state. Order is then sent to matching core
- If the order is matched, an OrderFilledEvent is generated and sent over mmap
- Other components subscribe to the event store and do their part of the processing
One additional optimization - each component holds its own copy of the order manager, packaged as a library, to avoid extra calls for managing orders
In this design, the sequencer is no longer an event store; it becomes a single writer that sequences events before forwarding them to the event store:
High availability
We aim for 99.99% availability - only 8.64s of downtime per day.
To achieve that, we have to identify single points of failure in the exchange architecture and:
- set up backup instances of critical services (eg the matching engine) which are on stand-by
- aggressively automate failure detection and failover to the backup instance
Stateless services such as the client gateway can easily be horizontally scaled by adding more servers.
For stateful components, backup instances process inbound events but do not publish outbound events unless they are the leader:
To detect that the primary replica is down, we can rely on heartbeats.
This mechanism only works within the boundary of a single server. If we want to extend it, we can set up an entire server as a hot/warm replica and fail over to it in case of failure.
To replicate the event store across the replicas, we can use reliable UDP for faster communication.
Fault tolerance
What if even the warm instances go down? It is a low probability event but we should be ready for it.
Large tech companies tackle this problem by replicating core data to data centers in multiple cities to mitigate eg natural disasters.
Questions to consider:
- If the primary instance is down, how and when do we failover to the backup instance?
- How do we choose the leader among the backup instances?
- What is the recovery time needed (RTO - recovery time objective)?
- What functionalities need to be recovered? Can our system operate under degraded conditions?
How to address these:
- The system can go down due to a bug affecting both the primary and the replicas; chaos engineering can help surface such edge cases and disastrous outcomes
- Initially though, we could perform failovers manually until we gather sufficient knowledge about the system’s failure modes
- Leader election (eg via Raft) can be used to determine which replica becomes the leader when the primary goes down
Example of how replication works across different servers:
Example leader-election terms:
For details on how Raft works, check this out
Finally, we need to also consider loss tolerance - how much data can we lose before things get critical? This will determine how often we backup our data.
For a stock exchange, data loss is unacceptable, so we have to back up data often and rely on Raft’s replication to reduce the probability of data loss.
Matching algorithms
A slight detour on how matching works, in pseudo code:
Context handleOrder(OrderBook orderBook, OrderEvent orderEvent) {
    // Orders must arrive in sequence; otherwise we ask for a replay starting at nextSequence.
    if (orderEvent.getSequenceId() != nextSequence) {
        return ERROR(OUT_OF_ORDER, nextSequence);
    }
    if (!validateOrder(orderEvent)) {
        return ERROR(INVALID_ORDER, orderEvent);
    }
    Order order = createOrderFromEvent(orderEvent);
    switch (orderEvent.getMsgType()) {
        case NEW:
            return handleNew(orderBook, order);
        case CANCEL:
            return handleCancel(orderBook, order);
        default:
            return ERROR(INVALID_MSG_TYPE, orderEvent.getMsgType());
    }
}
Context handleNew(OrderBook orderBook, Order order) {
    // A buy order is matched against the sell book and vice versa.
    if (BUY.equals(order.side)) {
        return match(orderBook.sellBook, order);
    } else {
        return match(orderBook.buyBook, order);
    }
}
Context handleCancel(OrderBook orderBook, Order order) {
    if (!orderBook.orderMap.containsKey(order.orderId)) {
        return ERROR(CANNOT_CANCEL_ALREADY_MATCHED, order);
    }
    removeOrder(order);
    setOrderStatus(order, CANCELED);
    return SUCCESS(CANCEL_SUCCESS, order);
}
Context match(OrderBook book, Order order) {
    Quantity leavesQuantity = order.quantity - order.matchedQuantity;
    // Orders resting at the matching price level are filled in FIFO order.
    Iterator<Order> limitIter = book.limitMap.get(order.price).orders.iterator();
    while (limitIter.hasNext() && leavesQuantity > 0) {
        Order restingOrder = limitIter.next();
        Quantity matched = min(restingOrder.quantity, leavesQuantity);
        order.matchedQuantity += matched;
        restingOrder.quantity -= matched;
        leavesQuantity = order.quantity - order.matchedQuantity;
        if (restingOrder.quantity == 0) {
            limitIter.remove();          // the resting order is fully filled
        }
        generateMatchedFill();           // emit fills for both the buy and sell sides
    }
    return SUCCESS(MATCH_SUCCESS, order);
}
This matching algorithm uses FIFO (first in, first out) to determine which orders at a price level get matched first.
Determinism
Functional determinism is guaranteed via the sequencer technique we used.
The actual time when the event happens doesn’t matter:
Latency determinism is something we have to track. We can measure it by monitoring the 99th or 99.99th percentile latency.
Things that can cause latency spikes include garbage collection events in, eg, Java.
Market data publisher optimizations
The market data publisher receives matched results from the matching engine and rebuilds the order book and candlestick charts based on them.
We only keep part of the candlesticks in memory, as memory is finite. Clients can choose how granular the data they receive is; more granular data might come at a higher price:
A ring buffer (aka circular buffer) is a fixed-size queue with the head connected to the tail. The space is preallocated to avoid allocations. The data structure is also lock-free.
Another technique to optimize the ring buffer is padding, which ensures the sequence number is never in a cache line with anything else.
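A simplified single-producer, single-consumer ring buffer with padded sequence counters might look like this (a sketch; the padding fields only illustrate the cache-line idea, and production implementations typically use more robust techniques):
class PaddedRingBuffer {
    private final Object[] slots;
    private final int mask;                 // capacity must be a power of two

    // Padding around the producer and consumer sequences so they do not share a cache line.
    @SuppressWarnings("unused")
    private long p1, p2, p3, p4, p5, p6, p7;
    private volatile long writeSeq = 0;
    @SuppressWarnings("unused")
    private long q1, q2, q3, q4, q5, q6, q7;
    private volatile long readSeq = 0;

    PaddedRingBuffer(int capacity) {
        slots = new Object[capacity];       // pre-allocated, no allocations on the hot path
        mask = capacity - 1;
    }

    boolean offer(Object event) {
        if (writeSeq - readSeq == slots.length) return false;   // buffer full
        slots[(int) (writeSeq & mask)] = event;
        writeSeq++;                         // single producer, so a plain increment is safe
        return true;
    }

    Object poll() {
        if (readSeq == writeSeq) return null;                    // buffer empty
        Object event = slots[(int) (readSeq & mask)];
        readSeq++;                          // single consumer
        return event;
    }
}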
Distribution fairness of market data and multicast
We need to ensure all subscribers receive the data at the same time; if one receives data before another, it gains a crucial market insight, which it could use to manipulate the market.
To achieve this, we can use multicast using reliable UDP when publishing data to subscribers.
Data can be transported via the internet in three ways:
- Unicast - one source, one destination
- Broadcast - one source to entire subnetwork
- Multicast - one source to a set of hosts on different subnetworks
In theory, by using multicast, all subscribers should receive the data at the same time.
UDP, however, is unreliable and the data might not reach everyone. It can be enhanced with retransmissions, however.
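A minimal sketch of publishing market data over UDP multicast in Java (the group address and port are placeholders, and the retransmission layer that would make this “reliable UDP” is not shown):
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;

class MarketDataMulticaster {
    private static final String GROUP = "239.1.1.1";   // placeholder multicast group
    private static final int PORT = 5000;               // placeholder port

    public static void main(String[] args) throws Exception {
        try (DatagramSocket socket = new DatagramSocket()) {
            byte[] payload = "AAPL L1 bid=150.10 ask=150.12".getBytes();
            DatagramPacket packet = new DatagramPacket(
                    payload, payload.length, InetAddress.getByName(GROUP), PORT);
            socket.send(packet);   // every subscriber joined to the group receives this once
        }
    }
}
Subscribers would join the same group with a MulticastSocket to receive each datagram.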
Colocation
Exchanges offer brokers the ability to colocate their servers in the same data center as the exchange.
This reduces the latency drastically and can be considered a VIP service.
Network Security
DDoS is a challenge for exchanges, as there are some internet-facing services. Here are our options:
- Isolate public services and data from private services, so DDoS attacks don’t impact the most important clients
- Use a caching layer to store data which is infrequently updated
- Harden URLs against DDoS, eg prefer https://my.website.com/data/recent over https://my.website.com/data?from=123&to=456, because the former is more cacheable
- An effective allowlist/blocklist mechanism is needed
- Rate limiting can be used to mitigate DDoS, as sketched below
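A minimal token-bucket sketch, assuming per-client limits are enforced at the client gateway (the numbers are illustrative):
class TokenBucketRateLimiter {
    private final long capacity;          // maximum burst size
    private final double refillPerNano;   // tokens added per nanosecond
    private double tokens;
    private long lastRefillNanos;

    TokenBucketRateLimiter(long capacity, long tokensPerSecond) {
        this.capacity = capacity;
        this.refillPerNano = tokensPerSecond / 1_000_000_000.0;
        this.tokens = capacity;
        this.lastRefillNanos = System.nanoTime();
    }

    synchronized boolean allowRequest() {
        long now = System.nanoTime();
        tokens = Math.min(capacity, tokens + (now - lastRefillNanos) * refillPerNano);
        lastRefillNanos = now;
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;       // request allowed
        }
        return false;          // over the limit; reject or queue
    }
}
For example, new TokenBucketRateLimiter(100, 50) allows bursts of up to 100 requests and refills at 50 requests per second.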
Step 4 - Wrap Up
Other interesting notes:
- not all exchanges rely on putting everything on one big server, but some still do
- modern exchanges rely more on cloud infrastructure, and some also rely on automated market makers (AMM) to avoid maintaining an order book