
Day 2: Scalability & Performance

What You'll Learn Today

  • Vertical vs horizontal scaling and when to use each
  • Load balancing algorithms (Round Robin, Least Connections, Consistent Hashing)
  • Stateless vs stateful services and why stateless wins
  • The CAP theorem and its practical implications
  • Latency vs throughput trade-offs
  • Availability and SLAs (99.9%, 99.99%, 99.999%)

Vertical vs Horizontal Scaling

When your system needs to handle more traffic, you have two fundamental approaches.

flowchart TB
    subgraph Vertical["Vertical Scaling (Scale Up)"]
        V1["Small Server<br>4 CPU, 16 GB"]
        V2["Large Server<br>64 CPU, 512 GB"]
    end
    subgraph Horizontal["Horizontal Scaling (Scale Out)"]
        H1["Server 1"]
        H2["Server 2"]
        H3["Server 3"]
        H4["Server N..."]
    end
    V1 -->|"Upgrade"| V2
    style Vertical fill:#8b5cf6,color:#fff
    style Horizontal fill:#3b82f6,color:#fff
    style V1 fill:#8b5cf6,color:#fff
    style V2 fill:#8b5cf6,color:#fff
    style H1 fill:#3b82f6,color:#fff
    style H2 fill:#3b82f6,color:#fff
    style H3 fill:#3b82f6,color:#fff
    style H4 fill:#3b82f6,color:#fff
Aspect | Vertical Scaling | Horizontal Scaling
Approach | Bigger machine | More machines
Cost | Expensive (hardware limits) | Cheaper commodity servers
Complexity | Simple (single server) | Complex (distributed system)
Downtime | Requires downtime to upgrade | Zero downtime possible
Upper limit | Hardware ceiling | Virtually unlimited
Failure | Single point of failure | Fault tolerant
Best for | Small to medium scale | Large scale

In system design interviews, horizontal scaling is almost always the answer. Vertical scaling has a hard ceiling and introduces a single point of failure.


Load Balancing

A load balancer distributes incoming requests across multiple servers, ensuring no single server becomes a bottleneck.

flowchart TB
    C["Clients"]
    LB["Load Balancer"]
    S1["Server 1"]
    S2["Server 2"]
    S3["Server 3"]
    C --> LB
    LB --> S1
    LB --> S2
    LB --> S3
    style LB fill:#f59e0b,color:#fff
    style S1 fill:#3b82f6,color:#fff
    style S2 fill:#3b82f6,color:#fff
    style S3 fill:#3b82f6,color:#fff

Load Balancing Algorithms

Round Robin

Requests are distributed sequentially to each server in turn. Simple and works well when all servers have equal capacity.

Request 1 → Server A
Request 2 → Server B
Request 3 → Server C
Request 4 → Server A  (cycle repeats)
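
A minimal Round Robin picker is just a rotation over the server list. Here is a sketch in Python (server names are placeholders):

# Round Robin: hand requests to servers in a fixed rotation, ignoring load.
from itertools import cycle

servers = ["server-a", "server-b", "server-c"]
rotation = cycle(servers)

def pick_server():
    # Each call returns the next server in the cycle.
    return next(rotation)

for i in range(4):
    print(f"Request {i + 1} -> {pick_server()}")
# Request 4 lands back on server-a, and the cycle repeats.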

Weighted Round Robin

Same as Round Robin but servers with more capacity receive proportionally more requests.

Server A (weight 3): gets 3 out of every 6 requests
Server B (weight 2): gets 2 out of every 6 requests
Server C (weight 1): gets 1 out of every 6 requests
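
One simple way to sketch this is to repeat each server in the rotation according to its weight (the weights are illustrative; real balancers use smoother scheduling schemes):

from itertools import cycle

weights = {"server-a": 3, "server-b": 2, "server-c": 1}
# Expand each server into the rotation in proportion to its weight.
rotation = cycle([server for server, w in weights.items() for _ in range(w)])

for i in range(6):
    print(f"Request {i + 1} -> {next(rotation)}")
# server-a receives 3 of every 6 requests, server-b 2, server-c 1.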

Least Connections

Routes each request to the server with the fewest active connections. Ideal when requests have variable processing times.
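
The selection step is a single min() over a per-server connection count. A sketch, assuming the balancer already tracks active connections in a dict:

# Least Connections: route to the server with the fewest in-flight requests.
active_connections = {"server-a": 12, "server-b": 4, "server-c": 9}

def pick_server(connections):
    return min(connections, key=connections.get)

chosen = pick_server(active_connections)
active_connections[chosen] += 1   # the new request now counts against that server
print(chosen)                     # server-b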

IP Hash

Hashes the client IP to determine which server handles the request. Guarantees the same client always reaches the same server (useful for session affinity).
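
A sketch of the routing function. Note that hashing modulo the server count remaps most clients whenever a server is added or removed, which is exactly the problem consistent hashing addresses next:

import hashlib

servers = ["server-a", "server-b", "server-c"]

def pick_server(client_ip):
    # A stable hash of the client IP always maps it to the same index.
    digest = hashlib.md5(client_ip.encode()).hexdigest()
    return servers[int(digest, 16) % len(servers)]

print(pick_server("203.0.113.7"))   # same client, same server, every time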

Consistent Hashing

Maps both servers and requests onto a hash ring. When a server is added or removed, only a fraction of requests are remapped. This is critical for distributed caching and database sharding.

flowchart TB
    subgraph Ring["Consistent Hash Ring"]
        direction TB
        N1["Node A<br>Position 0"]
        N2["Node B<br>Position 120"]
        N3["Node C<br>Position 240"]
    end
    K1["Key 1 β†’ Node A"]
    K2["Key 2 β†’ Node B"]
    K3["Key 3 β†’ Node C"]
    style Ring fill:#22c55e,color:#fff
    style N1 fill:#3b82f6,color:#fff
    style N2 fill:#8b5cf6,color:#fff
    style N3 fill:#f59e0b,color:#fff
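
Here is a minimal consistent-hash ring in Python, a sketch without the virtual nodes that production implementations add to even out key distribution:

import bisect
import hashlib

def ring_hash(key):
    # Map any string to a stable position on the ring.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, nodes=()):
        self.ring = []   # sorted list of (position, node)
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        bisect.insort(self.ring, (ring_hash(node), node))

    def remove_node(self, node):
        self.ring.remove((ring_hash(node), node))

    def get_node(self, key):
        # Walk clockwise to the first node at or past the key's position.
        position = ring_hash(key)
        index = bisect.bisect(self.ring, (position,)) % len(self.ring)
        return self.ring[index][1]

ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
print(ring.get_node("user:42"))
ring.add_node("node-d")   # only keys between node-d and its predecessor move
print(ring.get_node("user:42"))

Adding or removing a node touches only the keys in one arc of the ring, which is why the comparison below lists "minimal disruption" as its main advantage.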
Algorithm | Pros | Cons | Best For
Round Robin | Simple, even distribution | Ignores server load | Equal-capacity servers
Weighted Round Robin | Accounts for capacity | Static weights | Mixed-capacity servers
Least Connections | Adapts to real load | More overhead to track | Variable request times
IP Hash | Session affinity | Uneven with few clients | Stateful sessions
Consistent Hashing | Minimal disruption on changes | Complex implementation | Caching, sharding

Where to Place Load Balancers

Load balancers can be placed at multiple tiers of the stack:

  1. Between clients and web servers
  2. Between web servers and application servers
  3. Between application servers and databases

At any of these tiers, the balancer itself can operate at Layer 4 (transport: routing on IP and port, fast and protocol-agnostic) or Layer 7 (application: inspecting HTTP to route by path, header, or cookie).

Stateless vs Stateful Services

This distinction fundamentally affects how you scale your system.

flowchart TB
    subgraph Stateful["Stateful Service"]
        SF1["Server A<br>User session stored here"]
        SF2["Server B<br>Different sessions"]
    end
    subgraph Stateless["Stateless Service"]
        SL1["Server A<br>No session data"]
        SL2["Server B<br>No session data"]
        Store["Shared Session Store<br>(Redis)"]
    end
    SL1 --> Store
    SL2 --> Store
    style Stateful fill:#ef4444,color:#fff
    style Stateless fill:#22c55e,color:#fff
    style Store fill:#f59e0b,color:#fff
    style SF1 fill:#ef4444,color:#fff
    style SF2 fill:#ef4444,color:#fff
    style SL1 fill:#22c55e,color:#fff
    style SL2 fill:#22c55e,color:#fff
Aspect | Stateful | Stateless
Session storage | On the server | External store (Redis, DB)
Scaling | Difficult (sticky sessions) | Easy (any server can handle any request)
Failure handling | Session lost if server dies | No data loss on server failure
Load balancing | Requires session affinity | Any algorithm works
Deployment | Complex (drain connections) | Simple (add/remove servers freely)

In interviews, always prefer stateless services. Move state to an external store like Redis or a database.
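
A sketch of what "move state to an external store" looks like with the redis-py client (the hostname, key names, and TTL are illustrative):

import json
import redis   # redis-py; any shared store would work the same way

# Every web server talks to the same session store, so any server
# can serve any request for any user.
store = redis.Redis(host="session-store.internal", port=6379)

def save_session(session_id, data, ttl_seconds=3600):
    store.setex(f"session:{session_id}", ttl_seconds, json.dumps(data))

def load_session(session_id):
    raw = store.get(f"session:{session_id}")
    return json.loads(raw) if raw is not None else None

save_session("abc123", {"user_id": 42, "cart": ["sku-1"]})
print(load_session("abc123"))

If the server handling the next request dies, nothing is lost: the session lives in Redis, not on the server.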


CAP Theorem

The CAP theorem states that a distributed system can only guarantee two out of three properties simultaneously.

flowchart TB
    subgraph CAP["CAP Theorem"]
        C["Consistency<br>All nodes see the same data"]
        A["Availability<br>Every request gets a response"]
        P["Partition Tolerance<br>System works despite network failures"]
    end
    C --- A
    A --- P
    P --- C
    style C fill:#3b82f6,color:#fff
    style A fill:#22c55e,color:#fff
    style P fill:#f59e0b,color:#fff

Since network partitions are inevitable in distributed systems, the real choice is between CP and AP:

Type | Guarantees | Sacrifices | Example Systems
CP | Consistency + Partition Tolerance | Availability during partition | MongoDB, HBase, Redis
AP | Availability + Partition Tolerance | Consistency during partition | Cassandra, DynamoDB, CouchDB
CA | Consistency + Availability | Cannot handle partitions | Single-node RDBMS (not distributed)

When to choose CP: Banking, inventory management, anything where stale data causes real harm.

When to choose AP: Social media feeds, product catalogs, anything where eventual consistency is acceptable.

Interview Tip: Always explain which side of CAP your design favors and why. The interviewer wants to see that you understand the trade-off.


Latency vs Throughput

These two metrics are related but measure different things.

  • Latency: Time to complete a single request (milliseconds)
  • Throughput: Number of requests processed per unit time (requests/second)
flowchart LR
    subgraph Latency["Low Latency"]
        L1["Request β†’ 10ms β†’ Response"]
    end
    subgraph Throughput["High Throughput"]
        T1["1000 requests"]
        T2["Processed in 1 second"]
    end
    style Latency fill:#3b82f6,color:#fff
    style Throughput fill:#8b5cf6,color:#fff
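
The two are linked by Little's Law: the number of requests in flight equals throughput times latency. A quick back-of-envelope helper (the numbers are illustrative):

def required_concurrency(throughput_rps, latency_seconds):
    # Little's Law: in-flight requests = arrival rate x time each request takes.
    return throughput_rps * latency_seconds

# Sustaining 1,000 requests/second at 50 ms each means
# the system must hold 50 requests in flight at once.
print(required_concurrency(1000, 0.050))   # 50.0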

Common Latency Numbers

Operation | Latency
L1 cache reference | 0.5 ns
L2 cache reference | 7 ns
Main memory reference | 100 ns
SSD random read | 150 µs
HDD random read | 10 ms
Send packet within same datacenter | 0.5 ms
Send packet from CA to Netherlands | 150 ms

These numbers help you reason about where bottlenecks occur. Memory is 100,000x faster than disk. Network calls within a datacenter are 300x faster than cross-continent calls.
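
Those two ratios fall straight out of the table; a quick check:

# Rough ratios from the latency table above.
main_memory_ns = 100
hdd_read_ns = 10_000_000            # 10 ms
same_dc_ns = 500_000                # 0.5 ms
cross_continent_ns = 150_000_000    # 150 ms

print(hdd_read_ns // main_memory_ns)      # 100,000x: memory vs disk
print(cross_continent_ns // same_dc_ns)   # 300x: same datacenter vs cross-continent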

Improving Latency

  • Cache frequently accessed data
  • Use CDNs for static content
  • Minimize network hops
  • Use async processing for non-critical paths

Improving Throughput

  • Horizontal scaling (more servers)
  • Batch processing
  • Connection pooling
  • Asynchronous I/O

Availability and SLAs

Availability measures the percentage of time a system is operational.

Availability = Uptime / (Uptime + Downtime)

The Nines of Availability

Availability | Downtime/Year | Downtime/Month | Downtime/Day
99% (two nines) | 3.65 days | 7.3 hours | 14.4 min
99.9% (three nines) | 8.77 hours | 43.8 min | 1.44 min
99.99% (four nines) | 52.6 min | 4.38 min | 8.64 sec
99.999% (five nines) | 5.26 min | 26.3 sec | 864 ms
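
The downtime columns follow directly from the availability percentage; a sketch (using a 365.25-day year, as the table does):

SECONDS_PER_YEAR = 365.25 * 24 * 3600

def downtime_per_year_seconds(availability_percent):
    # Whatever fraction of the year the system is not up is downtime.
    return (1 - availability_percent / 100) * SECONDS_PER_YEAR

print(downtime_per_year_seconds(99.9) / 3600)   # ~8.77 hours
print(downtime_per_year_seconds(99.99) / 60)    # ~52.6 minutes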

Calculating Availability of Combined Systems

Series (both must work):

A_total = A1 x A2
Example: 99.9% x 99.9% = 99.8%

Parallel (one must work):

A_total = 1 - (1 - A1) x (1 - A2)
Example: 1 - (0.001 x 0.001) = 99.9999%
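
Both formulas are one-liners; this sketch reproduces the two examples above:

import math

def series(*availabilities):
    # Every component must be up, so multiply the availabilities.
    return math.prod(availabilities)

def parallel(*availabilities):
    # The pair is down only when every replica is down at the same time.
    return 1 - math.prod(1 - a for a in availabilities)

print(series(0.999, 0.999))     # 0.998001  -> ~99.8%
print(parallel(0.999, 0.999))   # 0.999999  -> 99.9999%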
flowchart TB
    subgraph Series["Series: Both Must Work"]
        S1["Service A<br>99.9%"]
        S2["Service B<br>99.9%"]
        S1 --> S2
    end
    subgraph Parallel["Parallel: One Must Work"]
        P1["Service A<br>99.9%"]
        P2["Service A<br>(replica) 99.9%"]
    end
    R1["Total: 99.8%"]
    R2["Total: 99.9999%"]
    Series --> R1
    Parallel --> R2
    style Series fill:#ef4444,color:#fff
    style Parallel fill:#22c55e,color:#fff
    style R1 fill:#ef4444,color:#fff
    style R2 fill:#22c55e,color:#fff
    style S1 fill:#ef4444,color:#fff
    style S2 fill:#ef4444,color:#fff
    style P1 fill:#22c55e,color:#fff
    style P2 fill:#22c55e,color:#fff

Key insight: Redundancy (parallel components) dramatically improves availability. This is why we use replicas, multiple load balancers, and multi-region deployments.


Putting It All Together

Here is a scalable web application architecture that applies the concepts from today.

flowchart TB
    Users["Users"]
    DNS["DNS"]
    CDN["CDN"]
    LB["Load Balancer"]
    WS1["Web Server 1"]
    WS2["Web Server 2"]
    WS3["Web Server 3"]
    Cache["Redis Cache"]
    DB_Primary["DB Primary"]
    DB_Replica1["DB Replica 1"]
    DB_Replica2["DB Replica 2"]

    Users --> DNS
    DNS --> CDN
    DNS --> LB
    LB --> WS1
    LB --> WS2
    LB --> WS3
    WS1 --> Cache
    WS2 --> Cache
    WS3 --> Cache
    Cache --> DB_Primary
    DB_Primary --> DB_Replica1
    DB_Primary --> DB_Replica2

    style LB fill:#f59e0b,color:#fff
    style WS1 fill:#3b82f6,color:#fff
    style WS2 fill:#3b82f6,color:#fff
    style WS3 fill:#3b82f6,color:#fff
    style Cache fill:#8b5cf6,color:#fff
    style DB_Primary fill:#22c55e,color:#fff
    style DB_Replica1 fill:#22c55e,color:#fff
    style DB_Replica2 fill:#22c55e,color:#fff
    style CDN fill:#f59e0b,color:#fff

This design is:

  • Horizontally scaled (multiple web servers)
  • Stateless (session data in Redis)
  • Highly available (replicated databases, load balancer)
  • AP-leaning (eventual consistency on read replicas is acceptable for most content)

Practice Problems

Exercise 1: Basics

A web application has three components in series: a load balancer (99.99%), an application server (99.95%), and a database (99.99%). Calculate the overall availability. Then calculate the availability if the application server is duplicated in parallel.

Exercise 2: Applied

Design a scalable web application for a blog platform with 10 million DAU. Specify:

  1. The scaling strategy (vertical or horizontal and why)
  2. The load balancing algorithm and why you chose it
  3. Whether to use stateful or stateless servers
  4. Which side of CAP you favor and why
  5. Target availability and what redundancy you need to achieve it

Challenge

You have a system that must handle 100,000 QPS with a p99 latency of under 50ms. The current single-server setup handles 5,000 QPS at 30ms p99. Design a horizontally scaled architecture that meets the requirements. Include load balancing strategy, number of servers, caching approach, and availability analysis.


Summary

Concept | Description
Vertical Scaling | Bigger machine; simple but limited
Horizontal Scaling | More machines; complex but virtually unlimited
Load Balancer | Distributes traffic across servers
Consistent Hashing | Minimizes redistribution when nodes change
Stateless Service | No server-side session; easy to scale
CAP Theorem | Choose 2 of 3: Consistency, Availability, Partition Tolerance
Latency | Time to complete one request
Throughput | Requests processed per second
Availability | Percentage of uptime; improved by redundancy

Key Takeaways

  1. Always design for horizontal scaling - vertical scaling has a ceiling
  2. Stateless services are easier to scale - move state to external stores
  3. Understand the CAP trade-off - know when to choose CP vs AP
  4. Redundancy is the key to availability - parallel components dramatically reduce downtime
  5. Know your latency numbers - they guide where to optimize

Next up: On Day 3, we dive into Database Design and Scaling - covering SQL vs NoSQL, indexing strategies, replication, sharding, and how to model data for system design interviews.