Day 2: Scalability & Performance
What You'll Learn Today
- Vertical vs horizontal scaling and when to use each
- Load balancing algorithms (Round Robin, Least Connections, Consistent Hashing)
- Stateless vs stateful services and why stateless wins
- The CAP theorem and its practical implications
- Latency vs throughput trade-offs
- Availability and SLAs (99.9%, 99.99%, 99.999%)
Vertical vs Horizontal Scaling
When your system needs to handle more traffic, you have two fundamental approaches.
flowchart TB
subgraph Vertical["Vertical Scaling (Scale Up)"]
V1["Small Server<br>4 CPU, 16 GB"]
V2["Large Server<br>64 CPU, 512 GB"]
end
subgraph Horizontal["Horizontal Scaling (Scale Out)"]
H1["Server 1"]
H2["Server 2"]
H3["Server 3"]
H4["Server N..."]
end
V1 -->|"Upgrade"| V2
style Vertical fill:#8b5cf6,color:#fff
style Horizontal fill:#3b82f6,color:#fff
style V1 fill:#8b5cf6,color:#fff
style V2 fill:#8b5cf6,color:#fff
style H1 fill:#3b82f6,color:#fff
style H2 fill:#3b82f6,color:#fff
style H3 fill:#3b82f6,color:#fff
style H4 fill:#3b82f6,color:#fff
| Aspect | Vertical Scaling | Horizontal Scaling |
|---|---|---|
| Approach | Bigger machine | More machines |
| Cost | Expensive (hardware limits) | Cheaper commodity servers |
| Complexity | Simple (single server) | Complex (distributed system) |
| Downtime | Requires downtime to upgrade | Zero downtime possible |
| Upper limit | Hardware ceiling | Virtually unlimited |
| Failure | Single point of failure | Fault tolerant |
| Best for | Small to medium scale | Large scale |
In system design interviews, horizontal scaling is almost always the answer. Vertical scaling has a hard ceiling and introduces a single point of failure.
Load Balancing
A load balancer distributes incoming requests across multiple servers, ensuring no single server becomes a bottleneck.
flowchart TB
C["Clients"]
LB["Load Balancer"]
S1["Server 1"]
S2["Server 2"]
S3["Server 3"]
C --> LB
LB --> S1
LB --> S2
LB --> S3
style LB fill:#f59e0b,color:#fff
style S1 fill:#3b82f6,color:#fff
style S2 fill:#3b82f6,color:#fff
style S3 fill:#3b82f6,color:#fff
Load Balancing Algorithms
Round Robin
Requests are distributed sequentially to each server in turn. Simple and works well when all servers have equal capacity.
Request 1 → Server A
Request 2 → Server B
Request 3 → Server C
Request 4 → Server A (cycle repeats)
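A minimal Python sketch of the rotation (the server names are placeholders):

```python
import itertools

# Hypothetical pool of equal-capacity servers.
servers = ["server-a", "server-b", "server-c"]
pool = itertools.cycle(servers)  # yields A, B, C, A, B, C, ...

def route() -> str:
    """Return the next server in the rotation."""
    return next(pool)

for i in range(4):
    print(f"Request {i + 1} -> {route()}")
```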
Weighted Round Robin
Same as Round Robin, but servers with more capacity receive proportionally more requests.
Server A (weight 3): gets 3 out of every 6 requests
Server B (weight 2): gets 2 out of every 6 requests
Server C (weight 1): gets 1 out of every 6 requests
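One simple way to sketch the weighting is to repeat each server in the cycle according to its weight (an illustration, not how production balancers implement it):

```python
import itertools

# Hypothetical weights: server-a has 3x the capacity of server-c.
weights = {"server-a": 3, "server-b": 2, "server-c": 1}

# Expand each server by its weight, then rotate as in plain Round Robin.
expanded = [name for name, w in weights.items() for _ in range(w)]
pool = itertools.cycle(expanded)

for i in range(6):
    print(f"Request {i + 1} -> {next(pool)}")  # 3x a, 2x b, 1x c per 6 requests
```

Real implementations such as NGINX's smooth weighted round robin interleave the sequence so no server receives a burst of consecutive requests.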
Least Connections
Routes each request to the server with the fewest active connections. Ideal when requests have variable processing times.
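A sketch of the selection step, assuming the balancer tracks a live connection count per server:

```python
# Hypothetical connection counts maintained by the load balancer.
active_connections = {"server-a": 12, "server-b": 4, "server-c": 9}

def pick_server() -> str:
    """Route to the server currently holding the fewest active connections."""
    return min(active_connections, key=active_connections.get)

chosen = pick_server()            # "server-b"
active_connections[chosen] += 1   # count the new request against it
```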
IP Hash
Hashes the client IP to determine which server handles the request. Guarantees the same client always reaches the same server (useful for session affinity).
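A sketch of the mapping (the modulo step is the naive version):

```python
import hashlib

servers = ["server-a", "server-b", "server-c"]

def pick_server(client_ip: str) -> str:
    """Hash the client IP so the same client always reaches the same server."""
    digest = hashlib.md5(client_ip.encode()).hexdigest()
    return servers[int(digest, 16) % len(servers)]

print(pick_server("203.0.113.7"))  # deterministic for a given IP
```

Note that the modulo mapping reshuffles almost every client when a server is added or removed; that weakness is exactly what consistent hashing (next) fixes.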
Consistent Hashing
Maps both servers and requests onto a hash ring. When a server is added or removed, only a fraction of requests are remapped. This is critical for distributed caching and database sharding.
flowchart TB
subgraph Ring["Consistent Hash Ring"]
direction TB
N1["Node A<br>Position 0"]
N2["Node B<br>Position 120"]
N3["Node C<br>Position 240"]
end
K1["Key 1 β Node A"]
K2["Key 2 β Node B"]
K3["Key 3 β Node C"]
style Ring fill:#22c55e,color:#fff
style N1 fill:#3b82f6,color:#fff
style N2 fill:#8b5cf6,color:#fff
style N3 fill:#f59e0b,color:#fff
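A compact Python sketch of the ring using the standard library's bisect; the virtual-node count is an assumed tuning value used to smooth the key distribution:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Maps keys to nodes on a hash ring; only ~1/N of keys move when a node changes."""

    def __init__(self, nodes=(), vnodes: int = 100):
        self.vnodes = vnodes   # virtual nodes per physical node (assumed value)
        self._ring = []        # sorted positions on the ring
        self._owner = {}       # position -> physical node
        for node in nodes:
            self.add_node(node)

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_node(self, node: str) -> None:
        for i in range(self.vnodes):
            pos = self._hash(f"{node}#{i}")
            bisect.insort(self._ring, pos)
            self._owner[pos] = node

    def remove_node(self, node: str) -> None:
        for i in range(self.vnodes):
            pos = self._hash(f"{node}#{i}")
            self._ring.remove(pos)
            del self._owner[pos]

    def lookup(self, key: str) -> str:
        """Walk clockwise from the key's position to the next node on the ring."""
        idx = bisect.bisect(self._ring, self._hash(key)) % len(self._ring)
        return self._owner[self._ring[idx]]

ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
print(ring.lookup("user:42"))
ring.remove_node("node-b")   # only keys that lived on node-b are remapped
print(ring.lookup("user:42"))
```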
| Algorithm | Pros | Cons | Best For |
|---|---|---|---|
| Round Robin | Simple, even distribution | Ignores server load | Equal-capacity servers |
| Weighted Round Robin | Accounts for capacity | Static weights | Mixed-capacity servers |
| Least Connections | Adapts to real load | More overhead to track | Variable request times |
| IP Hash | Session affinity | Uneven with few clients | Stateful sessions |
| Consistent Hashing | Minimal disruption on changes | Complex implementation | Caching, sharding |
Where to Place Load Balancers
Load balancers can sit at multiple tiers, and each operates at Layer 4 (transport: routes on IP and TCP port, fast but content-blind) or Layer 7 (application: routes on HTTP content such as URLs, headers, and cookies):
- Between clients and web servers (commonly L7, to route on request content)
- Between web servers and application servers (often L4, for speed)
- Between application servers and databases
Stateless vs Stateful Services
This distinction fundamentally affects how you scale your system.
flowchart TB
subgraph Stateful["Stateful Service"]
SF1["Server A<br>User session stored here"]
SF2["Server B<br>Different sessions"]
end
subgraph Stateless["Stateless Service"]
SL1["Server A<br>No session data"]
SL2["Server B<br>No session data"]
Store["Shared Session Store<br>(Redis)"]
end
SL1 --> Store
SL2 --> Store
style Stateful fill:#ef4444,color:#fff
style Stateless fill:#22c55e,color:#fff
style Store fill:#f59e0b,color:#fff
style SF1 fill:#ef4444,color:#fff
style SF2 fill:#ef4444,color:#fff
style SL1 fill:#22c55e,color:#fff
style SL2 fill:#22c55e,color:#fff
| Aspect | Stateful | Stateless |
|---|---|---|
| Session storage | On the server | External store (Redis, DB) |
| Scaling | Difficult (sticky sessions) | Easy (any server can handle any request) |
| Failure handling | Session lost if server dies | No data loss on server failure |
| Load balancing | Requires session affinity | Any algorithm works |
| Deployment | Complex (drain connections) | Simple (add/remove servers freely) |
In interviews, always prefer stateless services. Move state to an external store like Redis or a database.
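A sketch of externalizing session state with the redis-py client (the key scheme and the 30-minute TTL are assumptions, not a standard):

```python
import json
import uuid

import redis  # pip install redis

r = redis.Redis(host="localhost", port=6379, db=0)
SESSION_TTL = 30 * 60  # assumed 30-minute sliding expiry

def create_session(user_id: str) -> str:
    """Store the session in Redis so any web server can serve this user."""
    session_id = str(uuid.uuid4())
    r.setex(f"session:{session_id}", SESSION_TTL, json.dumps({"user_id": user_id}))
    return session_id

def get_session(session_id: str) -> dict | None:
    raw = r.get(f"session:{session_id}")
    if raw is None:
        return None                                  # expired or unknown
    r.expire(f"session:{session_id}", SESSION_TTL)   # refresh the sliding window
    return json.loads(raw)
```

Because no web server holds the session itself, the load balancer is free to send each request anywhere, and losing a server loses no user state.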
CAP Theorem
The CAP theorem states that a distributed system can only guarantee two out of three properties simultaneously.
flowchart TB
subgraph CAP["CAP Theorem"]
C["Consistency<br>All nodes see the same data"]
A["Availability<br>Every request gets a response"]
P["Partition Tolerance<br>System works despite network failures"]
end
C --- A
A --- P
P --- C
style C fill:#3b82f6,color:#fff
style A fill:#22c55e,color:#fff
style P fill:#f59e0b,color:#fff
Since network partitions are inevitable in distributed systems, the real choice is between CP and AP:
| Type | Guarantees | Sacrifices | Example Systems |
|---|---|---|---|
| CP | Consistency + Partition Tolerance | Availability during partition | MongoDB, HBase, Redis |
| AP | Availability + Partition Tolerance | Consistency during partition | Cassandra, DynamoDB, CouchDB |
| CA | Consistency + Availability | Cannot handle partitions | Single-node RDBMS (not distributed) |
When to choose CP: Banking, inventory management, anything where stale data causes real harm.
When to choose AP: Social media feeds, product catalogs, anything where eventual consistency is acceptable.
Interview Tip: Always explain which side of CAP your design favors and why. The interviewer wants to see that you understand the trade-off.
Latency vs Throughput
These two metrics are related but measure different things.
- Latency: Time to complete a single request (milliseconds)
- Throughput: Number of requests processed per unit time (requests/second)
flowchart LR
subgraph Latency["Low Latency"]
L1["Request β 10ms β Response"]
end
subgraph Throughput["High Throughput"]
T1["1000 requests"]
T2["Processed in 1 second"]
end
style Latency fill:#3b82f6,color:#fff
style Throughput fill:#8b5cf6,color:#fff
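A toy measurement makes the distinction concrete (handle_request stands in for real work):

```python
import time

def handle_request() -> None:
    time.sleep(0.01)  # stand-in for ~10 ms of real work

N = 100
latencies = []
start = time.perf_counter()
for _ in range(N):
    t0 = time.perf_counter()
    handle_request()
    latencies.append(time.perf_counter() - t0)
elapsed = time.perf_counter() - start

print(f"avg latency: {1000 * sum(latencies) / N:.1f} ms")
print(f"throughput:  {N / elapsed:.0f} requests/second")
# Run serially, throughput is roughly 1/latency; adding concurrency
# raises throughput without improving per-request latency.
```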
Common Latency Numbers
| Operation | Latency |
|---|---|
| L1 cache reference | 0.5 ns |
| L2 cache reference | 7 ns |
| Main memory reference | 100 ns |
| SSD random read | 150 µs |
| HDD random read | 10 ms |
| Send packet within same datacenter | 0.5 ms |
| Send packet from CA to Netherlands | 150 ms |
These numbers help you reason about where bottlenecks occur: main memory is roughly 100,000x faster than an HDD random read, and a round trip within a datacenter is about 300x faster than a cross-continent one.
Improving Latency
- Cache frequently accessed data
- Use CDNs for static content
- Minimize network hops
- Use async processing for non-critical paths
Improving Throughput
- Horizontal scaling (more servers)
- Batch processing
- Connection pooling
- Asynchronous I/O
Availability and SLAs
Availability measures the percentage of time a system is operational.
Availability = Uptime / (Uptime + Downtime)
The Nines of Availability
| Availability | Downtime/Year | Downtime/Month | Downtime/Day |
|---|---|---|---|
| 99% (two nines) | 3.65 days | 7.3 hours | 14.4 min |
| 99.9% (three nines) | 8.77 hours | 43.8 min | 1.44 min |
| 99.99% (four nines) | 52.6 min | 4.38 min | 8.64 sec |
| 99.999% (five nines) | 5.26 min | 26.3 sec | 864 ms |
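The table follows from simple arithmetic; a sketch (using a 365-day year, so figures differ very slightly from the table's):

```python
SECONDS_PER_YEAR = 365 * 24 * 3600

def downtime_per_year(availability: float) -> str:
    """Allowed downtime per year for a given availability fraction."""
    seconds = (1 - availability) * SECONDS_PER_YEAR
    hours, rem = divmod(seconds, 3600)
    return f"{int(hours)} h {rem / 60:.1f} min"

for a in (0.99, 0.999, 0.9999, 0.99999):
    print(f"{a:.3%} -> {downtime_per_year(a)}")
# 99.000% -> 87 h 36.0 min
# 99.900% -> 8 h 45.6 min
# ...
```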
Calculating Availability of Combined Systems
Series (both must work):
A_total = A1 x A2
Example: 99.9% x 99.9% = 99.8%
Parallel (one must work):
A_total = 1 - (1 - A1) x (1 - A2)
Example: 1 - (0.001 x 0.001) = 99.9999%
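Both rules are one-liners, which makes questions like Exercise 1 below quick to check (a sketch):

```python
from math import prod

def series(*availabilities: float) -> float:
    """All components must work: multiply the availabilities."""
    return prod(availabilities)

def parallel(*availabilities: float) -> float:
    """At least one must work: 1 minus the product of failure probabilities."""
    return 1 - prod(1 - a for a in availabilities)

print(f"{series(0.999, 0.999):.4%}")    # 99.8001%
print(f"{parallel(0.999, 0.999):.4%}")  # 99.9999%
# Compose them: a chain whose middle tier is a redundant pair
print(f"{series(0.9999, parallel(0.999, 0.999), 0.9999):.4%}")
```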
flowchart TB
subgraph Series["Series: Both Must Work"]
S1["Service A<br>99.9%"]
S2["Service B<br>99.9%"]
S1 --> S2
end
subgraph Parallel["Parallel: One Must Work"]
P1["Service A<br>99.9%"]
P2["Service A<br>(replica) 99.9%"]
end
R1["Total: 99.8%"]
R2["Total: 99.9999%"]
Series --> R1
Parallel --> R2
style Series fill:#ef4444,color:#fff
style Parallel fill:#22c55e,color:#fff
style R1 fill:#ef4444,color:#fff
style R2 fill:#22c55e,color:#fff
style S1 fill:#ef4444,color:#fff
style S2 fill:#ef4444,color:#fff
style P1 fill:#22c55e,color:#fff
style P2 fill:#22c55e,color:#fff
Key insight: Redundancy (parallel components) dramatically improves availability. This is why we use replicas, multiple load balancers, and multi-region deployments.
Putting It All Together
Here is a scalable web application architecture that applies the concepts from today.
flowchart TB
Users["Users"]
DNS["DNS"]
CDN["CDN"]
LB["Load Balancer"]
WS1["Web Server 1"]
WS2["Web Server 2"]
WS3["Web Server 3"]
Cache["Redis Cache"]
DB_Primary["DB Primary"]
DB_Replica1["DB Replica 1"]
DB_Replica2["DB Replica 2"]
Users --> DNS
DNS --> CDN
DNS --> LB
LB --> WS1
LB --> WS2
LB --> WS3
WS1 --> Cache
WS2 --> Cache
WS3 --> Cache
Cache --> DB_Primary
DB_Primary --> DB_Replica1
DB_Primary --> DB_Replica2
style LB fill:#f59e0b,color:#fff
style WS1 fill:#3b82f6,color:#fff
style WS2 fill:#3b82f6,color:#fff
style WS3 fill:#3b82f6,color:#fff
style Cache fill:#8b5cf6,color:#fff
style DB_Primary fill:#22c55e,color:#fff
style DB_Replica1 fill:#22c55e,color:#fff
style DB_Replica2 fill:#22c55e,color:#fff
style CDN fill:#f59e0b,color:#fff
This design is:
- Horizontally scaled (multiple web servers)
- Stateless (session data in Redis)
- Highly available (replicated databases, load balancer)
- AP-leaning (eventual consistency on read replicas is acceptable for most content)
Practice Problems
Exercise 1: Basics
A web application has three components in series: a load balancer (99.99%), an application server (99.95%), and a database (99.99%). Calculate the overall availability. Then calculate the availability if the application server is duplicated in parallel.
Exercise 2: Applied
Design a scalable web application for a blog platform with 10 million DAU. Specify:
- The scaling strategy (vertical or horizontal and why)
- The load balancing algorithm and why you chose it
- Whether to use stateful or stateless servers
- Which side of CAP you favor and why
- Target availability and what redundancy you need to achieve it
Challenge
You have a system that must handle 100,000 QPS with a p99 latency of under 50ms. The current single-server setup handles 5,000 QPS at 30ms p99. Design a horizontally scaled architecture that meets the requirements. Include load balancing strategy, number of servers, caching approach, and availability analysis.
Summary
| Concept | Description |
|---|---|
| Vertical Scaling | Bigger machine; simple but limited |
| Horizontal Scaling | More machines; complex but virtually unlimited |
| Load Balancer | Distributes traffic across servers |
| Consistent Hashing | Minimizes redistribution when nodes change |
| Stateless Service | No server-side session; easy to scale |
| CAP Theorem | Choose 2 of 3: Consistency, Availability, Partition Tolerance |
| Latency | Time to complete one request |
| Throughput | Requests processed per second |
| Availability | Percentage of uptime; improved by redundancy |
Key Takeaways
- Always design for horizontal scaling - vertical scaling has a ceiling
- Stateless services are easier to scale - move state to external stores
- Understand the CAP trade-off - know when to choose CP vs AP
- Redundancy is the key to availability - parallel components dramatically reduce downtime
- Know your latency numbers - they guide where to optimize
References
- Kleppmann, Martin. Designing Data-Intensive Applications. O'Reilly Media, 2017.
- Xu, Alex. System Design Interview - An Insider's Guide. Byte Code LLC, 2020.
- Google SRE Book - Availability
- Jeff Dean's Latency Numbers
Next up: On Day 3, we dive into Database Design and Scaling - covering SQL vs NoSQL, indexing strategies, replication, sharding, and how to model data for system design interviews.