Day 2: Scalability & Performance
What You'll Learn Today
- Vertical vs horizontal scaling and when to use each
- Load balancing algorithms (Round Robin, Least Connections, Consistent Hashing)
- Stateless vs stateful services and why stateless wins
- The CAP theorem and its practical implications
- Latency vs throughput trade-offs
- Availability and SLAs (99.9%, 99.99%, 99.999%)
Vertical vs Horizontal Scaling
When your system needs to handle more traffic, you have two fundamental approaches.
flowchart TB
subgraph Vertical["Vertical Scaling (Scale Up)"]
V1["Small Server<br>4 CPU, 16 GB"]
V2["Large Server<br>64 CPU, 512 GB"]
end
subgraph Horizontal["Horizontal Scaling (Scale Out)"]
H1["Server 1"]
H2["Server 2"]
H3["Server 3"]
H4["Server N..."]
end
V1 -->|"Upgrade"| V2
style Vertical fill:#8b5cf6,color:#fff
style Horizontal fill:#3b82f6,color:#fff
style V1 fill:#8b5cf6,color:#fff
style V2 fill:#8b5cf6,color:#fff
style H1 fill:#3b82f6,color:#fff
style H2 fill:#3b82f6,color:#fff
style H3 fill:#3b82f6,color:#fff
style H4 fill:#3b82f6,color:#fff
| Aspect | Vertical Scaling | Horizontal Scaling |
|---|---|---|
| Approach | Bigger machine | More machines |
| Cost | Expensive (hardware limits) | Cheaper commodity servers |
| Complexity | Simple (single server) | Complex (distributed system) |
| Downtime | Requires downtime to upgrade | Zero downtime possible |
| Upper limit | Hardware ceiling | Virtually unlimited |
| Failure | Single point of failure | Fault tolerant |
| Best for | Small to medium scale | Large scale |
In system design interviews, horizontal scaling is almost always the answer. Vertical scaling has a hard ceiling and introduces a single point of failure.
Load Balancing
A load balancer distributes incoming requests across multiple servers, ensuring no single server becomes a bottleneck.
flowchart TB
C["Clients"]
LB["Load Balancer"]
S1["Server 1"]
S2["Server 2"]
S3["Server 3"]
C --> LB
LB --> S1
LB --> S2
LB --> S3
style LB fill:#f59e0b,color:#fff
style S1 fill:#3b82f6,color:#fff
style S2 fill:#3b82f6,color:#fff
style S3 fill:#3b82f6,color:#fff
Load Balancing Algorithms
Round Robin
Requests are distributed sequentially to each server in turn. Simple and works well when all servers have equal capacity.
Request 1 → Server A
Request 2 → Server B
Request 3 → Server C
Request 4 → Server A (cycle repeats)
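A minimal Python sketch of the rotation (the server names are placeholders):

```python
import itertools

# Hypothetical pool of equal-capacity servers.
servers = ["server-a", "server-b", "server-c"]
pool = itertools.cycle(servers)  # yields A, B, C, A, B, C, ...

def route() -> str:
    """Return the next server in the rotation."""
    return next(pool)

for i in range(4):
    print(f"Request {i + 1} -> {route()}")
```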
Weighted Round Robin
Same as Round Robin, but servers with more capacity receive proportionally more requests.
Server A (weight 3): gets 3 out of every 6 requests
Server B (weight 2): gets 2 out of every 6 requests
Server C (weight 1): gets 1 out of every 6 requests
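One simple way to sketch the weighting is to repeat each server in the cycle according to its weight (an illustration, not how production balancers implement it):

```python
import itertools

# Hypothetical weights: server-a has 3x the capacity of server-c.
weights = {"server-a": 3, "server-b": 2, "server-c": 1}

# Expand each server by its weight, then rotate as in plain Round Robin.
expanded = [name for name, w in weights.items() for _ in range(w)]
pool = itertools.cycle(expanded)

for i in range(6):
    print(f"Request {i + 1} -> {next(pool)}")  # 3x a, 2x b, 1x c per 6 requests
```

Real implementations such as NGINX's smooth weighted round robin interleave the sequence so no server receives a burst of consecutive requests.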
Least Connections
Routes each request to the server with the fewest active connections. Ideal when requests have variable processing times.
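A sketch of the selection step, assuming the balancer tracks a live connection count per server:

```python
# Hypothetical connection counts maintained by the load balancer.
active_connections = {"server-a": 12, "server-b": 4, "server-c": 9}

def pick_server() -> str:
    """Route to the server currently holding the fewest active connections."""
    return min(active_connections, key=active_connections.get)

chosen = pick_server()            # "server-b"
active_connections[chosen] += 1   # count the new request against it
```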
IP Hash
Hashes the client IP to determine which server handles the request. Guarantees the same client always reaches the same server (useful for session affinity).
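A sketch of the mapping (the modulo step is the naive version):

```python
import hashlib

servers = ["server-a", "server-b", "server-c"]

def pick_server(client_ip: str) -> str:
    """Hash the client IP so the same client always reaches the same server."""
    digest = hashlib.md5(client_ip.encode()).hexdigest()
    return servers[int(digest, 16) % len(servers)]

print(pick_server("203.0.113.7"))  # deterministic for a given IP
```

Note that the modulo mapping reshuffles almost every client when a server is added or removed; that weakness is exactly what consistent hashing (next) fixes.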
Consistent Hashing
Maps both servers and requests onto a hash ring. When a server is added or removed, only a fraction of requests are remapped. This is critical for distributed caching and database sharding.
flowchart TB
subgraph Ring["Consistent Hash Ring"]
direction TB
N1["Node A<br>Position 0"]
N2["Node B<br>Position 120"]
N3["Node C<br>Position 240"]
end
K1["Key 1 β Node A"]
K2["Key 2 β Node B"]
K3["Key 3 β Node C"]
style Ring fill:#22c55e,color:#fff
style N1 fill:#3b82f6,color:#fff
style N2 fill:#8b5cf6,color:#fff
style N3 fill:#f59e0b,color:#fff
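A compact Python sketch of the ring using the standard library's bisect; the virtual-node count is an assumed tuning value used to smooth the key distribution:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Maps keys to nodes on a hash ring; only ~1/N of keys move when a node changes."""

    def __init__(self, nodes=(), vnodes: int = 100):
        self.vnodes = vnodes   # virtual nodes per physical node (assumed value)
        self._ring = []        # sorted positions on the ring
        self._owner = {}       # position -> physical node
        for node in nodes:
            self.add_node(node)

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_node(self, node: str) -> None:
        for i in range(self.vnodes):
            pos = self._hash(f"{node}#{i}")
            bisect.insort(self._ring, pos)
            self._owner[pos] = node

    def remove_node(self, node: str) -> None:
        for i in range(self.vnodes):
            pos = self._hash(f"{node}#{i}")
            self._ring.remove(pos)
            del self._owner[pos]

    def lookup(self, key: str) -> str:
        """Walk clockwise from the key's position to the next node on the ring."""
        idx = bisect.bisect(self._ring, self._hash(key)) % len(self._ring)
        return self._owner[self._ring[idx]]

ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
print(ring.lookup("user:42"))
ring.remove_node("node-b")   # only keys that lived on node-b are remapped
print(ring.lookup("user:42"))
```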
| Algorithm | Pros | Cons | Best For |
|---|---|---|---|
| Round Robin | Simple, even distribution | Ignores server load | Equal-capacity servers |
| Weighted Round Robin | Accounts for capacity | Static weights | Mixed-capacity servers |
| Least Connections | Adapts to real load | More overhead to track | Variable request times |
| IP Hash | Session affinity | Uneven with few clients | Stateful sessions |
| Consistent Hashing | Minimal disruption on changes | Complex implementation | Caching, sharding |
Where to Place Load Balancers
Load balancers can sit at multiple tiers, and each operates at Layer 4 (transport: routes on IP and TCP port, fast but content-blind) or Layer 7 (application: routes on HTTP content such as URLs, headers, and cookies):
- Between clients and web servers (commonly L7, to route on request content)
- Between web servers and application servers (often L4, for speed)
- Between application servers and databases
Stateless vs Stateful Services
This distinction fundamentally affects how you scale your system.
flowchart TB
subgraph Stateful["Stateful Service"]
SF1["Server A<br>User session stored here"]
SF2["Server B<br>Different sessions"]
end
subgraph Stateless["Stateless Service"]
SL1["Server A<br>No session data"]
SL2["Server B<br>No session data"]
Store["Shared Session Store<br>(Redis)"]
end
SL1 --> Store
SL2 --> Store
style Stateful fill:#ef4444,color:#fff
style Stateless fill:#22c55e,color:#fff
style Store fill:#f59e0b,color:#fff
style SF1 fill:#ef4444,color:#fff
style SF2 fill:#ef4444,color:#fff
style SL1 fill:#22c55e,color:#fff
style SL2 fill:#22c55e,color:#fff
| Aspect | Stateful | Stateless |
|---|---|---|
| Session storage | On the server | External store (Redis, DB) |
| Scaling | Difficult (sticky sessions) | Easy (any server can handle any request) |
| Failure handling | Session lost if server dies | No data loss on server failure |
| Load balancing | Requires session affinity | Any algorithm works |
| Deployment | Complex (drain connections) | Simple (add/remove servers freely) |
In interviews, always prefer stateless services. Move state to an external store like Redis or a database.
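A sketch of externalizing session state with the redis-py client (the key scheme and the 30-minute TTL are assumptions, not a standard):

```python
import json
import uuid

import redis  # pip install redis

r = redis.Redis(host="localhost", port=6379, db=0)
SESSION_TTL = 30 * 60  # assumed 30-minute sliding expiry

def create_session(user_id: str) -> str:
    """Store the session in Redis so any web server can serve this user."""
    session_id = str(uuid.uuid4())
    r.setex(f"session:{session_id}", SESSION_TTL, json.dumps({"user_id": user_id}))
    return session_id

def get_session(session_id: str) -> dict | None:
    raw = r.get(f"session:{session_id}")
    if raw is None:
        return None                                  # expired or unknown
    r.expire(f"session:{session_id}", SESSION_TTL)   # refresh the sliding window
    return json.loads(raw)
```

Because no web server holds the session itself, the load balancer is free to send each request anywhere, and losing a server loses no user state.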
CAP Theorem
The CAP theorem states that a distributed system can only guarantee two out of three properties simultaneously.
flowchart TB
subgraph CAP["CAP Theorem"]
C["Consistency<br>All nodes see the same data"]
A["Availability<br>Every request gets a response"]
P["Partition Tolerance<br>System works despite network failures"]
end
C --- A
A --- P
P --- C
style C fill:#3b82f6,color:#fff
style A fill:#22c55e,color:#fff
style P fill:#f59e0b,color:#fff
Since network partitions are inevitable in distributed systems, the real choice is between CP and AP:
| Type | Guarantees | Sacrifices | Example Systems |
|---|---|---|---|
| CP | Consistency + Partition Tolerance | Availability during partition | MongoDB, HBase, Redis |
| AP | Availability + Partition Tolerance | Consistency during partition | Cassandra, DynamoDB, CouchDB |
| CA | Consistency + Availability | Cannot handle partitions | Single-node RDBMS (not distributed) |
When to choose CP: Banking, inventory management, anything where stale data causes real harm.
When to choose AP: Social media feeds, product catalogs, anything where eventual consistency is acceptable.
Interview Tip: Always explain which side of CAP your design favors and why. The interviewer wants to see that you understand the trade-off.
Latency vs Throughput
These two metrics are related but measure different things.
- Latency: Time to complete a single request (milliseconds)
- Throughput: Number of requests processed per unit time (requests/second)
flowchart LR
subgraph Latency["Low Latency"]
L1["Request β 10ms β Response"]
end
subgraph Throughput["High Throughput"]
T1["1000 requests"]
T2["Processed in 1 second"]
end
style Latency fill:#3b82f6,color:#fff
style Throughput fill:#8b5cf6,color:#fff
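A toy measurement makes the distinction concrete (handle_request stands in for real work):

```python
import time

def handle_request() -> None:
    time.sleep(0.01)  # stand-in for ~10 ms of real work

N = 100
latencies = []
start = time.perf_counter()
for _ in range(N):
    t0 = time.perf_counter()
    handle_request()
    latencies.append(time.perf_counter() - t0)
elapsed = time.perf_counter() - start

print(f"avg latency: {1000 * sum(latencies) / N:.1f} ms")
print(f"throughput:  {N / elapsed:.0f} requests/second")
# Run serially, throughput is roughly 1/latency; adding concurrency
# raises throughput without improving per-request latency.
```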
Common Latency Numbers
| Operation | Latency |
|---|---|
| L1 cache reference | 0.5 ns |
| L2 cache reference | 7 ns |
| Main memory reference | 100 ns |
| SSD random read | 150 µs |
| HDD random read | 10 ms |
| Send packet within same datacenter | 0.5 ms |
| Send packet from CA to Netherlands | 150 ms |
These numbers help you reason about where bottlenecks occur: main memory is roughly 100,000x faster than an HDD random read, and a round trip within a datacenter is about 300x faster than a cross-continent one.
Improving Latency
- Cache frequently accessed data
- Use CDNs for static content
- Minimize network hops
- Use async processing for non-critical paths
Improving Throughput
- Horizontal scaling (more servers)
- Batch processing
- Connection pooling
- Asynchronous I/O
Availability and SLAs
Availability measures the percentage of time a system is operational.
Availability = Uptime / (Uptime + Downtime)
The Nines of Availability
| Availability | Downtime/Year | Downtime/Month | Downtime/Day |
|---|---|---|---|
| 99% (two nines) | 3.65 days | 7.3 hours | 14.4 min |
| 99.9% (three nines) | 8.77 hours | 43.8 min | 1.44 min |
| 99.99% (four nines) | 52.6 min | 4.38 min | 8.64 sec |
| 99.999% (five nines) | 5.26 min | 26.3 sec | 864 ms |
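The table follows from simple arithmetic; a sketch (using a 365-day year, so figures differ very slightly from the table's):

```python
SECONDS_PER_YEAR = 365 * 24 * 3600

def downtime_per_year(availability: float) -> str:
    """Allowed downtime per year for a given availability fraction."""
    seconds = (1 - availability) * SECONDS_PER_YEAR
    hours, rem = divmod(seconds, 3600)
    return f"{int(hours)} h {rem / 60:.1f} min"

for a in (0.99, 0.999, 0.9999, 0.99999):
    print(f"{a:.3%} -> {downtime_per_year(a)}")
# 99.000% -> 87 h 36.0 min
# 99.900% -> 8 h 45.6 min
# ...
```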
Calculating Availability of Combined Systems
Series (both must work):
A_total = A1 x A2
Example: 99.9% x 99.9% = 99.8%
Parallel (one must work):
A_total = 1 - (1 - A1) x (1 - A2)
Example: 1 - (0.001 x 0.001) = 99.9999%
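Both rules are one-liners, which makes questions like Exercise 1 below quick to check (a sketch):

```python
from math import prod

def series(*availabilities: float) -> float:
    """All components must work: multiply the availabilities."""
    return prod(availabilities)

def parallel(*availabilities: float) -> float:
    """At least one must work: 1 minus the product of failure probabilities."""
    return 1 - prod(1 - a for a in availabilities)

print(f"{series(0.999, 0.999):.4%}")    # 99.8001%
print(f"{parallel(0.999, 0.999):.4%}")  # 99.9999%
# Compose them: a chain whose middle tier is a redundant pair
print(f"{series(0.9999, parallel(0.999, 0.999), 0.9999):.4%}")
```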
flowchart TB
subgraph Series["Series: Both Must Work"]
S1["Service A<br>99.9%"]
S2["Service B<br>99.9%"]
S1 --> S2
end
subgraph Parallel["Parallel: One Must Work"]
P1["Service A<br>99.9%"]
P2["Service A<br>(replica) 99.9%"]
end
R1["Total: 99.8%"]
R2["Total: 99.9999%"]
Series --> R1
Parallel --> R2
style Series fill:#ef4444,color:#fff
style Parallel fill:#22c55e,color:#fff
style R1 fill:#ef4444,color:#fff
style R2 fill:#22c55e,color:#fff
style S1 fill:#ef4444,color:#fff
style S2 fill:#ef4444,color:#fff
style P1 fill:#22c55e,color:#fff
style P2 fill:#22c55e,color:#fff
Key insight: Redundancy (parallel components) dramatically improves availability. This is why we use replicas, multiple load balancers, and multi-region deployments.
Putting It All Together
Here is a scalable web application architecture that applies the concepts from today.
flowchart TB
Users["Users"]
DNS["DNS"]
CDN["CDN"]
LB["Load Balancer"]
WS1["Web Server 1"]
WS2["Web Server 2"]
WS3["Web Server 3"]
Cache["Redis Cache"]
DB_Primary["DB Primary"]
DB_Replica1["DB Replica 1"]
DB_Replica2["DB Replica 2"]
Users --> DNS
DNS --> CDN
DNS --> LB
LB --> WS1
LB --> WS2
LB --> WS3
WS1 --> Cache
WS2 --> Cache
WS3 --> Cache
Cache --> DB_Primary
DB_Primary --> DB_Replica1
DB_Primary --> DB_Replica2
style LB fill:#f59e0b,color:#fff
style WS1 fill:#3b82f6,color:#fff
style WS2 fill:#3b82f6,color:#fff
style WS3 fill:#3b82f6,color:#fff
style Cache fill:#8b5cf6,color:#fff
style DB_Primary fill:#22c55e,color:#fff
style DB_Replica1 fill:#22c55e,color:#fff
style DB_Replica2 fill:#22c55e,color:#fff
style CDN fill:#f59e0b,color:#fff
This design is:
- Horizontally scaled (multiple web servers)
- Stateless (session data in Redis)
- Highly available (replicated databases, load balancer)
- AP-leaning (eventual consistency on read replicas is acceptable for most content)
Practice Problems
Exercise 1: Basics
A web application has three components in series: a load balancer (99.99%), an application server (99.95%), and a database (99.99%). Calculate the overall availability. Then calculate the availability if the application server is duplicated in parallel.
Exercise 2: Applied
Design a scalable web application for a blog platform with 10 million DAU. Specify:
- The scaling strategy (vertical or horizontal and why)
- The load balancing algorithm and why you chose it
- Whether to use stateful or stateless servers
- Which side of CAP you favor and why
- Target availability and what redundancy you need to achieve it
Challenge
You have a system that must handle 100,000 QPS with a p99 latency of under 50ms. The current single-server setup handles 5,000 QPS at 30ms p99. Design a horizontally scaled architecture that meets the requirements. Include load balancing strategy, number of servers, caching approach, and availability analysis.
Summary
| Concept | Description |
|---|---|
| Vertical Scaling | Bigger machine; simple but limited |
| Horizontal Scaling | More machines; complex but virtually unlimited |
| Load Balancer | Distributes traffic across servers |
| Consistent Hashing | Minimizes redistribution when nodes change |
| Stateless Service | No server-side session; easy to scale |
| CAP Theorem | Choose 2 of 3: Consistency, Availability, Partition Tolerance |
| Latency | Time to complete one request |
| Throughput | Requests processed per second |
| Availability | Percentage of uptime; improved by redundancy |
Key Takeaways
- Always design for horizontal scaling - vertical scaling has a ceiling
- Stateless services are easier to scale - move state to external stores
- Understand the CAP trade-off - know when to choose CP vs AP
- Redundancy is the key to availability - parallel components dramatically reduce downtime
- Know your latency numbers - they guide where to optimize
References
- Kleppmann, Martin. Designing Data-Intensive Applications. O'Reilly Media, 2017.
- Xu, Alex. System Design Interview - An Insider's Guide. Byte Code LLC, 2020.
- Google SRE Book - Availability
- Jeff Dean's Latency Numbers
Next up: On Day 3, we dive into Database Design and Scaling - covering SQL vs NoSQL, indexing strategies, replication, sharding, and how to model data for system design interviews.