Learn System Design in 10 DaysDay 3: Database Design & Scaling
books.chapter 3Learn System Design in 10 Days

Day 3: Database Design & Scaling

What You'll Learn Today

  • SQL vs NoSQL databases and when to use each
  • How database indexing works and why it matters
  • Replication strategies (leader-follower, multi-leader)
  • Sharding and partitioning (hash, range, geographic)
  • ACID vs BASE consistency models
  • Data modeling techniques for system design interviews

SQL vs NoSQL

Choosing the right database is one of the most important decisions in system design. The choice depends on your data model, query patterns, and scale requirements.

flowchart TB
    subgraph SQL["SQL (Relational)"]
        S1["Structured Data"]
        S2["ACID Transactions"]
        S3["Complex Queries (JOIN)"]
    end
    subgraph NoSQL["NoSQL (Non-Relational)"]
        N1["Flexible Schema"]
        N2["Horizontal Scaling"]
        N3["High Throughput"]
    end
    style SQL fill:#3b82f6,color:#fff
    style NoSQL fill:#8b5cf6,color:#fff
    style S1 fill:#3b82f6,color:#fff
    style S2 fill:#3b82f6,color:#fff
    style S3 fill:#3b82f6,color:#fff
    style N1 fill:#8b5cf6,color:#fff
    style N2 fill:#8b5cf6,color:#fff
    style N3 fill:#8b5cf6,color:#fff
Aspect SQL NoSQL
Data model Tables with rows and columns Documents, key-value, wide-column, graph
Schema Fixed schema (predefined) Dynamic schema (flexible)
Scaling Primarily vertical Designed for horizontal
Transactions Strong ACID support Limited (varies by system)
Joins Native support Typically no joins
Query language SQL (standardized) Database-specific APIs
Best for Complex relationships, transactions High throughput, flexible data

Types of NoSQL Databases

Type Examples Best For Data Model
Key-Value Redis, DynamoDB Caching, sessions Key β†’ Value
Document MongoDB, CouchDB Content management, catalogs Key β†’ JSON document
Wide-Column Cassandra, HBase Time-series, IoT, analytics Row key β†’ Column families
Graph Neo4j, Amazon Neptune Social networks, recommendations Nodes + Edges

When to Choose What

Choose SQL when:

  • Data has clear relationships (users, orders, products)
  • You need ACID transactions (banking, inventory)
  • You need complex queries with JOINs
  • Data schema is stable

Choose NoSQL when:

  • Schema changes frequently
  • You need massive horizontal scale
  • Data is denormalized or semi-structured
  • Low-latency reads/writes are critical

Interview Tip: Many real systems use both. For example, use PostgreSQL for user accounts (relational, transactional) and Cassandra for activity feeds (high write throughput, time-series).


Database Indexing

An index is a data structure that speeds up data retrieval at the cost of additional storage and slower writes.

flowchart LR
    subgraph Without["Without Index"]
        W1["Scan ALL rows<br>O(n)"]
    end
    subgraph With["With Index (B-Tree)"]
        I1["Binary search<br>O(log n)"]
    end
    style Without fill:#ef4444,color:#fff
    style With fill:#22c55e,color:#fff
    style W1 fill:#ef4444,color:#fff
    style I1 fill:#22c55e,color:#fff

How B-Tree Indexes Work

Most SQL databases use B-Tree indexes. The data is organized in a balanced tree where:

  • Each node contains sorted keys
  • Leaf nodes point to the actual data rows
  • Lookups, inserts, and deletes are O(log n)

Types of Indexes

Index Type Description Use Case
Primary Unique, auto-created on primary key Row identification
Secondary Created on non-primary columns Speeding up queries on any column
Composite Index on multiple columns Queries filtering on multiple fields
Unique Enforces uniqueness Email, username fields
Full-text Optimized for text search Search functionality

Indexing Trade-offs

Benefit Cost
Faster reads (O(log n) vs O(n)) Slower writes (index must be updated)
Efficient sorting Additional storage space
Quick lookups More indexes = more maintenance

Rule of thumb: Index columns that appear in WHERE, JOIN, and ORDER BY clauses. Do not over-index - every index slows down writes.


Replication

Replication copies data across multiple servers for fault tolerance and read scalability.

Leader-Follower (Master-Slave) Replication

The most common replication strategy. One leader handles all writes; followers replicate the data and serve reads.

flowchart TB
    App["Application"]
    Leader["Leader (Primary)<br>Reads + Writes"]
    F1["Follower 1<br>Reads only"]
    F2["Follower 2<br>Reads only"]
    F3["Follower 3<br>Reads only"]
    App -->|"Writes"| Leader
    App -->|"Reads"| F1
    App -->|"Reads"| F2
    App -->|"Reads"| F3
    Leader -->|"Replication"| F1
    Leader -->|"Replication"| F2
    Leader -->|"Replication"| F3
    style Leader fill:#f59e0b,color:#fff
    style F1 fill:#3b82f6,color:#fff
    style F2 fill:#3b82f6,color:#fff
    style F3 fill:#3b82f6,color:#fff

Synchronous replication: Leader waits for follower confirmation before acknowledging the write. Strong consistency but higher latency.

Asynchronous replication: Leader acknowledges the write immediately. Lower latency but risk of data loss if the leader fails.

Multi-Leader Replication

Multiple nodes accept writes. Used for multi-datacenter deployments.

flowchart TB
    subgraph DC1["Datacenter 1"]
        L1["Leader 1"]
    end
    subgraph DC2["Datacenter 2"]
        L2["Leader 2"]
    end
    L1 <-->|"Bi-directional<br>replication"| L2
    style DC1 fill:#3b82f6,color:#fff
    style DC2 fill:#8b5cf6,color:#fff
    style L1 fill:#3b82f6,color:#fff
    style L2 fill:#8b5cf6,color:#fff
Strategy Writes Consistency Complexity Use Case
Leader-Follower Single leader Strong (sync) or eventual (async) Low Most applications
Multi-Leader Multiple leaders Eventual (conflict resolution needed) High Multi-datacenter
Leaderless Any node Eventual (quorum reads/writes) High Cassandra, DynamoDB

Sharding (Partitioning)

Sharding splits data across multiple database instances. Each shard holds a subset of the data.

flowchart TB
    App["Application"]
    Router["Shard Router"]
    S1["Shard 1<br>Users A-H"]
    S2["Shard 2<br>Users I-P"]
    S3["Shard 3<br>Users Q-Z"]
    App --> Router
    Router --> S1
    Router --> S2
    Router --> S3
    style Router fill:#f59e0b,color:#fff
    style S1 fill:#3b82f6,color:#fff
    style S2 fill:#22c55e,color:#fff
    style S3 fill:#8b5cf6,color:#fff

Sharding Strategies

Hash-Based Sharding

shard_id = hash(key) % number_of_shards

Distributes data evenly but makes range queries difficult. Adding/removing shards requires rehashing (use consistent hashing to mitigate).

Range-Based Sharding

Data is split by ranges (e.g., A-H, I-P, Q-Z or by date). Simple and supports range queries but can create hotspots if data is not evenly distributed.

Geographic Sharding

Data is partitioned by geographic region. Users in Japan access the Tokyo shard; users in the US access the Virginia shard. Reduces latency but adds complexity for cross-region queries.

Strategy Distribution Range Queries Hotspot Risk Re-sharding
Hash Even Difficult Low Use consistent hashing
Range Depends on data Easy High Move ranges
Geographic By region Within region only Moderate Add regions

Challenges of Sharding

  1. Cross-shard queries: JOINs across shards are expensive
  2. Rebalancing: Adding shards requires data migration
  3. Hotspots: Uneven distribution can overload one shard
  4. Referential integrity: Foreign keys across shards are not supported
  5. Operational complexity: More databases to manage

Interview Tip: Only introduce sharding when a single database cannot handle the load. Sharding adds significant complexity.


ACID vs BASE

These are two different consistency models for databases.

ACID (SQL databases)

Property Description
Atomicity All operations in a transaction succeed or all fail
Consistency Data always moves from one valid state to another
Isolation Concurrent transactions do not interfere
Durability Committed data survives crashes

BASE (NoSQL databases)

Property Description
Basically Available System guarantees availability
Soft state State may change over time without input
Eventual consistency System will become consistent eventually
flowchart LR
    subgraph ACID_Model["ACID"]
        A1["Strong Consistency"]
        A2["Transactions"]
        A3["Lower Throughput"]
    end
    subgraph BASE_Model["BASE"]
        B1["Eventual Consistency"]
        B2["High Availability"]
        B3["Higher Throughput"]
    end
    style ACID_Model fill:#3b82f6,color:#fff
    style BASE_Model fill:#22c55e,color:#fff
    style A1 fill:#3b82f6,color:#fff
    style A2 fill:#3b82f6,color:#fff
    style A3 fill:#3b82f6,color:#fff
    style B1 fill:#22c55e,color:#fff
    style B2 fill:#22c55e,color:#fff
    style B3 fill:#22c55e,color:#fff

Choose ACID for financial transactions, inventory management, user authentication. Choose BASE for social media feeds, analytics, product recommendations.


Data Modeling for Interviews

When designing a database schema in an interview, follow this process:

  1. Identify entities (users, products, orders)
  2. Define relationships (one-to-one, one-to-many, many-to-many)
  3. Choose primary keys (auto-increment ID or UUID)
  4. Decide on normalization vs denormalization
  5. Plan indexes based on query patterns

Normalization vs Denormalization

Approach Pros Cons
Normalized No data redundancy, easy updates Requires JOINs, slower reads
Denormalized Fast reads, no JOINs Data redundancy, complex updates

In system design interviews, read-heavy systems often benefit from denormalization. Write the data in a format that is optimized for the most common read query.

Example: E-Commerce Schema

Users
β”œβ”€β”€ user_id (PK)
β”œβ”€β”€ name
β”œβ”€β”€ email (UNIQUE INDEX)
└── created_at

Products
β”œβ”€β”€ product_id (PK)
β”œβ”€β”€ name
β”œβ”€β”€ price
β”œβ”€β”€ category (INDEX)
└── inventory_count

Orders
β”œβ”€β”€ order_id (PK)
β”œβ”€β”€ user_id (FK β†’ Users, INDEX)
β”œβ”€β”€ total_amount
β”œβ”€β”€ status (INDEX)
└── created_at (INDEX)

Order_Items
β”œβ”€β”€ item_id (PK)
β”œβ”€β”€ order_id (FK β†’ Orders, INDEX)
β”œβ”€β”€ product_id (FK β†’ Products)
β”œβ”€β”€ quantity
└── price_at_purchase

Practice Problems

Exercise 1: Basics

For each scenario, decide whether SQL or NoSQL is more appropriate and explain why:

  1. A banking application that processes transfers between accounts
  2. A real-time analytics dashboard tracking millions of events per second
  3. A social network storing user profiles and friend relationships

Exercise 2: Applied

Design a database schema for an e-commerce platform that supports:

  • User accounts with addresses
  • Products with categories and reviews
  • Shopping carts
  • Orders with multiple items
  • Payment records

Specify: table structure, primary keys, foreign keys, indexes, and whether you would use SQL or NoSQL (or both).

Challenge

A ride-sharing application like Uber needs to handle:

  • 50 million active riders and 5 million active drivers
  • 10 million rides per day
  • Real-time location tracking (updates every 3 seconds per active driver)
  • Trip history that must be queryable

Design the database architecture. Should you shard? If so, what is the sharding key? Which data goes in SQL vs NoSQL? How do you handle the location data?


Summary

Concept Description
SQL Relational, ACID, structured schema, JOINs
NoSQL Flexible schema, horizontal scaling, high throughput
Index Data structure that speeds up reads at the cost of writes
Leader-Follower One write node, multiple read replicas
Sharding Splitting data across multiple databases
Hash Sharding Even distribution but no range queries
Range Sharding Supports range queries but risk of hotspots
ACID Strong consistency, transactions
BASE Eventual consistency, high availability

Key Takeaways

  1. SQL and NoSQL serve different purposes - choose based on your data model and access patterns
  2. Index strategically - index columns in WHERE and JOIN clauses but avoid over-indexing
  3. Replicate for reads, shard for writes - replication handles read scaling; sharding handles write scaling
  4. Shard only when necessary - it adds significant complexity
  5. Denormalize for read-heavy workloads - optimize for the most common query pattern

References


Next up: On Day 4, we explore Caching and CDN - learn cache strategies, eviction policies, Redis vs Memcached, CDN architecture, and the infamous cache invalidation problem.