Day 3: Database Design & Scaling

What You'll Learn Today

SQL vs NoSQL databases and when to use each
How database indexing works and why it matters
Replication strategies (leader-follower, multi-leader)
Sharding and partitioning (hash, range, geographic)
ACID vs BASE consistency models
Data modeling techniques for system design interviews

SQL vs NoSQL

Choosing the right database is one of the most important decisions in system design. The choice depends on your data model, query patterns, and scale requirements.

flowchart TB
    subgraph SQL["SQL (Relational)"]
        S1["Structured Data"]
        S2["ACID Transactions"]
        S3["Complex Queries (JOIN)"]
    end
    subgraph NoSQL["NoSQL (Non-Relational)"]
        N1["Flexible Schema"]
        N2["Horizontal Scaling"]
        N3["High Throughput"]
    end
    style SQL fill:#3b82f6,color:#fff
    style NoSQL fill:#8b5cf6,color:#fff
    style S1 fill:#3b82f6,color:#fff
    style S2 fill:#3b82f6,color:#fff
    style S3 fill:#3b82f6,color:#fff
    style N1 fill:#8b5cf6,color:#fff
    style N2 fill:#8b5cf6,color:#fff
    style N3 fill:#8b5cf6,color:#fff

Aspect	SQL	NoSQL
Data model	Tables with rows and columns	Documents, key-value, wide-column, graph
Schema	Fixed schema (predefined)	Dynamic schema (flexible)
Scaling	Primarily vertical	Designed for horizontal
Transactions	Strong ACID support	Limited (varies by system)
Joins	Native support	Typically no joins
Query language	SQL (standardized)	Database-specific APIs
Best for	Complex relationships, transactions	High throughput, flexible data

Types of NoSQL Databases

Type	Examples	Best For	Data Model
Key-Value	Redis, DynamoDB	Caching, sessions	Key → Value
Document	MongoDB, CouchDB	Content management, catalogs	Key → JSON document
Wide-Column	Cassandra, HBase	Time-series, IoT, analytics	Row key → Column families
Graph	Neo4j, Amazon Neptune	Social networks, recommendations	Nodes + Edges

When to Choose What

Choose SQL when:

Data has clear relationships (users, orders, products)
You need ACID transactions (banking, inventory)
You need complex queries with JOINs
Data schema is stable

Choose NoSQL when:

Schema changes frequently
You need massive horizontal scale
Data is denormalized or semi-structured
Low-latency reads/writes are critical

Interview Tip: Many real systems use both. For example, use PostgreSQL for user accounts (relational, transactional) and Cassandra for activity feeds (high write throughput, time-series).

Database Indexing

An index is a data structure that speeds up data retrieval at the cost of additional storage and slower writes.

flowchart LR
    subgraph Without["Without Index"]
        W1["Scan ALL rows<br>O(n)"]
    end
    subgraph With["With Index (B-Tree)"]
        I1["Binary search<br>O(log n)"]
    end
    style Without fill:#ef4444,color:#fff
    style With fill:#22c55e,color:#fff
    style W1 fill:#ef4444,color:#fff
    style I1 fill:#22c55e,color:#fff

How B-Tree Indexes Work

Most SQL databases use B-Tree indexes. The data is organized in a balanced tree where:

Each node contains sorted keys
Leaf nodes point to the actual data rows
Lookups, inserts, and deletes are O(log n)

Types of Indexes

Index Type	Description	Use Case
Primary	Unique, auto-created on primary key	Row identification
Secondary	Created on non-primary columns	Speeding up queries on any column
Composite	Index on multiple columns	Queries filtering on multiple fields
Unique	Enforces uniqueness	Email, username fields
Full-text	Optimized for text search	Search functionality

Indexing Trade-offs

Benefit	Cost
Faster reads (O(log n) vs O(n))	Slower writes (index must be updated)
Efficient sorting	Additional storage space
Quick lookups	More indexes = more maintenance

Rule of thumb: Index columns that appear in WHERE, JOIN, and ORDER BY clauses. Do not over-index - every index slows down writes.

Replication

Replication copies data across multiple servers for fault tolerance and read scalability.

Leader-Follower (Master-Slave) Replication

The most common replication strategy. One leader handles all writes; followers replicate the data and serve reads.

flowchart TB
    App["Application"]
    Leader["Leader (Primary)<br>Reads + Writes"]
    F1["Follower 1<br>Reads only"]
    F2["Follower 2<br>Reads only"]
    F3["Follower 3<br>Reads only"]
    App -->|"Writes"| Leader
    App -->|"Reads"| F1
    App -->|"Reads"| F2
    App -->|"Reads"| F3
    Leader -->|"Replication"| F1
    Leader -->|"Replication"| F2
    Leader -->|"Replication"| F3
    style Leader fill:#f59e0b,color:#fff
    style F1 fill:#3b82f6,color:#fff
    style F2 fill:#3b82f6,color:#fff
    style F3 fill:#3b82f6,color:#fff

Synchronous replication: Leader waits for follower confirmation before acknowledging the write. Strong consistency but higher latency.

Asynchronous replication: Leader acknowledges the write immediately. Lower latency but risk of data loss if the leader fails.

Multi-Leader Replication

Multiple nodes accept writes. Used for multi-datacenter deployments.

flowchart TB
    subgraph DC1["Datacenter 1"]
        L1["Leader 1"]
    end
    subgraph DC2["Datacenter 2"]
        L2["Leader 2"]
    end
    L1 <-->|"Bi-directional<br>replication"| L2
    style DC1 fill:#3b82f6,color:#fff
    style DC2 fill:#8b5cf6,color:#fff
    style L1 fill:#3b82f6,color:#fff
    style L2 fill:#8b5cf6,color:#fff

Strategy	Writes	Consistency	Complexity	Use Case
Leader-Follower	Single leader	Strong (sync) or eventual (async)	Low	Most applications
Multi-Leader	Multiple leaders	Eventual (conflict resolution needed)	High	Multi-datacenter
Leaderless	Any node	Eventual (quorum reads/writes)	High	Cassandra, DynamoDB

Sharding (Partitioning)

Sharding splits data across multiple database instances. Each shard holds a subset of the data.

flowchart TB
    App["Application"]
    Router["Shard Router"]
    S1["Shard 1<br>Users A-H"]
    S2["Shard 2<br>Users I-P"]
    S3["Shard 3<br>Users Q-Z"]
    App --> Router
    Router --> S1
    Router --> S2
    Router --> S3
    style Router fill:#f59e0b,color:#fff
    style S1 fill:#3b82f6,color:#fff
    style S2 fill:#22c55e,color:#fff
    style S3 fill:#8b5cf6,color:#fff

Sharding Strategies

Hash-Based Sharding

shard_id = hash(key) % number_of_shards

Distributes data evenly but makes range queries difficult. Adding/removing shards requires rehashing (use consistent hashing to mitigate).

Range-Based Sharding

Data is split by ranges (e.g., A-H, I-P, Q-Z or by date). Simple and supports range queries but can create hotspots if data is not evenly distributed.

Geographic Sharding

Data is partitioned by geographic region. Users in Japan access the Tokyo shard; users in the US access the Virginia shard. Reduces latency but adds complexity for cross-region queries.

Strategy	Distribution	Range Queries	Hotspot Risk	Re-sharding
Hash	Even	Difficult	Low	Use consistent hashing
Range	Depends on data	Easy	High	Move ranges
Geographic	By region	Within region only	Moderate	Add regions

Challenges of Sharding

Cross-shard queries: JOINs across shards are expensive
Rebalancing: Adding shards requires data migration
Hotspots: Uneven distribution can overload one shard
Referential integrity: Foreign keys across shards are not supported
Operational complexity: More databases to manage

Interview Tip: Only introduce sharding when a single database cannot handle the load. Sharding adds significant complexity.

ACID vs BASE

These are two different consistency models for databases.

ACID (SQL databases)

Property	Description
Atomicity	All operations in a transaction succeed or all fail
Consistency	Data always moves from one valid state to another
Isolation	Concurrent transactions do not interfere
Durability	Committed data survives crashes

BASE (NoSQL databases)

Property	Description
Basically Available	System guarantees availability
Soft state	State may change over time without input
Eventual consistency	System will become consistent eventually

flowchart LR
    subgraph ACID_Model["ACID"]
        A1["Strong Consistency"]
        A2["Transactions"]
        A3["Lower Throughput"]
    end
    subgraph BASE_Model["BASE"]
        B1["Eventual Consistency"]
        B2["High Availability"]
        B3["Higher Throughput"]
    end
    style ACID_Model fill:#3b82f6,color:#fff
    style BASE_Model fill:#22c55e,color:#fff
    style A1 fill:#3b82f6,color:#fff
    style A2 fill:#3b82f6,color:#fff
    style A3 fill:#3b82f6,color:#fff
    style B1 fill:#22c55e,color:#fff
    style B2 fill:#22c55e,color:#fff
    style B3 fill:#22c55e,color:#fff

Choose ACID for financial transactions, inventory management, user authentication. Choose BASE for social media feeds, analytics, product recommendations.

Data Modeling for Interviews

When designing a database schema in an interview, follow this process:

Identify entities (users, products, orders)
Define relationships (one-to-one, one-to-many, many-to-many)
Choose primary keys (auto-increment ID or UUID)
Decide on normalization vs denormalization
Plan indexes based on query patterns

Normalization vs Denormalization

Approach	Pros	Cons
Normalized	No data redundancy, easy updates	Requires JOINs, slower reads
Denormalized	Fast reads, no JOINs	Data redundancy, complex updates

In system design interviews, read-heavy systems often benefit from denormalization. Write the data in a format that is optimized for the most common read query.

Example: E-Commerce Schema

Users
├── user_id (PK)
├── name
├── email (UNIQUE INDEX)
└── created_at

Products
├── product_id (PK)
├── name
├── price
├── category (INDEX)
└── inventory_count

Orders
├── order_id (PK)
├── user_id (FK → Users, INDEX)
├── total_amount
├── status (INDEX)
└── created_at (INDEX)

Order_Items
├── item_id (PK)
├── order_id (FK → Orders, INDEX)
├── product_id (FK → Products)
├── quantity
└── price_at_purchase

Practice Problems

Exercise 1: Basics

For each scenario, decide whether SQL or NoSQL is more appropriate and explain why:

A banking application that processes transfers between accounts
A real-time analytics dashboard tracking millions of events per second
A social network storing user profiles and friend relationships

Exercise 2: Applied

Design a database schema for an e-commerce platform that supports:

User accounts with addresses
Products with categories and reviews
Shopping carts
Orders with multiple items
Payment records

Specify: table structure, primary keys, foreign keys, indexes, and whether you would use SQL or NoSQL (or both).

Challenge

A ride-sharing application like Uber needs to handle:

50 million active riders and 5 million active drivers
10 million rides per day
Real-time location tracking (updates every 3 seconds per active driver)
Trip history that must be queryable

Design the database architecture. Should you shard? If so, what is the sharding key? Which data goes in SQL vs NoSQL? How do you handle the location data?

Summary

Concept	Description
SQL	Relational, ACID, structured schema, JOINs
NoSQL	Flexible schema, horizontal scaling, high throughput
Index	Data structure that speeds up reads at the cost of writes
Leader-Follower	One write node, multiple read replicas
Sharding	Splitting data across multiple databases
Hash Sharding	Even distribution but no range queries
Range Sharding	Supports range queries but risk of hotspots
ACID	Strong consistency, transactions
BASE	Eventual consistency, high availability

Key Takeaways

SQL and NoSQL serve different purposes - choose based on your data model and access patterns
Index strategically - index columns in WHERE and JOIN clauses but avoid over-indexing
Replicate for reads, shard for writes - replication handles read scaling; sharding handles write scaling
Shard only when necessary - it adds significant complexity
Denormalize for read-heavy workloads - optimize for the most common query pattern

References

Kleppmann, Martin. Designing Data-Intensive Applications. O'Reilly Media, 2017.
Xu, Alex. System Design Interview - An Insider's Guide. Byte Code LLC, 2020.
Use The Index, Luke
MongoDB vs PostgreSQL Comparison

Next up: On Day 4, we explore Caching and CDN - learn cache strategies, eviction policies, Redis vs Memcached, CDN architecture, and the infamous cache invalidation problem.