Day 3: Database Design & Scaling
What You'll Learn Today
- SQL vs NoSQL databases and when to use each
- How database indexing works and why it matters
- Replication strategies (leader-follower, multi-leader)
- Sharding and partitioning (hash, range, geographic)
- ACID vs BASE consistency models
- Data modeling techniques for system design interviews
SQL vs NoSQL
Choosing the right database is one of the most important decisions in system design. The choice depends on your data model, query patterns, and scale requirements.
flowchart TB
subgraph SQL["SQL (Relational)"]
S1["Structured Data"]
S2["ACID Transactions"]
S3["Complex Queries (JOIN)"]
end
subgraph NoSQL["NoSQL (Non-Relational)"]
N1["Flexible Schema"]
N2["Horizontal Scaling"]
N3["High Throughput"]
end
style SQL fill:#3b82f6,color:#fff
style NoSQL fill:#8b5cf6,color:#fff
style S1 fill:#3b82f6,color:#fff
style S2 fill:#3b82f6,color:#fff
style S3 fill:#3b82f6,color:#fff
style N1 fill:#8b5cf6,color:#fff
style N2 fill:#8b5cf6,color:#fff
style N3 fill:#8b5cf6,color:#fff
| Aspect | SQL | NoSQL |
|---|---|---|
| Data model | Tables with rows and columns | Documents, key-value, wide-column, graph |
| Schema | Fixed schema (predefined) | Dynamic schema (flexible) |
| Scaling | Primarily vertical | Designed for horizontal |
| Transactions | Strong ACID support | Limited (varies by system) |
| Joins | Native support | Typically no joins |
| Query language | SQL (standardized) | Database-specific APIs |
| Best for | Complex relationships, transactions | High throughput, flexible data |
Types of NoSQL Databases
| Type | Examples | Best For | Data Model |
|---|---|---|---|
| Key-Value | Redis, DynamoDB | Caching, sessions | Key β Value |
| Document | MongoDB, CouchDB | Content management, catalogs | Key β JSON document |
| Wide-Column | Cassandra, HBase | Time-series, IoT, analytics | Row key β Column families |
| Graph | Neo4j, Amazon Neptune | Social networks, recommendations | Nodes + Edges |
When to Choose What
Choose SQL when:
- Data has clear relationships (users, orders, products)
- You need ACID transactions (banking, inventory)
- You need complex queries with JOINs
- Data schema is stable
Choose NoSQL when:
- Schema changes frequently
- You need massive horizontal scale
- Data is denormalized or semi-structured
- Low-latency reads/writes are critical
Interview Tip: Many real systems use both. For example, use PostgreSQL for user accounts (relational, transactional) and Cassandra for activity feeds (high write throughput, time-series).
Database Indexing
An index is a data structure that speeds up data retrieval at the cost of additional storage and slower writes.
flowchart LR
subgraph Without["Without Index"]
W1["Scan ALL rows<br>O(n)"]
end
subgraph With["With Index (B-Tree)"]
I1["Binary search<br>O(log n)"]
end
style Without fill:#ef4444,color:#fff
style With fill:#22c55e,color:#fff
style W1 fill:#ef4444,color:#fff
style I1 fill:#22c55e,color:#fff
How B-Tree Indexes Work
Most SQL databases use B-Tree indexes. The data is organized in a balanced tree where:
- Each node contains sorted keys
- Leaf nodes point to the actual data rows
- Lookups, inserts, and deletes are O(log n)
Types of Indexes
| Index Type | Description | Use Case |
|---|---|---|
| Primary | Unique, auto-created on primary key | Row identification |
| Secondary | Created on non-primary columns | Speeding up queries on any column |
| Composite | Index on multiple columns | Queries filtering on multiple fields |
| Unique | Enforces uniqueness | Email, username fields |
| Full-text | Optimized for text search | Search functionality |
Indexing Trade-offs
| Benefit | Cost |
|---|---|
| Faster reads (O(log n) vs O(n)) | Slower writes (index must be updated) |
| Efficient sorting | Additional storage space |
| Quick lookups | More indexes = more maintenance |
Rule of thumb: Index columns that appear in WHERE, JOIN, and ORDER BY clauses. Do not over-index - every index slows down writes.
Replication
Replication copies data across multiple servers for fault tolerance and read scalability.
Leader-Follower (Master-Slave) Replication
The most common replication strategy. One leader handles all writes; followers replicate the data and serve reads.
flowchart TB
App["Application"]
Leader["Leader (Primary)<br>Reads + Writes"]
F1["Follower 1<br>Reads only"]
F2["Follower 2<br>Reads only"]
F3["Follower 3<br>Reads only"]
App -->|"Writes"| Leader
App -->|"Reads"| F1
App -->|"Reads"| F2
App -->|"Reads"| F3
Leader -->|"Replication"| F1
Leader -->|"Replication"| F2
Leader -->|"Replication"| F3
style Leader fill:#f59e0b,color:#fff
style F1 fill:#3b82f6,color:#fff
style F2 fill:#3b82f6,color:#fff
style F3 fill:#3b82f6,color:#fff
Synchronous replication: Leader waits for follower confirmation before acknowledging the write. Strong consistency but higher latency.
Asynchronous replication: Leader acknowledges the write immediately. Lower latency but risk of data loss if the leader fails.
Multi-Leader Replication
Multiple nodes accept writes. Used for multi-datacenter deployments.
flowchart TB
subgraph DC1["Datacenter 1"]
L1["Leader 1"]
end
subgraph DC2["Datacenter 2"]
L2["Leader 2"]
end
L1 <-->|"Bi-directional<br>replication"| L2
style DC1 fill:#3b82f6,color:#fff
style DC2 fill:#8b5cf6,color:#fff
style L1 fill:#3b82f6,color:#fff
style L2 fill:#8b5cf6,color:#fff
| Strategy | Writes | Consistency | Complexity | Use Case |
|---|---|---|---|---|
| Leader-Follower | Single leader | Strong (sync) or eventual (async) | Low | Most applications |
| Multi-Leader | Multiple leaders | Eventual (conflict resolution needed) | High | Multi-datacenter |
| Leaderless | Any node | Eventual (quorum reads/writes) | High | Cassandra, DynamoDB |
Sharding (Partitioning)
Sharding splits data across multiple database instances. Each shard holds a subset of the data.
flowchart TB
App["Application"]
Router["Shard Router"]
S1["Shard 1<br>Users A-H"]
S2["Shard 2<br>Users I-P"]
S3["Shard 3<br>Users Q-Z"]
App --> Router
Router --> S1
Router --> S2
Router --> S3
style Router fill:#f59e0b,color:#fff
style S1 fill:#3b82f6,color:#fff
style S2 fill:#22c55e,color:#fff
style S3 fill:#8b5cf6,color:#fff
Sharding Strategies
Hash-Based Sharding
shard_id = hash(key) % number_of_shards
Distributes data evenly but makes range queries difficult. Adding/removing shards requires rehashing (use consistent hashing to mitigate).
Range-Based Sharding
Data is split by ranges (e.g., A-H, I-P, Q-Z or by date). Simple and supports range queries but can create hotspots if data is not evenly distributed.
Geographic Sharding
Data is partitioned by geographic region. Users in Japan access the Tokyo shard; users in the US access the Virginia shard. Reduces latency but adds complexity for cross-region queries.
| Strategy | Distribution | Range Queries | Hotspot Risk | Re-sharding |
|---|---|---|---|---|
| Hash | Even | Difficult | Low | Use consistent hashing |
| Range | Depends on data | Easy | High | Move ranges |
| Geographic | By region | Within region only | Moderate | Add regions |
Challenges of Sharding
- Cross-shard queries: JOINs across shards are expensive
- Rebalancing: Adding shards requires data migration
- Hotspots: Uneven distribution can overload one shard
- Referential integrity: Foreign keys across shards are not supported
- Operational complexity: More databases to manage
Interview Tip: Only introduce sharding when a single database cannot handle the load. Sharding adds significant complexity.
ACID vs BASE
These are two different consistency models for databases.
ACID (SQL databases)
| Property | Description |
|---|---|
| Atomicity | All operations in a transaction succeed or all fail |
| Consistency | Data always moves from one valid state to another |
| Isolation | Concurrent transactions do not interfere |
| Durability | Committed data survives crashes |
BASE (NoSQL databases)
| Property | Description |
|---|---|
| Basically Available | System guarantees availability |
| Soft state | State may change over time without input |
| Eventual consistency | System will become consistent eventually |
flowchart LR
subgraph ACID_Model["ACID"]
A1["Strong Consistency"]
A2["Transactions"]
A3["Lower Throughput"]
end
subgraph BASE_Model["BASE"]
B1["Eventual Consistency"]
B2["High Availability"]
B3["Higher Throughput"]
end
style ACID_Model fill:#3b82f6,color:#fff
style BASE_Model fill:#22c55e,color:#fff
style A1 fill:#3b82f6,color:#fff
style A2 fill:#3b82f6,color:#fff
style A3 fill:#3b82f6,color:#fff
style B1 fill:#22c55e,color:#fff
style B2 fill:#22c55e,color:#fff
style B3 fill:#22c55e,color:#fff
Choose ACID for financial transactions, inventory management, user authentication. Choose BASE for social media feeds, analytics, product recommendations.
Data Modeling for Interviews
When designing a database schema in an interview, follow this process:
- Identify entities (users, products, orders)
- Define relationships (one-to-one, one-to-many, many-to-many)
- Choose primary keys (auto-increment ID or UUID)
- Decide on normalization vs denormalization
- Plan indexes based on query patterns
Normalization vs Denormalization
| Approach | Pros | Cons |
|---|---|---|
| Normalized | No data redundancy, easy updates | Requires JOINs, slower reads |
| Denormalized | Fast reads, no JOINs | Data redundancy, complex updates |
In system design interviews, read-heavy systems often benefit from denormalization. Write the data in a format that is optimized for the most common read query.
Example: E-Commerce Schema
Users
βββ user_id (PK)
βββ name
βββ email (UNIQUE INDEX)
βββ created_at
Products
βββ product_id (PK)
βββ name
βββ price
βββ category (INDEX)
βββ inventory_count
Orders
βββ order_id (PK)
βββ user_id (FK β Users, INDEX)
βββ total_amount
βββ status (INDEX)
βββ created_at (INDEX)
Order_Items
βββ item_id (PK)
βββ order_id (FK β Orders, INDEX)
βββ product_id (FK β Products)
βββ quantity
βββ price_at_purchase
Practice Problems
Exercise 1: Basics
For each scenario, decide whether SQL or NoSQL is more appropriate and explain why:
- A banking application that processes transfers between accounts
- A real-time analytics dashboard tracking millions of events per second
- A social network storing user profiles and friend relationships
Exercise 2: Applied
Design a database schema for an e-commerce platform that supports:
- User accounts with addresses
- Products with categories and reviews
- Shopping carts
- Orders with multiple items
- Payment records
Specify: table structure, primary keys, foreign keys, indexes, and whether you would use SQL or NoSQL (or both).
Challenge
A ride-sharing application like Uber needs to handle:
- 50 million active riders and 5 million active drivers
- 10 million rides per day
- Real-time location tracking (updates every 3 seconds per active driver)
- Trip history that must be queryable
Design the database architecture. Should you shard? If so, what is the sharding key? Which data goes in SQL vs NoSQL? How do you handle the location data?
Summary
| Concept | Description |
|---|---|
| SQL | Relational, ACID, structured schema, JOINs |
| NoSQL | Flexible schema, horizontal scaling, high throughput |
| Index | Data structure that speeds up reads at the cost of writes |
| Leader-Follower | One write node, multiple read replicas |
| Sharding | Splitting data across multiple databases |
| Hash Sharding | Even distribution but no range queries |
| Range Sharding | Supports range queries but risk of hotspots |
| ACID | Strong consistency, transactions |
| BASE | Eventual consistency, high availability |
Key Takeaways
- SQL and NoSQL serve different purposes - choose based on your data model and access patterns
- Index strategically - index columns in WHERE and JOIN clauses but avoid over-indexing
- Replicate for reads, shard for writes - replication handles read scaling; sharding handles write scaling
- Shard only when necessary - it adds significant complexity
- Denormalize for read-heavy workloads - optimize for the most common query pattern
References
- Kleppmann, Martin. Designing Data-Intensive Applications. O'Reilly Media, 2017.
- Xu, Alex. System Design Interview - An Insider's Guide. Byte Code LLC, 2020.
- Use The Index, Luke
- MongoDB vs PostgreSQL Comparison
Next up: On Day 4, we explore Caching and CDN - learn cache strategies, eviction policies, Redis vs Memcached, CDN architecture, and the infamous cache invalidation problem.