Learn System Design in 10 Days, Chapter 8

Day 8: Design Social Media Feed & Chat

What You'll Learn Today

  • News Feed: push model vs pull model vs hybrid approach
  • Feed ranking and timeline generation
  • Celebrity/hotkey problem and solutions
  • Media storage for social platforms
  • Chat system with WebSocket connections
  • Message delivery guarantees
  • Online presence tracking
  • Group chat design considerations

Part 1: News Feed System

The Core Problem

When a user opens their feed, they need to see recent posts from people they follow, ranked by relevance, all in under 200 ms.


Fan-Out Strategies

flowchart TB
    subgraph Push["Push Model (Fan-Out on Write)"]
        direction TB
        P1["User creates post"]
        P2["Write to all followers' feeds"]
        P3["Followers read pre-built feed"]
        P1 --> P2 --> P3
    end
    subgraph Pull["Pull Model (Fan-Out on Read)"]
        direction TB
        R1["User opens feed"]
        R2["Query all followed users' posts"]
        R3["Merge and rank on the fly"]
        R1 --> R2 --> R3
    end
    subgraph Hybrid["Hybrid Model"]
        direction TB
        H1["Regular users → Push"]
        H2["Celebrities → Pull"]
        H3["Merge at read time"]
        H1 --> H3
        H2 --> H3
    end
    style Push fill:#3b82f6,color:#fff
    style Pull fill:#f59e0b,color:#fff
    style Hybrid fill:#22c55e,color:#fff

| Approach | How It Works | Pros | Cons |
| --- | --- | --- | --- |
| Push (Fan-out on write) | When user posts, write to every follower's feed cache | Fast reads, pre-computed | Slow writes for popular users, wasted work for inactive followers |
| Pull (Fan-out on read) | When user opens feed, fetch from all followed users | No wasted computation | Slow reads, heavy at read time |
| Hybrid | Push for normal users, pull for celebrities | Balances both | More complex |

The Celebrity/Hotkey Problem

A user with 10 million followers posting once triggers 10 million writes in the push model. This is the "hotkey" or "celebrity" problem.

Solution: Use hybrid fan-out.

  • Users with fewer than N followers (e.g., 10,000): push model
  • Users with N or more followers: pull model at read time
  • Merge both sources when the reader loads their feed
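
The three steps above can be sketched in a few lines of Python. This is a minimal in-memory illustration, not a production design: the `HybridFeed` class, its method names, and the dictionaries standing in for the post store and the Redis feed cache are all hypothetical, and the sketch assumes an author with exactly N followers is treated as a celebrity.

```python
from collections import defaultdict

CELEBRITY_THRESHOLD = 10_000  # the "N" from the text; an example value


class HybridFeed:
    """In-memory stand-in for the post store and the per-reader feed cache."""

    def __init__(self, threshold=CELEBRITY_THRESHOLD):
        self.threshold = threshold
        self.followers = defaultdict(set)         # author -> reader ids
        self.following = defaultdict(set)         # reader -> author ids
        self.posts_by_author = defaultdict(list)  # author -> [(ts, post_id)]
        self.feed_cache = defaultdict(list)       # reader -> [(ts, post_id)]

    def follow(self, reader, author):
        self.followers[author].add(reader)
        self.following[reader].add(author)

    def is_celebrity(self, author):
        # Assumption: an author with exactly `threshold` followers is pulled.
        return len(self.followers[author]) >= self.threshold

    def publish(self, author, post_id, ts):
        """Store the post and return how many feed-cache writes it caused."""
        self.posts_by_author[author].append((ts, post_id))
        if self.is_celebrity(author):
            return 0  # pull path: readers fetch this post at read time
        for reader in self.followers[author]:
            self.feed_cache[reader].append((ts, post_id))
        return len(self.followers[author])

    def read_feed(self, reader, limit=20):
        """Merge the pre-built feed with celebrity posts, newest first."""
        # A set removes duplicates if an author became a celebrity after
        # some of their posts had already been pushed.
        entries = set(self.feed_cache[reader])
        for author in self.following[reader]:
            if self.is_celebrity(author):
                entries.update(self.posts_by_author[author])
        newest_first = sorted(entries, reverse=True)
        return [post_id for _, post_id in newest_first[:limit]]
```

With a low threshold for demonstration, a post from a regular author costs one write per follower, while a celebrity post costs zero writes and is merged into each reader's feed only when that reader calls `read_feed`.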

Feed Architecture

flowchart TB
    U["User Posts"]
    PS["Post Service"]
    MQ["Message Queue"]
    FW["Fan-Out Workers"]
    FC[("Feed Cache (Redis)")]
    FS["Feed Service"]
    RS["Ranking Service"]
    R["Reader"]

    U --> PS --> MQ --> FW --> FC
    R --> FS
    FS --> FC
    FS --> RS
    RS --> R

    style PS fill:#3b82f6,color:#fff
    style MQ fill:#f59e0b,color:#fff
    style FC fill:#ef4444,color:#fff
    style RS fill:#8b5cf6,color:#fff

Feed Ranking

Modern feeds are not purely chronological. They use ranking signals:

| Signal | Weight | Description |
| --- | --- | --- |
| Recency | High | Newer posts rank higher |
| Engagement | High | Posts with many likes/comments |
| Relationship | Medium | Posts from close friends |
| Content type | Medium | User's preferred content type |
| Creator quality | Low | Verified or high-quality creators |

A simple scoring formula:

score = w1 * recency + w2 * engagement + w3 * affinity + w4 * content_type

In production, this is typically a machine learning model trained on user interactions.
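
As a rough illustration of the hand-tuned formula (not of an ML ranker), the sketch below scores posts with a weighted sum. The weight values, the 6-hour half-life for recency, the log-scaled engagement feature, and the `Post` fields are all assumptions made for this example; the text only specifies the shape of the formula.

```python
import math
from dataclasses import dataclass


@dataclass
class Post:
    post_id: str
    age_hours: float
    likes: int
    comments: int
    affinity: float       # 0..1, closeness between reader and author
    content_match: float  # 0..1, fit with the reader's preferred content type


# Hypothetical weights: recency and engagement "High", the others "Medium".
WEIGHTS = {"recency": 0.35, "engagement": 0.35, "affinity": 0.15, "content_type": 0.15}
HALF_LIFE_HOURS = 6.0  # assumed: the recency feature halves every 6 hours


def recency(post):
    return 0.5 ** (post.age_hours / HALF_LIFE_HOURS)


def engagement(post):
    # Log scaling keeps viral posts from dominating; capped at 1.0.
    raw = post.likes + 2 * post.comments
    return min(1.0, math.log1p(raw) / math.log1p(10_000))


def score(post):
    return (WEIGHTS["recency"] * recency(post)
            + WEIGHTS["engagement"] * engagement(post)
            + WEIGHTS["affinity"] * post.affinity
            + WEIGHTS["content_type"] * post.content_match)


def rank(posts):
    """Return posts ordered from highest to lowest score."""
    return sorted(posts, key=score, reverse=True)
```

Normalizing every feature to the range 0 to 1 before weighting keeps the weights comparable, which is one simple way to make the "High/Medium/Low" labels in the table meaningful.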


Media Storage

flowchart LR
    U["Upload"]
    IS["Image Service"]
    VS["Video Service"]
    S3["Object Storage (S3)"]
    CDN["CDN"]
    R["Reader"]

    U --> IS & VS
    IS -->|"Resize, compress"| S3
    VS -->|"Transcode"| S3
    S3 --> CDN --> R

    style S3 fill:#8b5cf6,color:#fff
    style CDN fill:#22c55e,color:#fff
  • Images: Store multiple resolutions (thumbnail, medium, full)
  • Videos: Transcode to multiple bitrates
  • Delivery: Always serve through CDN
  • Storage: Object storage (S3) is ideal for media blobs
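
To make the "multiple resolutions" idea concrete, here is a small sketch that plans the renditions for one uploaded image. The rendition widths, the `images/<media_id>/<name>.jpg` key layout, and the function name are hypothetical; the sketch only computes sizes and object keys and does not resize or upload anything.

```python
from dataclasses import dataclass

# Hypothetical rendition names and maximum widths in pixels.
IMAGE_RENDITIONS = {"thumbnail": 150, "medium": 600, "full": 2048}


@dataclass(frozen=True)
class Rendition:
    name: str
    width: int
    height: int
    object_key: str  # where the resized file would live in object storage


def plan_renditions(media_id, src_width, src_height):
    """Compute target size and storage key for each rendition of an image."""
    plans = []
    for name, max_width in IMAGE_RENDITIONS.items():
        width = min(src_width, max_width)  # never upscale a small source
        height = max(1, round(src_height * width / src_width))  # keep aspect ratio
        plans.append(Rendition(name, width, height, f"images/{media_id}/{name}.jpg"))
    return plans
```

Because each rendition has a stable key, a CDN in front of the bucket can cache every size independently.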

Part 2: Chat / Messaging System

Requirements

  • 1:1 messaging and group chats
  • Online/offline status
  • Read receipts
  • Message history
  • Real-time delivery

WebSocket Connections

sequenceDiagram
    participant A as User A
    participant WS as WebSocket Server
    participant B as User B

    A->>WS: Connect (WebSocket handshake)
    B->>WS: Connect (WebSocket handshake)
    Note over WS: Both connections maintained

    A->>WS: Send message to B
    WS->>B: Push message to B (real-time)
    B->>WS: Ack (delivered)
    WS->>A: Delivery receipt

Why WebSocket?

| Protocol | Direction | Latency | Use Case |
| --- | --- | --- | --- |
| HTTP Polling | Client → Server | High (interval) | Legacy fallback |
| Long Polling | Client → Server | Medium | Moderate real-time |
| Server-Sent Events | Server → Client | Low | One-way updates |
| WebSocket | Bidirectional | Very Low | Chat, gaming |

Chat Architecture

flowchart TB
    UA["User A"] & UB["User B"]
    WS1["WebSocket Server 1"]
    WS2["WebSocket Server 2"]
    MQ["Message Queue"]
    MS["Message Service"]
    DB[("Message Store")]
    PS["Presence Service"]
    CA["Redis (Sessions)"]

    UA -->|"WebSocket"| WS1
    UB -->|"WebSocket"| WS2
    WS1 --> MQ
    MQ --> WS2
    MQ --> MS --> DB
    WS1 & WS2 --> PS --> CA

    style MQ fill:#f59e0b,color:#fff
    style DB fill:#8b5cf6,color:#fff
    style PS fill:#22c55e,color:#fff
    style CA fill:#ef4444,color:#fff

Message Delivery Guarantees

Messages can be lost at several points. Here's how to handle each:

  1. Sender → Server: Client retries on timeout with an idempotency key, so the server can discard duplicate sends
  2. Server → Recipient (online): Push via WebSocket; the recipient sends an ACK
  3. Server → Recipient (offline): Store in DB, deliver when the recipient reconnects (push notification as an alert)
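
The three cases above can be combined in one small sketch. It is an in-memory model under stated assumptions: the `ChatServer` class and its methods are hypothetical, a Python list stands in for each WebSocket connection, and dictionaries stand in for the database. Messages stay in an unacknowledged queue until the recipient ACKs them, so they are redelivered on reconnect.

```python
from collections import defaultdict


class ChatServer:
    def __init__(self):
        self.seen_keys = {}               # idempotency key -> message id
        self.messages = {}                # message id -> body (the "DB")
        self.unacked = defaultdict(list)  # recipient -> message ids, in order
        self.sockets = {}                 # online user -> list acting as a socket
        self.next_id = 1

    def send(self, idempotency_key, recipient, body):
        """Case 1: a retried send with the same key is stored only once."""
        if idempotency_key in self.seen_keys:
            return self.seen_keys[idempotency_key]
        message_id = self.next_id
        self.next_id += 1
        self.seen_keys[idempotency_key] = message_id
        self.messages[message_id] = body
        self.unacked[recipient].append(message_id)  # kept until ACKed
        if recipient in self.sockets:               # case 2: recipient online
            self._push(recipient, message_id)
        return message_id

    def _push(self, recipient, message_id):
        self.sockets[recipient].append((message_id, self.messages[message_id]))

    def ack(self, recipient, message_id):
        """The recipient confirms delivery, so the server stops redelivering."""
        if message_id in self.unacked[recipient]:
            self.unacked[recipient].remove(message_id)

    def connect(self, user):
        """Case 3: on (re)connect, replay everything not yet ACKed."""
        self.sockets[user] = []
        for message_id in list(self.unacked[user]):
            self._push(user, message_id)
        return self.sockets[user]

    def disconnect(self, user):
        self.sockets.pop(user, None)
```

This gives at-least-once delivery: a message may be pushed twice if the ACK is lost, so a real client would also deduplicate by message id.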

Message Storage

| Approach | Best For | Technology |
| --- | --- | --- |
| Wide-column store | 1:1 messages, time-ordered | Cassandra, HBase |
| Document store | Group messages, flexible schema | MongoDB |
| Relational | Small scale, complex queries | PostgreSQL |

Partition key: conversation_id
Sort key: message_timestamp

This allows efficient range queries: "Get messages in conversation X between time A and B."
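
The effect of that key choice can be shown without a database. In the sketch below, a dictionary keyed by `conversation_id` plays the role of the partition, and a list kept sorted by timestamp plays the role of the sort key. The `MessageStore` class and its method names are hypothetical, and the half-open interval (start inclusive, end exclusive) is an assumption.

```python
import bisect
from collections import defaultdict


class MessageStore:
    def __init__(self):
        # conversation_id -> rows sorted by (timestamp, message_id, body)
        self.partitions = defaultdict(list)

    def append(self, conversation_id, timestamp, message_id, body):
        # insort keeps each partition ordered by its sort key.
        bisect.insort(self.partitions[conversation_id], (timestamp, message_id, body))

    def read_range(self, conversation_id, start_ts, end_ts):
        """Return rows with start_ts <= timestamp < end_ts for one conversation."""
        rows = self.partitions.get(conversation_id, [])
        # A 1-tuple sorts before any row with the same timestamp, so both
        # lookups land on the first row at or after the given time.
        lo = bisect.bisect_left(rows, (start_ts,))
        hi = bisect.bisect_left(rows, (end_ts,))
        return rows[lo:hi]
```

The query touches only one partition and uses binary search inside it, which mirrors why a wide-column store can answer the range query efficiently.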


Online Presence

flowchart TB
    U["User A"]
    WS["WebSocket Server"]
    PS["Presence Service"]
    RD["Redis"]
    SUB["Subscribers (Friends)"]

    U -->|"Heartbeat every 30s"| WS
    WS --> PS
    PS -->|"Update last_seen"| RD
    PS -->|"Notify status change"| SUB

    style PS fill:#22c55e,color:#fff
    style RD fill:#ef4444,color:#fff

  • Online: Heartbeat received within threshold (e.g., the last 60 seconds, so a single delayed 30-second heartbeat does not flip the status)
  • Away: No heartbeat for 1-5 minutes
  • Offline: No heartbeat for more than 5 minutes, or an explicit disconnect
  • Store in Redis: {user_id: last_heartbeat_timestamp}

For users with many friends, don't push presence updates to all friends in real-time. Instead, query presence lazily when a user opens a chat or friend list.
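
A minimal sketch of heartbeat-based presence with lazy lookup follows. A plain dictionary stands in for Redis, the `PresenceTracker` class is hypothetical, and both windows are assumptions: 60 seconds for "online" (chosen to be longer than the 30-second heartbeat interval) and 5 minutes for "away".

```python
ONLINE_WINDOW_S = 60   # assumed: tolerates one delayed 30-second heartbeat
AWAY_WINDOW_S = 300    # assumed: 5 minutes without a heartbeat means offline


class PresenceTracker:
    def __init__(self):
        self.last_heartbeat = {}  # stands in for Redis {user_id: timestamp}

    def heartbeat(self, user_id, now):
        self.last_heartbeat[user_id] = now

    def disconnect(self, user_id):
        # An explicit disconnect makes the user offline immediately.
        self.last_heartbeat.pop(user_id, None)

    def status(self, user_id, now):
        last = self.last_heartbeat.get(user_id)
        if last is None:
            return "offline"
        idle = now - last
        if idle <= ONLINE_WINDOW_S:
            return "online"
        if idle <= AWAY_WINDOW_S:
            return "away"
        return "offline"

    def statuses(self, user_ids, now):
        """Lazy lookup: called when a chat or friend list is opened."""
        return {user_id: self.status(user_id, now) for user_id in user_ids}
```

Status is derived from the stored timestamp at read time, so no background job is needed to mark users away or offline.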


Group Chat Design

| Aspect | Small Groups (<100) | Large Groups (100+) |
| --- | --- | --- |
| Delivery | Push to all members | Push to online members only |
| Storage | Store per-group | Store per-group |
| Read status | Track per-member | Simplified (last read pointer) |
| Notifications | Notify all | Mention-based notifications |

Key considerations:

  • Message ordering: Use server-assigned timestamps or logical clocks
  • Fan-out: For small groups, push to all members; for large groups, push to online members only and let the rest pull on demand
  • Admin controls: Roles, permissions, mute, remove members
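
The size-dependent delivery rule and the last-read pointer can be sketched as two small functions. The function names are hypothetical, the cutoff of 100 comes from the table above, and the unread count assumes the server assigns each message in a group a consecutive sequence number.

```python
LARGE_GROUP_SIZE = 100  # cutoff between "small" and "large" groups


def plan_delivery(members, online, sender):
    """Split recipients into those pushed to now and those who pull later."""
    recipients = [m for m in members if m != sender]
    if len(members) < LARGE_GROUP_SIZE:
        # Small group: push to everyone; offline members get it on reconnect.
        return recipients, []
    push = [m for m in recipients if m in online]
    pull = [m for m in recipients if m not in online]
    return push, pull


def unread_count(latest_seq, last_read_seq):
    """Last-read pointer: one number per member instead of per-message state."""
    return max(0, latest_seq - last_read_seq)
```

Storing a single pointer per member keeps read tracking cheap in large groups, at the cost of not knowing exactly which individual messages were seen.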

Summary

| Concept | Description |
| --- | --- |
| Push model | Pre-compute feeds on write; fast reads |
| Pull model | Compute feeds on read; no wasted writes |
| Hybrid model | Push for normal users, pull for celebrities |
| Feed ranking | ML model with recency, engagement, affinity signals |
| WebSocket | Bidirectional real-time communication |
| Message delivery | Retry + ACK + offline storage |
| Presence | Heartbeat-based with Redis |
| Group chat | Fan-out strategy varies by group size |

Key Takeaways

  1. The hybrid fan-out approach is the common choice for large social platforms, and it's the answer interviewers expect
  2. The celebrity/hotkey problem is a common follow-up question, so always address it proactively
  3. Chat systems need WebSocket for real-time delivery and a message queue for reliability
  4. Design for offline users: store-and-forward is essential for messaging

Practice Problems

Problem 1: Basic

Design a simple notification system that supports multiple channels (push notification, email, SMS). Define the data model and API.

Problem 2: Intermediate

You need to show "typing..." indicators in a chat app. Design the mechanism: how do you propagate typing status in real-time without overwhelming the server? Consider both 1:1 and group chats.

Challenge

Design a social media platform's "Stories" feature (posts that disappear after 24 hours). Consider: storage, feed generation, view tracking, and efficient expiration. How does this differ from the main news feed design?


Next up: In Day 9, we'll design Video Streaming (like YouTube) and Distributed File Storage (like Google Drive).