Learn System Design in 10 Days

Day 5: Message Queues & Async Processing

What You'll Learn Today

  • Synchronous vs asynchronous communication and when to use each
  • Message queue core concepts (Producer, Consumer, Topic, Partition)
  • Kafka vs RabbitMQ vs SQS comparison
  • Event-driven architecture patterns
  • Pub/Sub pattern and its applications
  • Delivery semantics: exactly-once, at-least-once, at-most-once

Synchronous vs Asynchronous Communication

In a synchronous system, the caller waits for a response before proceeding. In an asynchronous system, the caller sends a message and continues without waiting.

flowchart TB
    subgraph Sync["Synchronous"]
        SA["Service A"]
        SB["Service B"]
        SA -->|"1. Request"| SB
        SB -->|"2. Wait..."| SB
        SB -->|"3. Response"| SA
    end
    subgraph Async["Asynchronous"]
        AA["Service A"]
        Q["Message Queue"]
        AB["Service B"]
        AA -->|"1. Send message"| Q
        Q -->|"2. Process later"| AB
    end
    style Sync fill:#ef4444,color:#fff
    style Async fill:#22c55e,color:#fff
    style SA fill:#ef4444,color:#fff
    style SB fill:#ef4444,color:#fff
    style AA fill:#22c55e,color:#fff
    style Q fill:#f59e0b,color:#fff
    style AB fill:#22c55e,color:#fff

| Aspect | Synchronous | Asynchronous |
| --- | --- | --- |
| Coupling | Tight (caller depends on receiver) | Loose (decoupled by queue) |
| Latency | Caller waits for full processing | Caller returns immediately |
| Failure handling | Caller fails if receiver is down | Message is queued; retried later |
| Scalability | Limited by slowest service | Services scale independently |
| Complexity | Simple to implement | Requires message infrastructure |
| Debugging | Easy (linear flow) | Harder (distributed, async) |
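
To make the contrast concrete, here is a minimal Python sketch (the handler names, the in-process queue, and send_welcome_email are illustrative stand-ins, not a specific framework): the synchronous handler blocks on slow work, while the asynchronous handler enqueues it for a background worker.

```python
import queue
import threading
import time

task_queue = queue.Queue()

def send_welcome_email(email: str) -> None:
    time.sleep(2)  # stand-in for a slow SMTP/API call

# Synchronous: the caller blocks until the slow work finishes.
def handle_signup_sync(email: str) -> str:
    send_welcome_email(email)              # user waits the full two seconds
    return "201 Created"

# Asynchronous: the caller enqueues the work and returns immediately.
def handle_signup_async(email: str) -> str:
    task_queue.put({"type": "welcome_email", "email": email})
    return "202 Accepted"                  # email is sent later by the worker

def worker() -> None:
    while True:
        task = task_queue.get()
        send_welcome_email(task["email"])  # processed outside the request path
        task_queue.task_done()

threading.Thread(target=worker, daemon=True).start()
```

A real system would replace the in-process queue.Queue with a durable broker so queued tasks survive process restarts.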

When to Use Async

  • Long-running tasks: Video transcoding, report generation, email sending
  • Spiky workloads: Flash sales, viral content
  • Decoupling services: Order placement triggers inventory, payment, notification independently
  • Reliability: Tasks that must not be lost even if a service is temporarily down

When Sync Is Fine

  • Low-latency user-facing requests: Login, search, page loads
  • Simple request-response: REST APIs where the client needs an immediate answer
  • Strong consistency required: Payment verification before confirming an order

Message Queue Concepts

A message queue is a buffer that sits between producers and consumers, enabling asynchronous communication.

flowchart LR
    P1["Producer 1"]
    P2["Producer 2"]
    Q["Message Queue"]
    C1["Consumer 1"]
    C2["Consumer 2"]
    C3["Consumer 3"]
    P1 --> Q
    P2 --> Q
    Q --> C1
    Q --> C2
    Q --> C3
    style Q fill:#f59e0b,color:#fff
    style P1 fill:#3b82f6,color:#fff
    style P2 fill:#3b82f6,color:#fff
    style C1 fill:#22c55e,color:#fff
    style C2 fill:#22c55e,color:#fff
    style C3 fill:#22c55e,color:#fff

Core Terminology

| Term | Definition |
| --- | --- |
| Producer | Sends messages to the queue |
| Consumer | Reads and processes messages from the queue |
| Queue/Topic | Named destination where messages are stored |
| Partition | Subdivision of a topic for parallel processing |
| Offset | Position of a message within a partition |
| Consumer Group | Set of consumers that share the workload of a topic |
| Broker | Server that stores and delivers messages |
| Dead Letter Queue | Queue for messages that repeatedly fail processing |

Topics and Partitions

A topic is a logical channel. Partitions allow a topic to be split across multiple brokers for parallel processing.

flowchart TB
    subgraph Topic["Topic: orders"]
        P0["Partition 0<br>msg1, msg4, msg7"]
        P1["Partition 1<br>msg2, msg5, msg8"]
        P2["Partition 2<br>msg3, msg6, msg9"]
    end
    subgraph CG["Consumer Group"]
        C0["Consumer 0 β†’ P0"]
        C1["Consumer 1 β†’ P1"]
        C2["Consumer 2 β†’ P2"]
    end
    P0 --> C0
    P1 --> C1
    P2 --> C2
    style Topic fill:#8b5cf6,color:#fff
    style P0 fill:#8b5cf6,color:#fff
    style P1 fill:#8b5cf6,color:#fff
    style P2 fill:#8b5cf6,color:#fff
    style CG fill:#22c55e,color:#fff
    style C0 fill:#22c55e,color:#fff
    style C1 fill:#22c55e,color:#fff
    style C2 fill:#22c55e,color:#fff

Each partition is consumed by exactly one consumer within a consumer group. To increase parallelism, add more partitions and consumers.
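
Partition assignment is usually driven by the message key, so all messages with the same key keep their relative order. A minimal sketch of the hash-mod idea (Kafka's default partitioner uses a different hash function, murmur2, but the principle is the same; the key and partition count are illustrative):

```python
import zlib

NUM_PARTITIONS = 3

def choose_partition(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    # Deterministic hash so the same key always maps to the same partition,
    # preserving per-key ordering (e.g. all events for one order).
    return zlib.crc32(key.encode("utf-8")) % num_partitions

print(choose_partition("order-123"))  # always the same partition for this key
```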


Kafka vs RabbitMQ vs SQS

Apache Kafka

A distributed event streaming platform designed for high-throughput, durable message processing. Messages are persisted to disk and retained for a configurable period.

Architecture: Producers write to topic partitions; consumers pull messages and track their own offsets.
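
A hedged sketch of this flow using the kafka-python client (the orders topic, the inventory-service group, and the broker address are illustrative assumptions):

```python
import json
from kafka import KafkaConsumer, KafkaProducer

# Producer: write JSON-encoded events to a topic partition.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("orders", key=b"order-123", value={"order_id": "order-123", "total": 42.0})
producer.flush()

# Consumer: pull messages, track progress via offsets, commit after processing.
consumer = KafkaConsumer(
    "orders",
    bootstrap_servers="localhost:9092",
    group_id="inventory-service",
    enable_auto_commit=False,
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
for msg in consumer:
    print("processing", msg.value)  # stand-in for the real handler
    consumer.commit()               # advance the group's offset only after success
```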

RabbitMQ

A traditional message broker implementing AMQP. Messages are pushed to consumers and removed from the queue after acknowledgment.

Architecture: Producers send to exchanges; exchanges route to queues based on routing rules; consumers receive messages from queues.
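
A hedged sketch of the exchange → queue → consumer flow using the pika client (the orders exchange, payment-tasks queue, and routing key are illustrative assumptions):

```python
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()

# The exchange routes to bound queues based on the routing key; both are durable.
channel.exchange_declare(exchange="orders", exchange_type="direct", durable=True)
channel.queue_declare(queue="payment-tasks", durable=True)
channel.queue_bind(queue="payment-tasks", exchange="orders", routing_key="order.created")

# Producer: publish a persistent message through the exchange.
channel.basic_publish(
    exchange="orders",
    routing_key="order.created",
    body=b'{"order_id": "order-123"}',
    properties=pika.BasicProperties(delivery_mode=2),  # persist message to disk
)

# Consumer: messages are pushed to the callback; the ack removes them from the queue.
def on_message(ch, method, properties, body):
    print("processing", body)                      # stand-in for the real handler
    ch.basic_ack(delivery_tag=method.delivery_tag)

channel.basic_consume(queue="payment-tasks", on_message_callback=on_message)
channel.start_consuming()
```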

Amazon SQS

A fully managed message queue service. No infrastructure to manage. Offers Standard (at-least-once, best-effort ordering) and FIFO (exactly-once, strict ordering) queues.
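
A hedged sketch of the send / receive / delete cycle with boto3 (the order-events queue name and the region are illustrative assumptions):

```python
import boto3

sqs = boto3.client("sqs", region_name="us-east-1")
queue_url = sqs.get_queue_url(QueueName="order-events")["QueueUrl"]

# Producer: enqueue a message.
sqs.send_message(QueueUrl=queue_url, MessageBody='{"order_id": "order-123"}')

# Consumer: long-poll, process, then delete. If a message is not deleted before
# its visibility timeout expires, SQS redelivers it (at-least-once semantics).
resp = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=10, WaitTimeSeconds=20)
for msg in resp.get("Messages", []):
    print("processing", msg["Body"])               # stand-in for the real handler
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
```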

flowchart TB
    subgraph Kafka_Arch["Kafka"]
        KP["Producer"]
        KT["Topic<br>(Partitioned Log)"]
        KC["Consumer<br>(Pull-based)"]
        KP --> KT --> KC
    end
    subgraph Rabbit_Arch["RabbitMQ"]
        RP["Producer"]
        RE["Exchange"]
        RQ["Queue"]
        RC["Consumer<br>(Push-based)"]
        RP --> RE --> RQ --> RC
    end
    subgraph SQS_Arch["Amazon SQS"]
        SP["Producer"]
        SQ["Queue<br>(Managed)"]
        SC["Consumer<br>(Poll-based)"]
        SP --> SQ --> SC
    end
    style Kafka_Arch fill:#3b82f6,color:#fff
    style Rabbit_Arch fill:#8b5cf6,color:#fff
    style SQS_Arch fill:#f59e0b,color:#fff
    style KP fill:#3b82f6,color:#fff
    style KT fill:#3b82f6,color:#fff
    style KC fill:#3b82f6,color:#fff
    style RP fill:#8b5cf6,color:#fff
    style RE fill:#8b5cf6,color:#fff
    style RQ fill:#8b5cf6,color:#fff
    style RC fill:#8b5cf6,color:#fff
    style SP fill:#f59e0b,color:#fff
    style SQ fill:#f59e0b,color:#fff
    style SC fill:#f59e0b,color:#fff

| Feature | Kafka | RabbitMQ | SQS |
| --- | --- | --- | --- |
| Model | Distributed log | Message broker | Managed queue |
| Throughput | Very high (millions/sec) | Moderate (tens of thousands/sec) | High (managed) |
| Ordering | Per partition | Per queue | FIFO queues only |
| Retention | Configurable (days/weeks) | Until consumed | 14 days max |
| Delivery | Pull-based | Push-based | Poll-based |
| Replay | Yes (consumers re-read) | No (message deleted after ack) | No |
| Scaling | Add partitions | Add queues/consumers | Automatic |
| Operations | Self-managed or managed (Confluent) | Self-managed or managed | Fully managed (AWS) |
| Best for | Event streaming, logs, analytics | Task queues, RPC, routing | Simple async tasks, AWS-native |

When to Choose What

Kafka: Event streaming, log aggregation, real-time analytics, data pipelines. When you need message replay and high throughput.

RabbitMQ: Task distribution, RPC patterns, complex routing. When you need flexible routing and push-based delivery.

SQS: Simple async processing, AWS-native workloads. When you want zero infrastructure management.


Event-Driven Architecture

In an event-driven architecture, services communicate by producing and consuming events. This pattern promotes loose coupling and independent scalability.

flowchart TB
    OS["Order Service"]
    EB["Event Bus<br>(Kafka)"]
    IS["Inventory Service"]
    PS["Payment Service"]
    NS["Notification Service"]
    AS["Analytics Service"]

    OS -->|"OrderPlaced"| EB
    EB -->|"OrderPlaced"| IS
    EB -->|"OrderPlaced"| PS
    EB -->|"OrderPlaced"| NS
    EB -->|"OrderPlaced"| AS

    style EB fill:#f59e0b,color:#fff
    style OS fill:#3b82f6,color:#fff
    style IS fill:#22c55e,color:#fff
    style PS fill:#8b5cf6,color:#fff
    style NS fill:#ef4444,color:#fff
    style AS fill:#22c55e,color:#fff

Benefits

  • Loose coupling: Services do not call each other directly
  • Independent scaling: Each service scales based on its own load
  • Resilience: If one service is down, events are queued and processed later
  • Extensibility: Add new consumers without modifying the producer

Challenges

  • Eventual consistency: Events take time to propagate
  • Debugging: Tracing a request across async services is complex
  • Ordering: Events may arrive out of order
  • Idempotency: Consumers must handle duplicate events gracefully

Pub/Sub Pattern

Publish-Subscribe allows multiple subscribers to receive the same message. Unlike point-to-point queues (where each message is consumed once), pub/sub broadcasts to all subscribers.

flowchart TB
    Pub["Publisher"]
    Topic["Topic: user-signup"]
    Sub1["Subscriber 1<br>Send welcome email"]
    Sub2["Subscriber 2<br>Create default settings"]
    Sub3["Subscriber 3<br>Track analytics"]

    Pub -->|"Publish"| Topic
    Topic -->|"Deliver"| Sub1
    Topic -->|"Deliver"| Sub2
    Topic -->|"Deliver"| Sub3

    style Topic fill:#f59e0b,color:#fff
    style Pub fill:#3b82f6,color:#fff
    style Sub1 fill:#22c55e,color:#fff
    style Sub2 fill:#8b5cf6,color:#fff
    style Sub3 fill:#22c55e,color:#fff

| Pattern | Message Delivery | Use Case |
| --- | --- | --- |
| Point-to-Point (Queue) | Each message consumed by one consumer | Task distribution, work queues |
| Pub/Sub (Topic) | Each message delivered to all subscribers | Event broadcasting, notifications |
| Fan-out | One message triggers multiple independent actions | Order processing (payment + inventory + notification) |
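
In Kafka, the same topic can serve either pattern depending on how consumer groups are assigned. A hedged sketch using kafka-python (topic and group names are illustrative):

```python
from kafka import KafkaConsumer

BROKERS = "localhost:9092"

# Pub/sub: each service subscribes with its OWN consumer group, so every
# group receives its own copy of every "user-signup" event.
email_sub = KafkaConsumer("user-signup", group_id="email-service",
                          bootstrap_servers=BROKERS)
settings_sub = KafkaConsumer("user-signup", group_id="settings-service",
                             bootstrap_servers=BROKERS)
analytics_sub = KafkaConsumer("user-signup", group_id="analytics-service",
                              bootstrap_servers=BROKERS)

# Point-to-point: starting a second consumer with an EXISTING group id does not
# duplicate delivery -- the two instances split the topic's partitions instead.
email_sub_replica = KafkaConsumer("user-signup", group_id="email-service",
                                  bootstrap_servers=BROKERS)
```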

Delivery Semantics

Message delivery guarantees determine how many times a consumer processes each message.

| Semantic | Description | Duplicates? | Message Loss? |
| --- | --- | --- | --- |
| At-most-once | Message delivered 0 or 1 times | No | Possible |
| At-least-once | Message delivered 1 or more times | Possible | No |
| Exactly-once | Message delivered exactly 1 time | No | No |

At-Most-Once

The producer sends the message and does not retry. If delivery fails, the message is lost. Fast but unreliable.

Use case: Metrics collection where occasional data loss is acceptable.

At-Least-Once

The producer retries until the message is acknowledged. The consumer may receive duplicates if the acknowledgment is lost.

Use case: Order processing (duplicates handled by idempotency).
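
The first two guarantees largely come down to producer configuration: whether the producer waits for acknowledgments and whether it retries. A hedged sketch using kafka-python parameter names (broker address is an assumption):

```python
from kafka import KafkaProducer

# At-most-once flavor: fire and forget. No acks, no retries -- a failed send is lost.
fire_and_forget = KafkaProducer(
    bootstrap_servers="localhost:9092",
    acks=0,
    retries=0,
)

# At-least-once flavor: wait for all in-sync replicas and retry on transient errors.
# Retries can produce duplicates, so the consumer must be idempotent.
reliable = KafkaProducer(
    bootstrap_servers="localhost:9092",
    acks="all",
    retries=5,
)
```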

Exactly-Once

The hardest guarantee to achieve. Requires idempotent consumers or transactional mechanisms. Kafka supports exactly-once semantics through transactions and idempotent producers.

Use case: Financial transactions, inventory updates.
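
A hedged sketch of the producer side using the confluent-kafka client (configuration keys follow librdkafka naming; the topic and transactional id are illustrative). For the guarantee to hold end to end, consumers would also need to read with isolation.level=read_committed:

```python
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "localhost:9092",
    "enable.idempotence": True,           # broker de-duplicates producer retries
    "transactional.id": "order-service-1",
})

producer.init_transactions()
producer.begin_transaction()
producer.produce("payments", key=b"order-123", value=b'{"amount": 42.0}')
producer.commit_transaction()             # messages become visible atomically
```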

flowchart TB
    subgraph AtMost["At-Most-Once"]
        AM1["Send"]
        AM2["No retry"]
        AM1 --> AM2
    end
    subgraph AtLeast["At-Least-Once"]
        AL1["Send"]
        AL2["Retry until ACK"]
        AL3["Consumer must be<br>idempotent"]
        AL1 --> AL2 --> AL3
    end
    subgraph Exactly["Exactly-Once"]
        EO1["Send with<br>transaction ID"]
        EO2["Deduplicate<br>on consumer"]
        EO1 --> EO2
    end
    style AtMost fill:#22c55e,color:#fff
    style AtLeast fill:#f59e0b,color:#fff
    style Exactly fill:#ef4444,color:#fff
    style AM1 fill:#22c55e,color:#fff
    style AM2 fill:#22c55e,color:#fff
    style AL1 fill:#f59e0b,color:#fff
    style AL2 fill:#f59e0b,color:#fff
    style AL3 fill:#f59e0b,color:#fff
    style EO1 fill:#ef4444,color:#fff
    style EO2 fill:#ef4444,color:#fff

Idempotency

Since at-least-once is the most practical guarantee, your consumers should be idempotent: processing the same message twice produces the same result as processing it once.

Techniques for idempotency (see the sketch after this list):

  • Idempotency key: Store processed message IDs; skip duplicates
  • Database constraints: Use UPSERT or unique constraints
  • Conditional updates: Only update if current state matches expected state
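
A minimal sketch of the idempotency-key technique, using SQLite's primary-key constraint as the deduplication store (table and handler names are illustrative; in production the key insert and the business change would share one transaction):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE processed (message_id TEXT PRIMARY KEY)")

def apply_business_logic(payload: dict) -> None:
    print("processing", payload)       # stand-in for the real side effect

def handle_once(message_id: str, payload: dict) -> None:
    # INSERT OR IGNORE inserts nothing (rowcount == 0) if the key already exists.
    cur = conn.execute(
        "INSERT OR IGNORE INTO processed (message_id) VALUES (?)", (message_id,)
    )
    if cur.rowcount == 0:
        return                         # duplicate delivery: safely skip
    apply_business_logic(payload)
    conn.commit()

handle_once("order-123", {"total": 42.0})
handle_once("order-123", {"total": 42.0})   # retried delivery is a no-op
```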

Designing an Order Processing System

Let us apply these concepts to a real-world example.

flowchart TB
    Client["Client"]
    API["API Gateway"]
    OS["Order Service"]
    Queue["Message Queue<br>(Kafka)"]
    Pay["Payment Service"]
    Inv["Inventory Service"]
    Ship["Shipping Service"]
    Notif["Notification Service"]
    DLQ["Dead Letter Queue"]

    Client --> API --> OS
    OS -->|"OrderCreated"| Queue
    Queue --> Pay
    Queue --> Inv
    Pay -->|"PaymentCompleted"| Queue
    Queue --> Ship
    Queue --> Notif
    Pay -->|"PaymentFailed"| DLQ

    style Queue fill:#f59e0b,color:#fff
    style DLQ fill:#ef4444,color:#fff
    style OS fill:#3b82f6,color:#fff
    style Pay fill:#8b5cf6,color:#fff
    style Inv fill:#22c55e,color:#fff
    style Ship fill:#22c55e,color:#fff
    style Notif fill:#22c55e,color:#fff

Design decisions (a sketch of the payment consumer follows the list):

  1. Kafka for the message queue (high throughput, event replay for debugging)
  2. At-least-once delivery with idempotent consumers (order ID as idempotency key)
  3. Dead letter queue for failed payments (manual review and retry)
  4. Event-driven decoupling (each service processes events independently)
  5. Separate topics for different event types (OrderCreated, PaymentCompleted, etc.)
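
To tie decisions 2 and 3 together, here is a hedged sketch of the payment consumer: it retries a failed charge up to three times, routes persistent failures to a dead letter topic, and commits its offset only afterwards (kafka-python client; topic names, the group id, and charge_card are illustrative assumptions):

```python
import json
from kafka import KafkaConsumer, KafkaProducer

def charge_card(order: dict) -> None:
    """Hypothetical payment call, assumed idempotent on order['order_id']."""

consumer = KafkaConsumer(
    "order-created",
    bootstrap_servers="localhost:9092",
    group_id="payment-service",
    enable_auto_commit=False,
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

MAX_ATTEMPTS = 3

for msg in consumer:
    order = msg.value
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            charge_card(order)
            producer.send("payment-completed", order)
            break
        except Exception:
            if attempt == MAX_ATTEMPTS:
                producer.send("payment-dlq", order)   # park for manual review
    consumer.commit()   # commit only after success or dead-lettering
```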

Practice Problems

Exercise 1: Basics

Explain the difference between these three scenarios and which delivery semantic is appropriate:

  1. Sending marketing push notifications to mobile users
  2. Processing credit card charges for online purchases
  3. Logging page view events for analytics

Exercise 2: Applied

Design an order processing system for an e-commerce platform that handles:

  • 50,000 orders per day
  • Each order triggers: payment processing, inventory update, email confirmation, analytics event
  • Payment failures must be retried up to 3 times
  • Orders must not be lost or double-processed

Specify: message queue choice (Kafka, RabbitMQ, or SQS), topic/queue design, delivery guarantee, idempotency strategy, and dead letter queue handling.

Challenge

Design a real-time notification system for a social media platform that handles:

  • 500 million users
  • 10 billion events per day (likes, comments, follows, mentions)
  • Notifications must be delivered within 1 second for online users
  • Offline users receive notifications when they come online
  • Users can configure notification preferences (mute, digest, real-time)

Include: event ingestion pipeline, message queue architecture, consumer design, delivery strategy for online vs offline users, and how to handle the thundering herd when a celebrity posts.


Summary

| Concept | Description |
| --- | --- |
| Async Communication | Producer sends message and continues; consumer processes later |
| Message Queue | Buffer between producers and consumers |
| Topic/Partition | Logical channel subdivided for parallel processing |
| Consumer Group | Set of consumers sharing workload |
| Kafka | High-throughput distributed log with replay |
| RabbitMQ | Traditional broker with flexible routing |
| SQS | Fully managed AWS queue service |
| Event-Driven | Services communicate through events, not direct calls |
| Pub/Sub | One message delivered to all subscribers |
| At-Least-Once | Retry until acknowledged; consumer must be idempotent |
| Dead Letter Queue | Collects messages that fail processing repeatedly |

Key Takeaways

  1. Use async processing for anything that does not need an immediate response - it improves resilience and scalability
  2. Kafka is the default choice for event streaming in system design interviews
  3. At-least-once with idempotent consumers is the most practical delivery guarantee
  4. Event-driven architecture promotes loose coupling but requires careful handling of eventual consistency
  5. Always include a dead letter queue for messages that cannot be processed
  6. Design consumers to be idempotent - this is the single most important principle for reliable message processing

Next up: On Day 6, we explore API Design and Rate Limiting - covering REST vs GraphQL, API gateway patterns, rate limiting algorithms, and authentication strategies for distributed systems.