Todo: https://medium.com/@yuvarajl/why-nutanix-beam-went-ahead-with-apache-pulsar-instead-of-apache-kafka-1415f592dbbb
Also read https://pk.org/417/notes/kafka.html
Goal: Create a distributed messaging system to handle large-scale streams of messages.
How can a cluster of computers handle the influx of never-ending streams of data, coming from multiple sources? This data may come from industrial sensors, IoT devices scattered around the world, or log files from tens of thousands of systems in a data center.
It’s easy enough to say that we can divide the work among multiple computers, but how exactly would we do that?
Overview
Ref: https://jaceklaskowski.gitbooks.io/apache-kafka/content/kafka-overview.html
https://stackoverflow.com/questions/41744506/difference-between-stream-processing-and-message-processing
Event broker vs Message queue
MESSAGE QUEUE
Messages are put onto a queue, and a consumer pulls each message and processes it. Once a message is acknowledged as consumed, it is deleted. Messages are split between competing consumers, which makes it hard to broadcast the same event to multiple systems.
An example of this is Amazon SQS: publish messages to the queue, listen for them, process them, and they are removed from the queue.
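The queue semantics above (one consumer per message, deleted after ack) can be sketched in memory. This is a hypothetical model for illustration, not the real SQS API:

```python
from collections import deque

class MessageQueue:
    """Sketch of queue semantics: each message goes to exactly one
    consumer and is deleted once acknowledged (no replay)."""
    def __init__(self):
        self._messages = deque()

    def publish(self, message):
        self._messages.append(message)

    def receive(self):
        # Whichever consumer polls first gets the message.
        return self._messages[0] if self._messages else None

    def ack(self, message):
        # Acknowledged messages are removed for good.
        if self._messages and self._messages[0] == message:
            self._messages.popleft()

queue = MessageQueue()
queue.publish("order-123")
msg = queue.receive()   # one consumer gets the message
queue.ack(msg)          # after ack it is gone
print(queue.receive())  # nothing left to deliver
```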
EVENT BROKER
Event brokers are a push system: they push events downstream to consumers. An example of this is Amazon EventBridge.
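The push model can be sketched the same way: the broker invokes every subscriber instead of waiting to be polled. A hypothetical in-memory model, loosely inspired by EventBridge rather than its actual API:

```python
class EventBroker:
    """Sketch of push-style delivery: every event is pushed to all
    registered subscribers; nothing is deleted or competed for."""
    def __init__(self):
        self._subscribers = []

    def subscribe(self, handler):
        self._subscribers.append(handler)

    def publish(self, event):
        # Push: the broker calls each consumer; consumers do not poll.
        for handler in self._subscribers:
            handler(event)

seen_by_billing, seen_by_audit = [], []
broker = EventBroker()
broker.subscribe(seen_by_billing.append)
broker.subscribe(seen_by_audit.append)
broker.publish({"type": "OrderPlaced", "id": 1})
# Both subscribers received the same event, unlike a queue.
```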
Ref: https://serverlessland.com/event-driven-architecture/visuals/message-queue-vs-event-broker
Message Queue (MQ)
- Designed for delivery guarantees (at-least-once, exactly-once).
- Great for sporadic workloads and asynchronous tasks.
- Messages are consumed and removed. Once acknowledged, they’re gone.
- Think: order processing, background jobs, retry logic.
Stream
- Designed for real-time, high-throughput, continuous data.
- Data is append-only and replayable.
- Multiple consumers can read the same stream independently.
- Think: real-time analytics dashboards, fraud detection, monitoring pipelines.
A message queue here would introduce unnecessary latency and discard historical events that analytics needs to stay accurate.
Rule of thumb:
- If you care about delivery → Message Queue.
- If you care about freshness & scale → Stream.
That’s why the right answer here is: Streams for real-time analytics.
Why Streams fit Aaron’s case (real-time analytics for e-commerce):
- Freshness matters → Streams process events as they arrive, with low latency.
- Replayability matters → If a consumer goes down, it can re-read from the offset.
- Scale matters → Streams partition data for massive parallel consumption.
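The replayability point above is what distinguishes a stream from a queue: the log is append-only, and each consumer tracks its own offset. A minimal sketch (hypothetical names, not a real Kafka client):

```python
class Stream:
    """Sketch of stream semantics: an append-only log that each
    consumer reads at its own offset, so events are replayable."""
    def __init__(self):
        self._log = []

    def append(self, event):
        self._log.append(event)

    def read_from(self, offset):
        # Reading never deletes: other consumers see the same events.
        return self._log[offset:]

stream = Stream()
for event in ["view", "add_to_cart", "purchase"]:
    stream.append(event)

dashboard_offset = 0
events = stream.read_from(dashboard_offset)  # consumer 1 reads everything
dashboard_offset += len(events)              # consumer tracks its own offset

# A second consumer, or a restarted one, replays from offset 0.
replayed = stream.read_from(0)
```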
How Google PubSub achieves similar functionality
By separating Topic and Subscription
AMQP Protocol
RabbitMQ supports different types of Exchanges. To achieve Pub/Sub-like functionality, a fanout exchange with one queue per subscriber works well.
Similarly, SNS fanning out to multiple SQS queues achieves the same pattern on AWS.
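The exchange-type idea can be sketched in memory: a fanout exchange copies a message to every bound queue, while a direct exchange routes by routing key. This is a hypothetical model of AMQP routing, not the pika/RabbitMQ API:

```python
class Exchange:
    """Sketch of AMQP-style routing. exchange_type is 'fanout'
    (copy to every bound queue) or 'direct' (match routing key)."""
    def __init__(self, exchange_type):
        self.exchange_type = exchange_type
        self._bindings = []  # (routing_key, queue) pairs

    def bind(self, queue, routing_key=""):
        self._bindings.append((routing_key, queue))

    def publish(self, message, routing_key=""):
        for bound_key, queue in self._bindings:
            if self.exchange_type == "fanout" or bound_key == routing_key:
                queue.append(message)

fanout = Exchange("fanout")
sub_a, sub_b = [], []
fanout.bind(sub_a)
fanout.bind(sub_b)
fanout.publish("event-1")  # both queues get a copy: Pub/Sub style

direct = Exchange("direct")
errors = []
direct.bind(errors, routing_key="error")
direct.publish("disk full", routing_key="error")  # matches binding
direct.publish("heartbeat", routing_key="info")   # no binding, dropped
```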
Kafka
Read at https://www.oreilly.com/library/view/kafka-the-definitive/9781491936153/ch04.html
Topics and Partitions
Ref: Kafka white paper
Consumers
Ref: https://stackoverflow.com/questions/36203764/how-can-i-scale-kafka-consumers to read about scaling consumers
Write scalability
Read scalability
Ref: https://www.instaclustr.com/blog/the-power-of-kafka-partitions-how-to-get-the-most-out-of-your-kafka-cluster/
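Both scalability headings above come down to partitions: writes scale by hashing keys across partitions (Kafka's default partitioner uses murmur2; crc32 below is a stand-in), and reads scale by dividing partitions among the consumers of a group. A simplified sketch, not the real client's assignment protocol:

```python
import zlib

def partition_for(key, num_partitions):
    """Write scalability: the same key always hashes to the same
    partition, so per-key ordering is preserved while load spreads."""
    return zlib.crc32(key.encode()) % num_partitions

def assign_partitions(partitions, consumers):
    """Read scalability: each partition goes to exactly one consumer
    in the group; consumers beyond the partition count sit idle.
    (Simplified round-robin assignment.)"""
    assignment = {consumer: [] for consumer in consumers}
    for i, partition in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(partition)
    return assignment

print(partition_for("user-42", 4))                    # stable per key
print(assign_partitions([0, 1, 2, 3], ["c1", "c2"]))  # 2 partitions each
print(assign_partitions([0, 1], ["c1", "c2", "c3"]))  # c3 gets nothing
```

This is also why partition count caps consumer-group parallelism: adding a fourth consumer to a three-partition topic adds no read throughput.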
Zookeeper
TODO:
Zuul architecture, https://www.youtube.com/watch?v=6w6E_B55p0E