Sorry for the long email, but I've tried to keep it organized, at least. According to http://kafka.apache.org/, "Kafka has a modern cluster-centric design that offers strong durability and fault-tolerance guarantees" and "Messages are persisted on disk and replicated within the cluster to prevent data loss."
I'm trying to understand what this means in some detail, so I have two questions.

1. Fault-Tolerance

If a Broker in a Kafka cluster fails (say, the EC2 instance dies), what happens? Then suppose I add a new Broker to the cluster (that's my responsibility, not Kafka's). What happens when it joins? To be more specific: if the cluster consists of a ZooKeeper node and B Brokers (3, for example), can a Kafka system guarantee to tolerate up to B-1 (2, for example) Broker failures? (The first sketch in the P.S. below shows the kind of replicated topic I have in mind.)

2. Durability at an application level

What guarantees does Kafka make about durability at an application level, in practice? By "application level" I mean guarantees that a produced message gets consumed and acted upon by the application that uses Kafka. My present understanding is that Kafka does not make these kinds of guarantees, because there are no acks, so it is up to the application developer to handle this. Is that right?

Here's my understanding: persisting messages on disk and replicating them is why Kafka has durability guarantees. But, from an application perspective, what happens when a consumer pulls a message and then fails before acting on it? That would still update the Kafka consumer offset, right? So, without some thinking and planning ahead in the Kafka system design, the application's consumers would have no way of knowing that a message was not actually processed. (The second sketch in the P.S. shows the kind of consumer loop I'm worried about.)

Conclusion / Last Question

I'm interested in making the chance of message loss minimal, at a system level. Any pointers on what to read or think about would be appreciated!

Thanks!
-David
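P.S. To make question 1 concrete, here is a rough sketch of the topic setup I have in mind. Treat it as pseudocode rather than my real code: the broker addresses and topic name are placeholders, and I'm assuming the Java AdminClient is available in the version I'd be running.

    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.NewTopic;
    import java.util.Collections;
    import java.util.Properties;

    public class CreateReplicatedTopic {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            // B = 3 Brokers in the cluster
            props.put("bootstrap.servers", "broker1:9092,broker2:9092,broker3:9092");

            try (AdminClient admin = AdminClient.create(props)) {
                // One partition, replicated onto all 3 Brokers.
                // Question 1 is whether this setup can survive the loss of
                // up to B-1 = 2 of those Brokers without losing data.
                NewTopic topic = new NewTopic("events", 1, (short) 3);
                admin.createTopics(Collections.singleton(topic)).all().get();
            }
        }
    }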
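P.P.S. And for question 2, here is the kind of consumer loop I'm worried about, again only a sketch. The topic, group id, and process() method are made up for illustration, and I'm assuming the Java consumer's enable.auto.commit / commitSync settings behave roughly as the comments describe.

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;

    public class AtLeastOnceConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker1:9092");
            props.put("group.id", "my-app");
            // If this were "true", the offset could be committed even though the
            // application crashed before acting on the message -- the situation
            // described in question 2.
            props.put("enable.auto.commit", "false");
            props.put("key.deserializer",
                      "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                      "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("events"));
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                    for (ConsumerRecord<String, String> record : records) {
                        process(record); // application logic; this is where a failure could happen
                    }
                    // Commit offsets only after the whole batch was processed, so a crash
                    // above means the messages get re-delivered instead of silently skipped.
                    consumer.commitSync();
                }
            }
        }

        // Hypothetical application-level handler.
        static void process(ConsumerRecord<String, String> record) {
            System.out.println(record.value());
        }
    }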