I am trying to understand and document how producers & consumers
will/should behave in case of node failures in 0.8. I know there are
various other threads that discuss this but I wanted to bring all the
information together in one post. This should help people building
producers & consumers in other languages as well. Here is my understanding
of how Kafak behaves in failures:

Case 1: If a node fails that wasn't a leader for any partitions
No impact on consumers and producers

Case 2: If a leader node fails but another in sync node can be become a
leader
All publishing to and consumption from the partition whose leader failed
will momentarily stop until a new leader is elected. Producers should
implement retry logic in such cases (and in fact in all kinds of errors
from Kafka) and consumers can (depending on your use case) either continue
to other partitions after retrying decent number of times (in case you are
fetching from partitions in round robin fashion) or keep retrying until
leader is available.

Case 3: If a leader node goes down and no other in sync nodes are available
In this case, publishing to and consumption from the partition will halt
and will not resume until the faulty leader node recovers. In this case,
producers should fail the publish request after retrying decent number of
times and provide a callback to the client of the producer to take
corrective action. Consumers again have a choice to continue to other
partitions after retrying decent number of times (in case you are fetching
from partitions in round robin fashion) or keep retrying until leader is
available. In case of latter, the entire consumer process will halt until
the faulty node recovers.

Do I have this right?

Reply via email to