Thanks, Neha.

On 24 October 2013 18:11, Neha Narkhede <neha.narkh...@gmail.com> wrote:

> Yes. And during retries, the producer and consumer refetch metadata.
>
> Thanks,
> Neha
> On Oct 24, 2013 3:09 AM, "Aniket Bhatnagar" <aniket.bhatna...@gmail.com>
> wrote:
>
> > I am trying to understand and document how producers & consumers
> > will/should behave in case of node failures in 0.8. I know there are
> > various other threads that discuss this, but I wanted to bring all the
> > information together in one post. This should help people building
> > producers & consumers in other languages as well. Here is my
> > understanding of how Kafka behaves in failures:
> >
> > Case 1: If a node fails that wasn't a leader for any partitions
> > No impact on consumers and producers.
> >
> > Case 2: If a leader node fails but another in-sync node can become
> > the leader
> > All publishing to and consumption from the partition whose leader
> > failed will momentarily stop until a new leader is elected. Producers
> > should implement retry logic in such cases (and in fact for all kinds
> > of errors from Kafka). Consumers can, depending on the use case,
> > either continue to other partitions after retrying a reasonable number
> > of times (if they are fetching from partitions in round-robin fashion)
> > or keep retrying until a leader is available.
> >
> > Case 3: If a leader node goes down and no other in-sync nodes are
> > available
> > In this case, publishing to and consumption from the partition will
> > halt and will not resume until the faulty leader node recovers.
> > Producers should fail the publish request after retrying a reasonable
> > number of times and provide a callback to the client of the producer
> > so it can take corrective action. Consumers again have a choice:
> > continue to other partitions after retrying a reasonable number of
> > times (if they are fetching from partitions in round-robin fashion) or
> > keep retrying until a leader is available. In the latter case, the
> > entire consumer process will halt until the faulty node recovers.
> >
> > Do I have this right?
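
For anyone writing a client in another language, the retry behaviour described in the thread above might look roughly like the sketch below. It is not taken from any Kafka client: `LeaderNotAvailable`, `send_with_retry`, `poll_round_robin` and the `send`, `fetch` and `refresh_metadata` callables are all hypothetical names standing in for whatever your client library provides. The only behaviour it tries to capture is what the thread states: refetch metadata before each retry, fail a publish with a callback after a bounded number of retries, and let a round-robin consumer move on to other partitions.

```python
import time


class LeaderNotAvailable(Exception):
    """Hypothetical error: the partition currently has no reachable leader."""


def send_with_retry(send, refresh_metadata, message, max_retries=3,
                    backoff_s=0.1, on_failure=None, sleep=time.sleep):
    """Publish `message`, refetching metadata before every retry.

    Returns True on success. Once `max_retries` retries are exhausted,
    calls `on_failure(message, error)` if given and returns False.
    """
    last_error = None
    for attempt in range(max_retries + 1):
        if attempt > 0:
            sleep(backoff_s)
            refresh_metadata()  # a new leader may have been elected
        try:
            send(message)
            return True
        except LeaderNotAvailable as err:
            last_error = err
    if on_failure is not None:
        on_failure(message, last_error)  # let the caller take corrective action
    return False


def poll_round_robin(partitions, fetch, refresh_metadata, max_retries=3,
                     backoff_s=0.1, sleep=time.sleep):
    """Fetch once from each partition, skipping any whose leader stays down.

    Returns a dict of partition -> fetched batch for partitions that answered.
    """
    results = {}
    for partition in partitions:
        for attempt in range(max_retries + 1):
            if attempt > 0:
                sleep(backoff_s)
                refresh_metadata()
            try:
                results[partition] = fetch(partition)
                break
            except LeaderNotAvailable:
                continue  # retry; after the last attempt, move to the next partition
    return results
```

A consumer that must not skip data would instead keep retrying the same partition, which, as noted in Case 3, blocks the whole consumer until the leader comes back.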