>> publishing to and consumption from the partition will halt and will not resume until the faulty leader node recovers
Can you confirm that's the case? I think they won't wait until the leader has recovered and will instead try to elect a new leader from the existing non-ISR replicas? And if they do wait, what happens if the faulty leader never comes back?

On Thu, Oct 24, 2013 at 6:24 AM, Aniket Bhatnagar <aniket.bhatna...@gmail.com> wrote:

> Thanks Neha
>
> On 24 October 2013 18:11, Neha Narkhede <neha.narkh...@gmail.com> wrote:
>
> > Yes. And during retries, the producer and consumer refetch metadata.
> >
> > Thanks,
> > Neha
> >
> > On Oct 24, 2013 3:09 AM, "Aniket Bhatnagar" <aniket.bhatna...@gmail.com> wrote:
> >
> > > I am trying to understand and document how producers & consumers will/should behave in case of node failures in 0.8. I know there are various other threads that discuss this, but I wanted to bring all the information together in one post. This should also help people building producers & consumers in other languages. Here is my understanding of how Kafka behaves during failures:
> > >
> > > Case 1: If a node fails that wasn't a leader for any partition
> > > No impact on consumers or producers.
> > >
> > > Case 2: If a leader node fails but another in-sync node can become the leader
> > > All publishing to and consumption from the partition whose leader failed will momentarily stop until a new leader is elected. Producers should implement retry logic in such cases (and in fact for all kinds of errors from Kafka), and consumers can (depending on the use case) either move on to other partitions after retrying a decent number of times (if they fetch from partitions in round-robin fashion) or keep retrying until a leader is available.
> > >
> > > Case 3: If a leader node goes down and no other in-sync nodes are available
> > > In this case, publishing to and consumption from the partition will halt and will not resume until the faulty leader node recovers. Producers should fail the publish request after retrying a decent number of times and provide a callback so the client of the producer can take corrective action. Consumers again have a choice: move on to other partitions after retrying a decent number of times (if they fetch from partitions in round-robin fashion) or keep retrying until a leader is available. In the latter case, the entire consumer process will halt until the faulty node recovers.
> > >
> > > Do I have this right?
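
To make the retry behaviour described in Cases 2 and 3 concrete, below is a minimal sketch in Java of the pattern the thread recommends: retry a bounded number of times, refetch metadata between attempts, and surface the failure through a callback once retries are exhausted. The PartitionSender and FailureCallback interfaces are hypothetical stand-ins for whatever client library is in use; this is not the Kafka producer API itself.

    // Sketch of the bounded-retry publish pattern discussed in the thread.
    // PartitionSender and FailureCallback are hypothetical interfaces, not
    // part of any real Kafka client API.

    import java.util.concurrent.TimeUnit;

    public class RetryingPublisher {

        /** Hypothetical low-level send; throws on broker/leader errors. */
        interface PartitionSender {
            void send(String topic, int partition, byte[] message) throws Exception;
            void refreshMetadata(String topic) throws Exception; // refetch leader info
        }

        /** Hypothetical callback invoked once all retries are exhausted. */
        interface FailureCallback {
            void onPublishFailed(String topic, int partition, byte[] message, Exception cause);
        }

        private final PartitionSender sender;
        private final FailureCallback failureCallback;
        private final int maxRetries;
        private final long backoffMs;

        RetryingPublisher(PartitionSender sender, FailureCallback failureCallback,
                          int maxRetries, long backoffMs) {
            this.sender = sender;
            this.failureCallback = failureCallback;
            this.maxRetries = maxRetries;
            this.backoffMs = backoffMs;
        }

        /** Publish with bounded retries, refetching metadata between attempts. */
        public void publish(String topic, int partition, byte[] message) {
            Exception lastError = null;
            for (int attempt = 0; attempt <= maxRetries; attempt++) {
                try {
                    sender.send(topic, partition, message);
                    return; // success
                } catch (Exception e) {
                    lastError = e;
                    try {
                        // Leadership may have moved (Case 2), so back off and
                        // refetch metadata instead of retrying blindly.
                        TimeUnit.MILLISECONDS.sleep(backoffMs);
                        sender.refreshMetadata(topic);
                    } catch (InterruptedException ie) {
                        Thread.currentThread().interrupt();
                        break;
                    } catch (Exception ignored) {
                        // Metadata refresh can itself fail while no leader exists (Case 3).
                    }
                }
            }
            // Retries exhausted (e.g. no in-sync replica ever became leader):
            // hand the failure back to the caller rather than blocking forever.
            failureCallback.onPublishFailed(topic, partition, message, lastError);
        }
    }

A consumer that fetches partitions in round-robin fashion could apply the same bounded-retry loop and simply skip to the next partition when retries run out, instead of invoking a failure callback.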