I am planning to use the Kafka 0.8 spout, and after studying the source code I found that it doesn't handle errors. There is a fork that adds a try/catch around the fetchResponse call, but my guess is that this will lead to the spout retrying the same partition indefinitely until the leader is elected/comes back online (a rough sketch of the kind of guard I have in mind is below the quoted thread). I will try to submit a pull request to the kafka 0.8 plus spout to fix this issue in a couple of days.

On 25 Oct 2013 02:58, "Chris Bedford" <ch...@buildlackey.com> wrote:
> Thank you for posting these guidelines. I'm wondering if anyone out there
> that is using the Kafka spout (for Storm) knows whether or not the Kafka
> spout takes care of these types of details?
>
> regards
> -chris
>
>
> On Thu, Oct 24, 2013 at 2:05 PM, Neha Narkhede <neha.narkh...@gmail.com> wrote:
>
> > Yes, when a leader dies, the preference is to pick a leader from the ISR.
> > If not, the leader is picked from any other available replica. But if no
> > replicas are alive, the partition goes offline and all production and
> > consumption halts, until at least one replica is brought online.
> >
> > Thanks,
> > Neha
> >
> >
> > On Thu, Oct 24, 2013 at 11:57 AM, Kane Kane <kane.ist...@gmail.com> wrote:
> >
> > > >>publishing to and consumption from the partition will halt
> > > and will not resume until the faulty leader node recovers
> > >
> > > Can you confirm that's the case? I think they won't wait until the
> > > leader has recovered and will try to elect a new leader from existing
> > > non-ISR replicas? And if they do wait, what if the faulty leader never
> > > comes back?
> > >
> > >
> > > On Thu, Oct 24, 2013 at 6:24 AM, Aniket Bhatnagar <
> > > aniket.bhatna...@gmail.com> wrote:
> > >
> > > > Thanks Neha
> > > >
> > > >
> > > > On 24 October 2013 18:11, Neha Narkhede <neha.narkh...@gmail.com> wrote:
> > > >
> > > > > Yes. And during retries, the producer and consumer refetch metadata.
> > > > >
> > > > > Thanks,
> > > > > Neha
> > > > > On Oct 24, 2013 3:09 AM, "Aniket Bhatnagar" <
> > > > > aniket.bhatna...@gmail.com> wrote:
> > > > >
> > > > > > I am trying to understand and document how producers & consumers
> > > > > > will/should behave in case of node failures in 0.8. I know there are
> > > > > > various other threads that discuss this, but I wanted to bring all
> > > > > > the information together in one post. This should help people
> > > > > > building producers & consumers in other languages as well. Here is
> > > > > > my understanding of how Kafka behaves in failures:
> > > > > >
> > > > > > Case 1: If a node fails that wasn't a leader for any partitions
> > > > > > No impact on consumers and producers.
> > > > > >
> > > > > > Case 2: If a leader node fails but another in-sync node can become
> > > > > > a leader
> > > > > > All publishing to and consumption from the partition whose leader
> > > > > > failed will momentarily stop until a new leader is elected.
> > > > > > Producers should implement retry logic in such cases (and in fact
> > > > > > for all kinds of errors from Kafka), and consumers can (depending
> > > > > > on your use case) either continue to other partitions after
> > > > > > retrying a decent number of times (in case you are fetching from
> > > > > > partitions in round-robin fashion) or keep retrying until a leader
> > > > > > is available.
> > > > > >
> > > > > > Case 3: If a leader node goes down and no other in-sync nodes are
> > > > > > available
> > > > > > In this case, publishing to and consumption from the partition will
> > > > > > halt and will not resume until the faulty leader node recovers. In
> > > > > > this case, producers should fail the publish request after retrying
> > > > > > a decent number of times and provide a callback to the client of
> > > > > > the producer to take corrective action. Consumers again have a
> > > > > > choice to continue to other partitions after retrying a decent
> > > > > > number of times (in case you are fetching from partitions in
> > > > > > round-robin fashion) or keep retrying until a leader is available.
> > > > > > In the latter case, the entire consumer process will halt until the
> > > > > > faulty node recovers.
> > > > > >
> > > > > > Do I have this right?
>
>
> --
> Chris Bedford
>
> Founder & Lead Lackey
> Build Lackey Labs: http://buildlackey.com
> Go Grails!: http://blog.buildlackey.com
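For reference, here is a minimal sketch of the kind of guard I mean on the consumer/spout side: wrap the SimpleConsumer fetch in a try/catch, check fetchResponse.hasError(), and back off with a bounded number of retries so the spout does not hammer the same partition while a leader election is in progress. This is not the actual spout code; the client id, fetch size, retry limit and backoff values are placeholders I made up.

import kafka.api.FetchRequest;
import kafka.api.FetchRequestBuilder;
import kafka.common.ErrorMapping;
import kafka.javaapi.FetchResponse;
import kafka.javaapi.consumer.SimpleConsumer;

public class GuardedFetch {

    // Returns a successful FetchResponse, or null if we gave up after maxRetries.
    public static FetchResponse fetchWithBackoff(SimpleConsumer consumer,
                                                 String topic,
                                                 int partition,
                                                 long offset,
                                                 int maxRetries) throws InterruptedException {
        long backoffMs = 200;                       // made-up starting backoff
        for (int attempt = 0; attempt < maxRetries; attempt++) {
            FetchRequest req = new FetchRequestBuilder()
                    .clientId("guarded-spout")      // placeholder client id
                    .addFetch(topic, partition, offset, 64 * 1024)
                    .build();
            FetchResponse response;
            try {
                response = consumer.fetch(req);
            } catch (Exception e) {
                // Broker unreachable / socket error: back off instead of spinning.
                Thread.sleep(backoffMs);
                backoffMs = Math.min(backoffMs * 2, 10_000);
                continue;
            }
            if (!response.hasError()) {
                return response;
            }
            short code = response.errorCode(topic, partition);
            if (code == ErrorMapping.NotLeaderForPartitionCode()
                    || code == ErrorMapping.LeaderNotAvailableCode()) {
                // Leader election in progress: back off, then let the caller
                // re-discover the leader before the next attempt.
                Thread.sleep(backoffMs);
                backoffMs = Math.min(backoffMs * 2, 10_000);
            } else {
                // Not retriable in this sketch (e.g. offset out of range).
                throw new RuntimeException("Fetch failed with error code " + code);
            }
        }
        return null;   // caller decides: move on to other partitions or fail the spout
    }
}

After a null return or a leader error, the caller would refresh partition metadata (e.g. via a TopicMetadataRequest) to find the new leader before the next fetch, rather than reconnecting to the old broker.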
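On the producer side (Case 3 in Aniket's summary above), a similarly rough sketch of bounded retries plus a callback so the application can take corrective action when no in-sync replica comes back. The PublishFailureHandler interface and its onGiveUp method are made up for illustration; only the 0.8 javaapi Producer/KeyedMessage calls are from Kafka. Note that producer.send already retries internally (message.send.max.retries) and refetches metadata, so this outer loop only adds the "fail after a decent number of times, then call back" behaviour described above.

import kafka.javaapi.producer.Producer;
import kafka.producer.KeyedMessage;

public class RetryingPublisher {

    /** Hypothetical callback so the application can take corrective action. */
    public interface PublishFailureHandler {
        void onGiveUp(KeyedMessage<String, String> message, Exception lastError);
    }

    public static void send(Producer<String, String> producer,
                            KeyedMessage<String, String> message,
                            int maxRetries,
                            PublishFailureHandler failureHandler) throws InterruptedException {
        Exception lastError = null;
        long backoffMs = 200;                      // made-up starting backoff
        for (int attempt = 0; attempt < maxRetries; attempt++) {
            try {
                producer.send(message);            // may block and retry internally
                return;                            // published successfully
            } catch (Exception e) {
                lastError = e;
                Thread.sleep(backoffMs);           // give leader election a chance to finish
                backoffMs = Math.min(backoffMs * 2, 10_000);
            }
        }
        // No leader came back in time: hand control back to the application.
        failureHandler.onGiveUp(message, lastError);
    }
}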