Hanish,

In this case I believe it is a bug for the kill -9 scenario. Could you file a JIRA and describe the process to reproduce it?
Guozhang

On Mon, Dec 23, 2013 at 7:42 PM, Hanish Bansal <hanish.bansal.agar...@gmail.com> wrote:

> Sorry, the last message was sent by mistake.
>
> Hi Guozhang,
>
> Please find my comments below.
>
> On Tue, Dec 24, 2013 at 12:52 AM, Guozhang Wang <wangg...@gmail.com> wrote:
>
> > Hanish,
> >
> > Originally, when you create the two partitions, their leadership should be
> > evenly distributed to the two brokers, i.e. one broker gets one partition.
> > But in your case broker 1 is the leader for both partition 1 and partition 0,
> > and from the replica list broker 0 should originally have been the leader for
> > partition 1, since the leader of a partition should be the first broker in
> > the replica list.
>
> When I create the topic, leadership is indeed evenly distributed to the two
> brokers as you said. One important thing: when leadership is evenly
> distributed (say broker-0 is the leader of partition 1 and broker-1 is the
> leader of partition 0), there is NO DATA LOSS. My scenario occurs when I
> restart one node after the topic has been created, because there is only one
> live broker for some time, and that live broker becomes the leader for both
> partitions.
>
> > This means broker 0 was bounced or halted (e.g. by a GC pause, etc.) before,
> > and hence the leadership of partition 1 migrated to broker 1; broker 0 is
> > also still catching up after the bounce, since it is not in the ISR for any
> > partition yet. In this case, when you bounce broker 1, broker 0, which is not
> > in the ISR, will be selected as the new leader for both partitions and hence
> > cause data loss.
> >
> > If you are doing experiments with rolling bounces of, say, N replicas, one
> > thing to do is to wait for the ISR to contain at least 2 brokers before
> > bouncing the next one; otherwise data loss is possible even if the number of
> > replicas is larger than 2.
>
> Yes, I have tried that: after broker-0 was restarted I waited for some time so
> that it came back into the ISR list. The ISR status was:
>
>     topic: test-trunk111  partition: 0  leader: 1  replicas: 1,0  isr: 0,1
>     topic: test-trunk111  partition: 1  leader: 1  replicas: 0,1  isr: 0,1
>
> Then I started producing data, killed broker 1 and observed the behaviour.
> There is still data loss, even though both brokers are in the ISR list. I did
> notice slightly different behaviour in this case: there is less data loss than
> when only one broker is in the ISR list. In the first case (only one broker in
> the ISR) I saw 50-60% data loss, whereas in this case (both brokers in the
> ISR) I saw only 2-3% data loss.
>
> > If you want to read more, I would recommend this blog post about Kafka's
> > guarantees:
> > http://blog.empathybox.com/post/62279088548/a-few-notes-on-kafka-and-jepsen
> >
> > Guozhang
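For reference, a minimal sketch of checking that ISR status from the command line before bouncing the next broker; the ZooKeeper address is a placeholder, and the tool name assumes an 0.8.1/trunk layout (the 0.8.0 equivalent is shown as a comment). Only proceed with the next bounce once both broker ids appear in the isr column:

    bin/kafka-topics.sh --describe --zookeeper localhost:2181 --topic test-trunk111
    # On an 0.8.0 installation the equivalent tool is:
    # bin/kafka-list-topic.sh --zookeeper localhost:2181 --topic test-trunk111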
On Sun, Dec 22, 2013 at 10:38 PM, Hanish Bansal <hanish.bansal.agar...@gmail.com> wrote:

> Hi Guozhang,
>
> When both nodes are alive, the topic ISR status is:
>
>     topic: test-trunk111  partition: 0  leader: 0  replicas: 1,0  isr: 0
>     topic: test-trunk111  partition: 1  leader: 0  replicas: 0,1  isr: 0
>
> Now, since the leader node is broker-0, I kill the leader node while I am
> producing data. After the leader goes down, the topic ISR status is:
>
>     topic: test-trunk111  partition: 0  leader: 1  replicas: 1,0  isr: 1
>     topic: test-trunk111  partition: 1  leader: 1  replicas: 0,1  isr: 1
>
> After all the data has been produced, when I consume it there is some data
> loss.
>
> Also, in the controller logs there are entries like:
>
>     [2013-12-23 10:25:07,648] DEBUG [OfflinePartitionLeaderSelector]: No broker in ISR is alive for [test-trunk111,1]. Pick the leader from the alive assigned replicas: 1 (kafka.controller.OfflinePartitionLeaderSelector)
>     [2013-12-23 10:25:07,648] WARN [OfflinePartitionLeaderSelector]: No broker in ISR is alive for [test-trunk111,1]. Elect leader 1 from live brokers 1. There's potential data loss. (kafka.controller.OfflinePartitionLeaderSelector)
>     [2013-12-23 10:25:07,649] INFO [OfflinePartitionLeaderSelector]: Selected new leader and ISR {"leader":1,"leader_epoch":1,"isr":[1]} for offline partition [test-trunk111,1] (kafka.controller.OfflinePartitionLeaderSelector)
>
> Is there any solution for this behaviour?

On Fri, Dec 20, 2013 at 7:27 PM, Guozhang Wang <wangg...@gmail.com> wrote:

> Hanish,
>
> One thing you can check when you kill one of the brokers is whether the other
> broker is in the ISR list of the partitions that the killed broker was
> hosting. This can be done using the kafka-topics tool.
>
> Also, you can check the controller log for any entry like "No broker in ISR is
> alive for %s. Elect leader %d from live brokers %s. There's potential data
> loss."
>
> Guozhang
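For reference, a minimal sketch of scanning for those controller entries, assuming the stock log4j.properties shipped with the broker, which routes controller messages to controller.log under the Kafka log directory:

    grep "No broker in ISR is alive" logs/controller.log
    grep "There's potential data loss" logs/controller.log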
On Fri, Dec 20, 2013 at 9:11 AM, Jun Rao <jun...@gmail.com> wrote:

> Could you reproduce this easily? If so, could you file a JIRA and describe the
> steps?
>
> Thanks,
>
> Jun

On Thu, Dec 19, 2013 at 9:41 PM, Hanish Bansal <hanish.bansal.agar...@gmail.com> wrote:

> Hi Guozhang,
>
> I have tried with Kafka 0.8.1 after applying patch 1188, but that does not
> help in this case.
>
> controlled.shutdown.enable is also not helpful in the case of an abnormal
> shutdown (i.e. SIGKILL (-9)).
>
> Any other suggestion?

On Thu, Dec 19, 2013 at 3:59 PM, Guozhang Wang <wangg...@gmail.com> wrote:

> Yes, please go ahead.

On Thu, Dec 19, 2013 at 2:30 AM, Hanish Bansal <hanish.bansal.agar...@gmail.com> wrote:

> Hi Guozhang,
>
> Can I try it now with trunk HEAD Kafka code after applying the patch
> KAFKA-1188.patch
> <https://issues.apache.org/jira/secure/attachment/12619475/KAFKA-1188.patch>?

On Wed, Dec 18, 2013 at 9:49 PM, Guozhang Wang <wangg...@gmail.com> wrote:

> The Kafka server's shutdown hook should capture all signals except SIGKILL
> (-9), and the controlled shutdown process will not be triggered in that case.
>
> That said, if the other replica is in the ISR, then even kill -9 should not
> lose data. I am currently working on this JIRA, which might be related if the
> brokers are bounced iteratively:
>
> https://issues.apache.org/jira/browse/KAFKA-1188
>
> Hanish, could you retry trunk HEAD once this one is resolved?

On Wed, Dec 18, 2013 at 12:00 PM, Joe Stein <joe.st...@stealth.ly> wrote:

> Leader election should start, for the partitions on that replica for which it
> was the leader, carried out by the other replicas still in the ISR, and the
> failed leader is removed from the ISR. The ISR will also shrink for all other
> partitions for which this broker is in the ISR but is not the leader.
>
> So there is a lot of re-jiggling, and the time it takes is going to be related
> to how many partitions and brokers you have.

On Wed, Dec 18, 2013 at 2:49 PM, Robert Rodgers <rsrodg...@gmail.com> wrote:

> What happens if the physical machine dies or the kernel panics?

On Dec 18, 2013, at 9:44 AM, Hanish Bansal <hanish.bansal.agar...@gmail.com> wrote:

> Yup, I would definitely like to try that, if the controlled.shutdown.enable
> property works in the case of kill -9.
>
> I hope that this option will be perfect.
>
> Thanks for the quick response, really appreciate it.

On Wed, Dec 18, 2013 at 10:52 PM, Joe Stein <joe.st...@stealth.ly> wrote:

> Wouldn't you want to set controlled.shutdown.enable to true, so that the
> broker would do this for you before ending itself?
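For reference, a minimal server.properties sketch of the broker-side setting Joe refers to; the values shown are illustrative, and, as noted elsewhere in the thread, controlled shutdown only runs on a trappable shutdown signal (Ctrl+C/SIGTERM), not on kill -9:

    # Migrate partition leadership off this broker before it shuts down cleanly.
    controlled.shutdown.enable=true
    # How many times to retry the leadership migration, and how long to back off.
    controlled.shutdown.max.retries=3
    controlled.shutdown.retry.backoff.ms=5000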
On Wed, Dec 18, 2013 at 11:36 AM, pushkar priyadarshi <priyadarshi.push...@gmail.com> wrote:

> My doubt was that the messages are being dropped at the producer level only,
> so I suggested playing with parameters like retries and backoff.ms, and also
> with the metadata refresh interval on the producer side.
>
> Regards,
> Pushkar

On Wed, Dec 18, 2013 at 10:01 PM, Guozhang Wang <wangg...@gmail.com> wrote:

> Hanish,
>
> Did you "kill -9" one of the brokers only, or bounce them iteratively?
>
> Guozhang

On Wed, Dec 18, 2013 at 8:02 AM, Joe Stein <joe.st...@stealth.ly> wrote:

> How many replicas do you have?

On Wed, Dec 18, 2013 at 8:57 AM, Hanish Bansal <hanish.bansal.agar...@gmail.com> wrote:

> Hi Pushkar,
>
> I tried configuring "message.send.max.retries" to 10 (the default value is 3),
> but I am still facing data loss.

On Wed, Dec 18, 2013 at 12:44 PM, pushkar priyadarshi <priyadarshi.push...@gmail.com> wrote:

> You can try setting a higher value for "message.send.max.retries" in the
> producer config.
>
> Regards,
> Pushkar
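For reference, a minimal sketch of the producer settings discussed in this thread, using the 0.8 Java producer API; the broker hostnames, retry values and message loop are illustrative, not taken from the thread:

    import java.util.Properties;

    import kafka.javaapi.producer.Producer;
    import kafka.producer.KeyedMessage;
    import kafka.producer.ProducerConfig;

    public class AckAllProducerSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            // Both brokers of the 2-node cluster (hostnames are placeholders).
            props.put("metadata.broker.list", "broker0:9092,broker1:9092");
            props.put("serializer.class", "kafka.serializer.StringEncoder");
            // Wait for the full ISR to acknowledge each request.
            props.put("request.required.acks", "-1");
            // Retry more than the default of 3 while leadership moves,
            // with a longer backoff between attempts.
            props.put("message.send.max.retries", "10");
            props.put("retry.backoff.ms", "1000");

            Producer<String, String> producer =
                    new Producer<String, String>(new ProducerConfig(props));
            for (int i = 0; i < 1000; i++) {
                producer.send(new KeyedMessage<String, String>(
                        "test-trunk111", Integer.toString(i), "message-" + i));
            }
            producer.close();
        }
    }

With request.required.acks set to -1, a send does not return successfully until every replica currently in the ISR has acknowledged the write, which is the guarantee the documentation link below describes.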
On Wed, Dec 18, 2013 at 5:34 PM, Hanish Bansal <hanish.bansal.agar...@gmail.com> wrote:

> Hi All,
>
> We have a Kafka cluster of 2 nodes (using the 0.8.0 final release).
> Replication factor: 2
> Number of partitions: 2
>
> I have configured request.required.acks in the producer configuration to -1.
>
> As mentioned in the documentation
> http://kafka.apache.org/documentation.html#producerconfigs, setting this value
> to -1 provides a guarantee that no messages will be lost.
>
> I am seeing the following behaviour:
>
> If Kafka is running as a foreground process and I shut down the Kafka leader
> node using Ctrl+C, then no data is lost.
>
> But if I terminate Kafka abnormally using "kill -9 <pid>", then I still face
> data loss, even after configuring request.required.acks to -1.
>
> Any suggestions?
>
> --
> Thanks & Regards,
> Hanish Bansal

--
-- Guozhang