Hey Calvin,

I apologize for not getting to this sooner. I can't reproduce the full scenario exactly because I don't have exclusive access to that many machines, but I tried it locally and the problem did not show up. Any chance you can reproduce it with a smaller deployment? Is step 6 required? Would you mind pasting the full stack trace that you saw?

Thanks,
Joel

On Wed, Jul 10, 2013 at 11:10 PM, Joel Koshy <jjkosh...@gmail.com> wrote:
> Ok thanks - I'll go through this tomorrow.
>
> Joel
>
> On Wed, Jul 10, 2013 at 9:14 PM, Calvin Lei <ckp...@gmail.com> wrote:
>> Joel,
>> I was able to reproduce the issue that I experienced. Please see the
>> steps below.
>> 1. Set up a 3-zookeeper and 6-broker cluster. Set up one topic with 2
>>    partitions, with replication factor set to 3.
>> 2. Set up and run the console consumer, consuming messages from that topic.
>> 3. Produce a few messages to confirm the consumer is working.
>> 4. Stop the consumer.
>> 5. Shut down (uncontrolled) the lead broker of one of the partitions.
>> 6. Shut down one of the zookeepers.
>> 7. Run the list topic script to confirm a new leader has been elected.
>> 8. Bring up the console consumer again.
>> 9. The console consumer won't start because of an error in rebalancing
>>    (when fetching topic metadata).
>>    Error: Java.util.NoSuchElementException: Key Not Found (5).
>>    Trace: Client.Util.Scala:67
>>
>> Where broker 5 was the lead broker I shut down. I am using 0.8 beta.
>>
>> thanks,
>> Cal
>>
>>
>> On Tue, Jul 9, 2013 at 11:20 PM, Calvin Lei <ckp...@gmail.com> wrote:
>>
>>> I will try to reproduce it. It was sporadic. My setup was a topic with 1
>>> partition and replication factor = 3.
>>> If I kill the console producer and then shut down the leader broker, a new
>>> leader is elected. If I then kill the new leader, I don't see the last
>>> broker get elected as leader. When I then tried starting the console
>>> producer, I started seeing errors.
>>>
>>>
>>> On Tue, Jul 9, 2013 at 6:14 PM, Joel Koshy <jjkosh...@gmail.com> wrote:
>>>
>>>> Not really - if you shut down a leader broker (and assuming your
>>>> replication factor is > 1) then the other assigned replica will be
>>>> elected as the new leader. The producer would then look up metadata,
>>>> find the new leader and send requests to it. What do you see in the
>>>> logs?
>>>>
>>>> Joel
>>>>
>>>> On Tue, Jul 9, 2013 at 1:44 PM, Calvin Lei <ckp...@gmail.com> wrote:
>>>> > Thanks, you gave me enough pointers to dig deeper. I tested the fault
>>>> > tolerance by shutting down brokers randomly.
>>>> >
>>>> > What I noticed is that if I shut down brokers while my producer and
>>>> > consumer are still running, they recover fine. However, if I shut down
>>>> > a lead broker without a running producer, I can't seem to start the
>>>> > producer afterwards without restarting the previous lead broker. Is
>>>> > this expected?
>>>> > On Jul 9, 2013 10:28 AM, "Joel Koshy" <jjkosh...@gmail.com> wrote:
>>>> >
>>>> >> For 1 I forgot to add - there is an admin tool to reassign replicas,
>>>> >> but it would take longer than leader failover.
>>>> >>
>>>> >> Joel
>>>> >>
>>>> >> On Tuesday, July 9, 2013, Joel Koshy wrote:
>>>> >>
>>>> >> > 1 - no, unless broker4 is not the preferred leader. (The preferred
>>>> >> > leader is the first broker in the assigned replica list). If a
>>>> >> > non-preferred replica is the current leader you can run the
>>>> >> > PreferredReplicaLeaderElection admin command to move the leader.
>>>> >> > 2 - The actual leader movement (on leader failover) is fairly quick -
>>>> >> > probably on the order of tens of ms. However, clients (producers,
>>>> >> > consumers) may take longer to detect it (a client needs to get back
>>>> >> > an error response, handle an exception, issue a metadata request, and
>>>> >> > get the response to find the new leader; all that can add up, but it
>>>> >> > should not be terribly high - I'm guessing on the order of a few
>>>> >> > hundred ms to a second or so).
>>>> >> > 3 - That should work, although the admin command for adding more
>>>> >> > partitions to a topic is currently being developed.
>>>> >> >
>>>> >> >
>>>> >> > On Mon, Jul 8, 2013 at 11:02 PM, Calvin Lei <ckp...@gmail.com> wrote:
>>>> >> > > Hi,
>>>> >> > > I have three questions regarding the kafka broker setup.
>>>> >> > >
>>>> >> > > 1. Assuming I have a 4-broker and 2-zookeeper (running in quorum
>>>> >> > > mode) setup, if topicA-partition0 has the leader set to broker4,
>>>> >> > > can I change the leader to another broker without killing the
>>>> >> > > current leader?
>>>> >> > >
>>>> >> > > 2. What is the latency of switching to a different leader when the
>>>> >> > > current leader is down? Do we configure it using the consumer
>>>> >> > > property refresh.leader.backoff.ms?
>>>> >> > >
>>>> >> > > 3. What is the best practice for dynamically adding a new node to a
>>>> >> > > kafka cluster? Should I bring up the node, and then increase the
>>>> >> > > replication factor for the existing topic(s)?
>>>> >> > >
>>>> >> > >
>>>> >> > > thanks in advance,
>>>> >> > > Cal
>>>> >> >
>>>> >>
>>>>
>>>
>>>
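
For reference, the preferred-leader rule from answer 1 in the thread (the preferred leader is the first broker in the assigned replica list, and a non-preferred current leader can be moved with the PreferredReplicaLeaderElection admin command) can be sketched roughly as below. This is only an illustration, not Kafka code: the function names are made up for this note, and the toy failover simply picks the first assigned replica that is still alive, whereas a real broker elects from the in-sync replicas.

```python
from typing import Collection, Optional, Sequence


def preferred_leader(assigned_replicas: Sequence[int]) -> int:
    """Return the preferred leader: the first broker in the assigned replica list."""
    if not assigned_replicas:
        raise ValueError("assigned replica list is empty")
    return assigned_replicas[0]


def needs_preferred_election(
    assigned_replicas: Sequence[int],
    current_leader: int,
    live_brokers: Collection[int],
) -> bool:
    """True when the current leader is not the preferred one and the
    preferred broker is alive, i.e. when moving the leader would make sense."""
    preferred = preferred_leader(assigned_replicas)
    return current_leader != preferred and preferred in live_brokers


def failover_leader(
    assigned_replicas: Sequence[int],
    live_brokers: Collection[int],
) -> Optional[int]:
    """Hypothetical, simplified failover: first assigned replica that is
    still alive, or None if every assigned replica is down."""
    for broker_id in assigned_replicas:
        if broker_id in live_brokers:
            return broker_id
    return None


# topicA-partition0 assigned to brokers 4, 2 and 3; broker 4 goes down.
replicas = [4, 2, 3]
live = {1, 2, 3}
leader = failover_leader(replicas, live)
print("leader after failover:", leader)
print("election useful while 4 is down:", needs_preferred_election(replicas, leader, live))

# Broker 4 comes back; the leader stays where it is until an election moves it.
live.add(4)
print("election useful once 4 is back:", needs_preferred_election(replicas, leader, live))
```

With these assumptions, killing the leader is not needed to move leadership back: once the preferred broker is alive again, the election command is what restores it as leader.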