Thanks, Joel, for looking into it. I will try to reproduce it. I don't think shutting down a ZooKeeper node (step 6) is needed, because I ran into it the first time just by shutting down the topic leaders.
Cal

On Tue, Jul 16, 2013 at 2:38 AM, Joel Koshy <jjkosh...@gmail.com> wrote:
> Hey Calvin,
>
> I apologize for not being able to get to this sooner. I don't think I
> can reproduce the full scenario exactly, as I don't have exclusive
> access to that many machines, but I tried it locally and couldn't
> reproduce it. Any chance you can reproduce it with a smaller
> deployment? Is step 6 required? Would you mind pasting the full stack
> trace that you saw?
>
> Thanks,
>
> Joel
>
> On Wed, Jul 10, 2013 at 11:10 PM, Joel Koshy <jjkosh...@gmail.com> wrote:
>> Ok, thanks - I'll go through this tomorrow.
>>
>> Joel
>>
>> On Wed, Jul 10, 2013 at 9:14 PM, Calvin Lei <ckp...@gmail.com> wrote:
>>> Joel,
>>> So I was able to reproduce the issue that I experienced. Please see the
>>> steps below.
>>> 1. Set up a 3-ZooKeeper, 6-broker cluster. Create one topic with 2
>>> partitions and replication factor 3.
>>> 2. Set up and run the console consumer, consuming messages from that topic.
>>> 3. Produce a few messages to confirm the consumer is working.
>>> 4. Stop the consumer.
>>> 5. Shut down (uncontrolled) the lead broker of one of the partitions.
>>> 6. Shut down one of the ZooKeeper nodes.
>>> 7. Run the list-topic script to confirm a new leader has been elected.
>>> 8. Bring up the console consumer again.
>>> 9. The console consumer won't start because of an error during rebalancing
>>> (when fetching topic metadata):
>>> Error: java.util.NoSuchElementException: key not found: 5
>>> Trace: ClientUtils.scala:67
>>>
>>> where broker 5 was the lead broker I shut down. I am using 0.8 beta.
>>>
>>> Thanks,
>>> Cal
>>>
>>> On Tue, Jul 9, 2013 at 11:20 PM, Calvin Lei <ckp...@gmail.com> wrote:
>>>> I will try to reproduce it; it was sporadic. My setup was a topic with 1
>>>> partition and replication factor = 3.
>>>> If I kill the console producer and then shut down the leader broker, a new
>>>> leader is elected.
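The "key not found: 5" error in step 9 is consistent with a metadata lookup that still names the dead broker as a partition leader. A minimal sketch of that failure mode, assuming a simplified model (this is illustrative Python, not Kafka's actual Scala code, and the helper name is made up):

```python
# Hypothetical, simplified model of the failure (not Kafka's actual code):
# the fetched topic metadata still names broker 5 as a partition leader,
# but broker 5 is gone from the live-broker map, so resolving the leader
# id fails - analogous to Scala's Map#apply throwing
# NoSuchElementException("key not found: 5").
live_brokers = {1: "host1:9092", 2: "host2:9092", 3: "host3:9092"}  # broker 5 is down
partition_leader_id = 5  # stale leader id carried in the metadata

def broker_for(brokers, leader_id):
    """Resolve a leader id to a broker address; fail if the broker is gone."""
    if leader_id not in brokers:
        raise KeyError("key not found: %d" % leader_id)
    return brokers[leader_id]
```

Once the client picks up fresh metadata in which a surviving replica has taken over leadership, the same lookup succeeds.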
>>>> If I again kill the new leader, I don't see the last broker
>>>> get elected as leader. Then when I tried starting the console producer, I
>>>> started seeing errors.
>>>>
>>>> On Tue, Jul 9, 2013 at 6:14 PM, Joel Koshy <jjkosh...@gmail.com> wrote:
>>>>> Not really - if you shut down a leader broker (and assuming your
>>>>> replication factor is > 1), then the other assigned replica will be
>>>>> elected as the new leader. The producer would then look up metadata,
>>>>> find the new leader, and send requests to it. What do you see in the
>>>>> logs?
>>>>>
>>>>> Joel
>>>>>
>>>>> On Tue, Jul 9, 2013 at 1:44 PM, Calvin Lei <ckp...@gmail.com> wrote:
>>>>>> Thanks, you gave me enough pointers to dig deeper. And I tested the fault
>>>>>> tolerance by shutting down brokers randomly.
>>>>>>
>>>>>> What I noticed is that if I shut down brokers while my producer and
>>>>>> consumer are still running, they recover fine. However, if I shut down a
>>>>>> lead broker without a running producer, I can't seem to start the producer
>>>>>> afterwards without restarting the previous lead broker. Is this expected?
>>>>>> On Jul 9, 2013 10:28 AM, "Joel Koshy" <jjkosh...@gmail.com> wrote:
>>>>>>> For 1, I forgot to add - there is an admin tool to reassign replicas,
>>>>>>> but it would take longer than leader failover.
>>>>>>>
>>>>>>> Joel
>>>>>>>
>>>>>>> On Tuesday, July 9, 2013, Joel Koshy wrote:
>>>>>>>> 1 - No, unless broker 4 is not the preferred leader. (The preferred
>>>>>>>> leader is the first broker in the assigned replica list.) If a
>>>>>>>> non-preferred replica is the current leader, you can run the
>>>>>>>> PreferredReplicaLeaderElection admin command to move the leader.
>>>>>>>> 2 - The actual leader movement (on leader failover) is fairly fast -
>>>>>>>> probably on the order of tens of ms.
>>>>>>>> However, clients (producers,
>>>>>>>> consumers) may take longer to detect that (a client needs to get back
>>>>>>>> an error response, handle the exception, issue a metadata request, and
>>>>>>>> get the response to find the new leader; all that can add up, but it
>>>>>>>> should not be terribly high - I'm guessing on the order of a few
>>>>>>>> hundred ms to a second or so).
>>>>>>>> 3 - That should work, although the admin command for adding more
>>>>>>>> partitions to a topic is currently being developed.
>>>>>>>>
>>>>>>>> On Mon, Jul 8, 2013 at 11:02 PM, Calvin Lei <ckp...@gmail.com> wrote:
>>>>>>>>> Hi,
>>>>>>>>> I have a few questions regarding the Kafka broker setup.
>>>>>>>>>
>>>>>>>>> 1. Assuming I have a 4-broker, 2-ZooKeeper (running in quorum mode)
>>>>>>>>> setup, if topicA-partition0 has its leader set to broker 4, can I
>>>>>>>>> change the leader to another broker without killing the current
>>>>>>>>> leader?
>>>>>>>>>
>>>>>>>>> 2. What is the latency of switching to a different leader when the
>>>>>>>>> current leader is down? Do we configure it using the consumer
>>>>>>>>> property refresh.leader.backoff.ms?
>>>>>>>>>
>>>>>>>>> 3. What is the best practice for dynamically adding a new node to a
>>>>>>>>> Kafka cluster? Should I bring up the node and then increase the
>>>>>>>>> replication factor for the existing topic(s)?
>>>>>>>>>
>>>>>>>>> Thanks in advance,
>>>>>>>>> Cal
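The leader-election rule Joel describes (the preferred leader is the first broker in the assigned replica list, and on failover leadership moves to another assigned replica) can be sketched roughly as follows. This is a simplified Python model with made-up helper names, not Kafka's controller code - among other things, the real controller elects from the in-sync replica set rather than merely from live brokers:

```python
# Simplified model (not Kafka's actual controller logic) of replica-list
# based leader election: pick the first replica, in assignment order, that
# is currently alive.
def preferred_leader(assigned_replicas):
    """The preferred leader is simply the head of the assigned replica list."""
    return assigned_replicas[0]

def elect_leader(assigned_replicas, live_brokers):
    """Return the first live replica in assignment order, or None if the
    partition has no live replica (i.e. it is offline)."""
    for broker_id in assigned_replicas:
        if broker_id in live_brokers:
            return broker_id
    return None
```

For example, with assigned replicas [4, 2, 6], broker 4 is the preferred leader; if broker 4 dies, broker 2 takes over, and once broker 4 is back the PreferredReplicaLeaderElection admin command would move leadership back to it.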