Yes, there were messages in the controller logs such as:

DEBUG [OfflinePartitionLeaderSelector]: No broker in ISR is alive for
[topic1,2]. Pick the leader from the alive assigned replicas:
(kafka.controller.OfflinePartitionLeaderSelector)

ERROR [Partition state machine on Controller 0]: Error while moving some
partitions to NewPartition state (kafka.controller.PartitionStateMachine)
kafka.common.StateChangeFailedException: Controller 0 epoch 0 initiated
state change for partition [topic1,17] to NewPartition failed because the
partition state machine has not started

ERROR [AddPartitionsListener on 0]: Error while handling add partitions
for data path /brokers/topics/topic1
(kafka.controller.PartitionStateMachine$AddPartitionsListener)
java.util.NoSuchElementException: key not found: [topic1,17]

INFO [Controller 0]: List of topics ineligible for deletion: topic1

Quite a lot of these, actually.
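For reference, a quick way to surface these across the controller logs (a
sketch; the log location depends on where log4j.properties points
controller logging, so the path here is an assumption):

grep -E "StateChangeFailedException|NoSuchElementException" \
    /usr/local/kafka/logs/controller.log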
On 7/10/15, 1:44 PM, "Guozhang Wang" <wangg...@gmail.com> wrote:

>Krish,
>
>If you only add a new broker (for example broker 3) into your cluster
>without doing anything else, this broker will not automatically get any
>topic-partitions migrated to itself, so I suspect at least some admin
>tools were executed.
>
>The log exceptions you showed in the previous emails come from the
>server logs. Could you also check the controller logs (on broker 1 in
>your scenario) and see if there are any exceptions / errors?
>
>Guozhang
>
>On Fri, Jul 10, 2015 at 8:09 AM, Krishna Kumar <kku...@nanigans.com>
>wrote:
>
>> So we think we have a process to fix this issue via ZooKeeper. If
>> anyone has any thoughts, please let me know.
>>
>> First, get the "state" from a good partition, to get the correct
>> epochs. In /usr/local/zookeeper/zkCli.sh:
>>
>> [zk: localhost:2181(CONNECTED) 4] get
>> /brokers/topics/topic1/partitions/6/state
>>
>> {"controller_epoch":22,"leader":1,"version":1,"leader_epoch":55,"isr":[2,0,1]}
>>
>> Then, as long as we are sure those brokers have replicas, we set this
>> onto the 'stuck' partition (6 is unstuck, 4 is stuck):
>>
>> set /brokers/topics/topic1/partitions/4/state
>> {"controller_epoch":22,"leader":1,"version":1,"leader_epoch":55,"isr":[2,0,1]}
>>
>> And run the preferred replica election for that partition only:
>>
>> su java -c "/usr/local/kafka/bin/kafka-preferred-replica-election.sh
>> --zookeeper localhost:2181 --path-to-json /tmp/topic1.json"
>>
>> JSON file (/tmp/topic1.json):
>>
>> {
>>   "version": 1,
>>   "partitions": [{"topic": "topic1", "partition": 4}]
>> }
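>>
>> The same steps collected into one snippet, for convenience (the paths,
>> epochs, and partition numbers are just the ones from above and would
>> need adjusting; zkCli's non-interactive command-as-arguments mode is
>> assumed):
>>
>> # read the state JSON from the healthy partition (6)
>> /usr/local/zookeeper/zkCli.sh -server localhost:2181 \
>>     get /brokers/topics/topic1/partitions/6/state
>>
>> # write that JSON over the stuck partition's state (4)
>> /usr/local/zookeeper/zkCli.sh -server localhost:2181 \
>>     set /brokers/topics/topic1/partitions/4/state \
>>     '{"controller_epoch":22,"leader":1,"version":1,"leader_epoch":55,"isr":[2,0,1]}'
>>
>> # re-run the election for that partition, then verify
>> /usr/local/kafka/bin/kafka-preferred-replica-election.sh \
>>     --zookeeper localhost:2181 --path-to-json /tmp/topic1.json
>> /usr/local/kafka/bin/kafka-topics.sh --zookeeper localhost:2181 \
>>     --describe --topic topic1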
>>
>> On 7/9/15, 8:32 PM, "Krishna Kumar" <kku...@nanigans.com> wrote:
>>
>> Well, 3 (the new node) was shut down, so there were no messages there.
>> "1" was the leader and we saw the messages on "0" and "2".
>>
>> We managed to resolve this new problem to an extent by shutting down
>> "1". We were worried because "1" was the only replica in the ISR. But
>> once it went down, "0" and "2" entered the ISR. Then on bringing back
>> "1", it too added itself to the ISR.
>>
>> We still see a few partitions in some topics that do not have all the
>> replicas in the ISR. Hopefully, that resolves itself over the next few
>> hours.
>>
>> But finally we are in the same spot we were in earlier. There are
>> partitions with Leader "3" although "3" is not one of the replicas, and
>> none of the replicas are in the ISR. We want to remove "3" as a leader
>> and get the others working. Not sure what our options are.
>>
>> On 7/9/15, 8:24 PM, "Guozhang Wang" <wangg...@gmail.com> wrote:
>>
>> Krish,
>>
>> Do brokers 0 and 3 have similar warn log entries as broker 2 for stale
>> controller epochs?
>>
>> Guozhang
>>
>> On Thu, Jul 9, 2015 at 2:07 PM, Krishna Kumar <kku...@nanigans.com>
>> wrote:
>>
>> So we tried taking that node down. But that didn't fix the issue, so we
>> restarted the other nodes.
>>
>> This seems to have led to two of the other replicas dropping out of the
>> ISR for *all* topics.
>>
>> Topic: topic2  Partition: 0  Leader: 1  Replicas: 1,0,2  Isr: 1
>> Topic: topic2  Partition: 1  Leader: 1  Replicas: 2,1,0  Isr: 1
>> Topic: topic2  Partition: 2  Leader: 1  Replicas: 0,2,1  Isr: 1
>> Topic: topic2  Partition: 3  Leader: 1  Replicas: 1,2,0  Isr: 1
>>
>> I am seeing this message => Broker 2 ignoring LeaderAndIsr request from
>> controller 1 with correlation id 8685 since its controller epoch 21 is
>> old. Latest known controller epoch is 89 (state.change.logger)
>>
>> On 7/9/15, 4:02 PM, "Krishna Kumar" <kku...@nanigans.com> wrote:
>>
>> >Thanks, Guozhang.
>> >
>> >We did run the partition assignment, but against another topic, and
>> >that went well.
>> >
>> >But this happened for this topic without our doing anything.
>> >
>> >Regards
>> >Krish
>> >
>> >On 7/9/15, 3:56 PM, "Guozhang Wang" <wangg...@gmail.com> wrote:
>> >
>> >>Krishna,
>> >>
>> >>Did you run any admin tools after adding the node (I assume it is
>> >>node 3), like partition assignment? It is shown as the only broker in
>> >>the ISR list but not in the replica list, which suggests that the
>> >>partition migration process was not completed.
>> >>
>> >>You can verify whether this is the case by checking your controller
>> >>log and seeing if there are any exception / error entries.
>> >>
>> >>Guozhang
>> >>
>> >>On Thu, Jul 9, 2015 at 12:04 PM, Krishna Kumar <kku...@nanigans.com>
>> >>wrote:
>> >>
>> >>> Hi,
>> >>>
>> >>> We added a Kafka node and it suddenly became the leader and the
>> >>> sole replica for some partitions, but it is not in the ISR.
>> >>>
>> >>> Any idea how we might be able to fix this? We are on Kafka 0.8.2.
>> >>>
>> >>> Topic: topic1  Partition: 0  Leader: 2  Replicas: 2,1,0  Isr: 2,0,1
>> >>> Topic: topic1  Partition: 1  Leader: 3  Replicas: 0,2,1  Isr: 3
>> >>> Topic: topic1  Partition: 2  Leader: 3  Replicas: 1,0,2  Isr: 3
>> >>> Topic: topic1  Partition: 3  Leader: 2  Replicas: 2,0,1  Isr: 2,0,1
>> >>> Topic: topic1  Partition: 4  Leader: 3  Replicas: 0,1,2  Isr: 3
>> >>> Topic: topic1  Partition: 5  Leader: 1  Replicas: 1,2,0  Isr: 1,2,0
>> >>> Topic: topic1  Partition: 6  Leader: 3  Replicas: 2,1,0  Isr: 3
>> >>> Topic: topic1  Partition: 7  Leader: 0  Replicas: 0,2,1  Isr: 0,1,2
>> >>
>> >>--
>> >>-- Guozhang
>>
>> --
>> -- Guozhang

>
>--
>-- Guozhang
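For anyone who hits this later, a quick sketch for spotting partitions in
this state (a leader that is not in its own replica list) by scanning the
describe output; the install path and ZooKeeper address are assumptions:

/usr/local/kafka/bin/kafka-topics.sh --zookeeper localhost:2181 --describe |
awk '{
  # pick out the Leader: and Replicas: fields on each partition line
  l = ""; r = ""
  for (i = 1; i <= NF; i++) {
    if ($i == "Leader:")   l = $(i + 1)
    if ($i == "Replicas:") r = $(i + 1)
  }
  # flag lines whose leader does not appear in its own replica list
  rl = "," r ","; ll = "," l ","
  if (l != "" && r != "" && index(rl, ll) == 0) print
}'

kafka-topics.sh --describe --under-replicated-partitions (if your build has
it) catches the shrunken-ISR case, but not the leader-outside-replicas one
above.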