OK, it seems you had a controller migration some time ago and the old controller (broker 0) did not de-register its ZK listeners, even though its controller modules (such as the partition state machine) had already been shut down. You can try to verify this through the active-controller metrics.
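For example, here is a minimal sketch of such a check (this assumes JMX is enabled on each broker, e.g. JMX_PORT=9999, and uses the stock kafka.tools.JmxTool class; the host names are placeholders, and the exact MBean name / quoting can differ slightly between Kafka versions, so adjust to what a JMX browser shows). Exactly one broker should report ActiveControllerCount = 1:

# Poll the active-controller gauge on each broker; only the current
# controller should report 1. JmxTool keeps polling; stop it with Ctrl-C.
for host in broker0 broker1 broker2; do
  echo "== $host =="
  /usr/local/kafka/bin/kafka-run-class.sh kafka.tools.JmxTool \
    --object-name 'kafka.controller:type=KafkaController,name=ActiveControllerCount' \
    --jmx-url "service:jmx:rmi:///jndi/rmi://$host:9999/jmxrmi" \
    --reporting-interval 1000
done

# Alternatively, the /controller znode in ZooKeeper records which broker
# currently holds the controllership:
#   /usr/local/zookeeper/zkCli.sh -server localhost:2181 get /controller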
If that is the case, you can try bouncing the old controller broker and re-running the admin tool to see if it works now. There are a couple of known bugs in older versions of Kafka which can cause resigned controllers to not de-register their ZK listeners; which version are you using? I suggest upgrading to the latest version and seeing if those issues go away.

Guozhang

On Fri, Jul 10, 2015 at 11:03 AM, Krishna Kumar <kku...@nanigans.com> wrote:

> Yes, there were messages in the controller logs such as
>
> DEBUG [OfflinePartitionLeaderSelector]: No broker in ISR is alive for
> [topic1,2]. Pick the leader from the alive assigned replicas:
> (kafka.controller.OfflinePartitionLeaderSelector)
>
> ERROR [Partition state machine on Controller 0]: Error while moving some
> partitions to NewPartition state (kafka.controller.PartitionStateMachine)
> kafka.common.StateChangeFailedException: Controller 0 epoch 0 initiated
> state change for partition [topic1,17] to NewPartition failed because the
> partition state machine has not started
>
> ERROR [AddPartitionsListener on 0]: Error while handling add partitions
> for data path /brokers/topics/topic1
> (kafka.controller.PartitionStateMachine$AddPartitionsListener)
> java.util.NoSuchElementException: key not found: [topic1,17]
>
> INFO [Controller 0]: List of topics ineligible for deletion: topic1
>
> Quite a lot of these actually
>
> On 7/10/15, 1:44 PM, "Guozhang Wang" <wangg...@gmail.com> wrote:
>
> >Krish,
> >
> >If you only add a new broker (for example broker 3) into your cluster
> >without doing anything else, this broker will not automatically get any
> >topic-partitions migrated to itself, so I suspect at least some admin
> >tools were executed.
> >
> >The log exceptions you showed in the previous emails come from the server
> >logs; could you also check the controller logs (on broker 1 in your
> >scenario) and see if there are any exceptions / errors?
> >
> >Guozhang
> >
> >On Fri, Jul 10, 2015 at 8:09 AM, Krishna Kumar <kku...@nanigans.com>
> >wrote:
> >
> >> So we think we have a process to fix this issue via ZooKeeper. If
> >> anyone has any thoughts, please let me know.
> >>
> >> First get the "state" from a good partition, to get the correct epochs:
> >>
> >> In /usr/local/zookeeper/zkCli.sh
> >>
> >> [zk: localhost:2181(CONNECTED) 4] get /brokers/topics/topic1/partitions/6/state
> >>
> >> {"controller_epoch":22,"leader":1,"version":1,"leader_epoch":55,"isr":[2,0,1]}
> >>
> >> Then, as long as we are sure those brokers have replicas, we set this onto
> >> the "stuck" partition (6 is unstuck, 4 is stuck):
> >>
> >> set /brokers/topics/topic1/partitions/4/state {"controller_epoch":22,"leader":1,"version":1,"leader_epoch":55,"isr":[2,0,1]}
> >>
> >> And run the rebalance for that partition only:
> >>
> >> su java -c "/usr/local/kafka/bin/kafka-preferred-replica-election.sh
> >> --zookeeper localhost:2181 --path-to-json /tmp/topic1.json"
> >>
> >> Json file:
> >>
> >> {
> >> "version":1,
> >> "partitions":[{"topic":"topic1","partition":4}]
> >> }
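(One way to sanity-check an edit like the one above, assuming the same ZooKeeper host/port and topic as in the procedure: read the znode back and re-run the topic describe, and confirm the leader and ISR look sane before moving on.)

# Read the "stuck" partition's state back, non-interactively:
/usr/local/zookeeper/zkCli.sh -server localhost:2181 \
  get /brokers/topics/topic1/partitions/4/state

# Then check the leader/ISR as Kafka itself reports it:
/usr/local/kafka/bin/kafka-topics.sh --describe \
  --zookeeper localhost:2181 --topic topic1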
> >>
> >> On 7/9/15, 8:32 PM, "Krishna Kumar" <kku...@nanigans.com> wrote:
> >>
> >> Well, 3 (the new node) was shut down, so there were no messages there. "1"
> >> was the leader and we saw the messages on "0" and "2".
> >>
> >> We managed to resolve this new problem to an extent by shutting down "1".
> >> We were worried because "1" was the only replica in the ISR. But once it
> >> went down, "0" and "2" entered the ISR. Then on bringing back "1", it too
> >> added itself to the ISR.
> >>
> >> We still see a few partitions in some topics that do not have all the
> >> replicas in the ISR. Hopefully, that resolves itself over the next few
> >> hours.
> >>
> >> But finally we are in the same spot we were earlier. There are partitions
> >> with Leader "3" although "3" is not one of the replicas, and none of the
> >> replicas are in the ISR. We want to remove "3" as a leader and get the
> >> others working. Not sure what our options are.
> >>
> >> On 7/9/15, 8:24 PM, "Guozhang Wang" <wangg...@gmail.com> wrote:
> >>
> >> Krish,
> >>
> >> Do brokers 0 and 3 have similar warn log entries as broker 2 for stale
> >> controller epochs?
> >>
> >> Guozhang
> >>
> >> On Thu, Jul 9, 2015 at 2:07 PM, Krishna Kumar <kku...@nanigans.com> wrote:
> >>
> >> So we tried taking that node down. But that didn't fix the issue, so we
> >> restarted the other nodes.
> >>
> >> This seems to have led to 2 of the other replicas dropping out of the ISR
> >> for *all* topics.
> >>
> >> Topic: topic2  Partition: 0  Leader: 1  Replicas: 1,0,2  Isr: 1
> >> Topic: topic2  Partition: 1  Leader: 1  Replicas: 2,1,0  Isr: 1
> >> Topic: topic2  Partition: 2  Leader: 1  Replicas: 0,2,1  Isr: 1
> >> Topic: topic2  Partition: 3  Leader: 1  Replicas: 1,2,0  Isr: 1
> >>
> >> I am seeing this message => Broker 2 ignoring LeaderAndIsr request from
> >> controller 1 with correlation id 8685 since its controller epoch 21 is
> >> old. Latest known controller epoch is 89 (state.change.logger)
> >>
> >> On 7/9/15, 4:02 PM, "Krishna Kumar" <kku...@nanigans.com> wrote:
> >>
> >> >Thanks Guozhang
> >> >
> >> >We did do the partition-assignment, but against another topic, and that
> >> >went well.
> >> >
> >> >But this happened for this topic without doing anything.
> >> >
> >> >Regards
> >> >Krish
> >> >
> >> >On 7/9/15, 3:56 PM, "Guozhang Wang" <wangg...@gmail.com> wrote:
> >> >
> >> >>Krishna,
> >> >>
> >> >>Did you run any admin tools after adding the node (I assume it is node
> >> >>3), like partition-assignment? It is shown as the only one in the ISR
> >> >>list but not in the replica list, which suggests that the partition
> >> >>migration process was not completed.
> >> >>
> >> >>You can verify if this is the case by checking your controller log and
> >> >>see if there are any exception / error entries.
> >> >>
> >> >>Guozhang
> >> >>
> >> >>On Thu, Jul 9, 2015 at 12:04 PM, Krishna Kumar <kku...@nanigans.com>
> >> >>wrote:
> >> >>
> >> >>> Hi
> >> >>>
> >> >>> We added a Kafka node and it suddenly became the leader and the sole
> >> >>> replica for some partitions, but it is not in the ISR.
> >> >>>
> >> >>> Any idea how we might be able to fix this?
> >> >>> We are on Kafka 0.8.2
> >> >>>
> >> >>> Topic: topic1  Partition: 0  Leader: 2  Replicas: 2,1,0  Isr: 2,0,1
> >> >>> Topic: topic1  Partition: 1  Leader: 3  Replicas: 0,2,1  Isr: 3
> >> >>> Topic: topic1  Partition: 2  Leader: 3  Replicas: 1,0,2  Isr: 3
> >> >>> Topic: topic1  Partition: 3  Leader: 2  Replicas: 2,0,1  Isr: 2,0,1
> >> >>> Topic: topic1  Partition: 4  Leader: 3  Replicas: 0,1,2  Isr: 3
> >> >>> Topic: topic1  Partition: 5  Leader: 1  Replicas: 1,2,0  Isr: 1,2,0
> >> >>> Topic: topic1  Partition: 6  Leader: 3  Replicas: 2,1,0  Isr: 3
> >> >>> Topic: topic1  Partition: 7  Leader: 0  Replicas: 0,2,1  Isr: 0,1,2
> >> >>>
> >> >>
> >> >>--
> >> >>-- Guozhang
> >> >
> >>
> >> --
> >> -- Guozhang
> >
> >
> >--
> >-- Guozhang
>

--
-- Guozhang