Hi All,

Just a quick update to say that I've overcome this issue by temporarily
setting unclean.leader.election.enable=false on the __consumer_offsets
topic, performing a rolling restart of the cluster, and then applying a new
reassignment config to the two partitions that were missing a leader,
before finally setting the property back to
unclean.leader.election.enable=true.
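For anyone hitting the same issue, the per-topic override can be applied and reverted with kafka-configs.sh. The following is a dry-run sketch that only prints (and saves) the commands rather than executing them, so it is safe to run anywhere; the broker address is an assumption, and which value is the temporary one depends on your cluster default (the sketch assumes the common case of temporarily enabling unclean election, then reverting to the broker default):

```shell
#!/bin/sh
# Dry-run sketch of the recovery steps above: the commands are printed
# (and saved to cmds.txt) rather than executed, so this is safe to run.
# Broker address and the temporary value are illustrative assumptions.
BOOTSTRAP="localhost:9092"
TOPIC="__consumer_offsets"

: > cmds.txt

# 1. Apply a temporary per-topic override of unclean.leader.election.enable.
echo "kafka-configs.sh --bootstrap-server $BOOTSTRAP --alter" \
     "--entity-type topics --entity-name $TOPIC" \
     "--add-config unclean.leader.election.enable=true" | tee -a cmds.txt

# 2. Rolling restart of the brokers (site-specific, not shown).
# 3. Reassign the leaderless partitions with kafka-reassign-partitions.sh.

# 4. Remove the override so the topic falls back to the broker default.
echo "kafka-configs.sh --bootstrap-server $BOOTSTRAP --alter" \
     "--entity-type topics --entity-name $TOPIC" \
     "--delete-config unclean.leader.election.enable" | tee -a cmds.txt
```

Removing the override with --delete-config (rather than re-adding an explicit value) lets the topic inherit whatever the broker-level default is.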

Thank you again for your help and advice.

Kindest regards,

Tom

On Wed, 11 Jan 2023 at 10:59, Tom Bolitho <tboli...@gmail.com> wrote:

> Just an update on the 'Leader: none' issue for one of the partitions of my
> __consumer_offsets topic: I have tried deleting all of the
> partition.metadata files relating to the __consumer_offsets topic on all of
> the nodes in the cluster, and then restarted each node. Unfortunately, the
> issue still persists; partitions 7 and 11 of __consumer_offsets still have
> no leader. The Replicas and Isr will also not update when running
> kafka-reassign-partitions.sh; they are stuck on a node that we are looking
> to decommission but which is still running and part of the cluster.
>
> I have attempted to increase the replication factor of the
> __consumer_offsets topic to 3 and move the replicas to the new cluster
> (brokers 1,2,3,4) by running kafka-reassign-partitions.sh on that topic, e.g.
> kafka-reassign-partitions.sh --bootstrap-server localhost:9092
> --reassignment-json-file /<reassignmentfilename>.json --execute
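For reference, a reassignment file for the two stuck partitions might look like the sketch below. The target broker ids (1,2,3) are an assumption based on the new brokers mentioned above; pick three of them. The sketch writes the file and sanity-checks that it parses as JSON; the actual --execute step against the cluster is shown as a comment:

```shell
#!/bin/sh
# Sketch of a reassignment file for the two leaderless partitions.
# The target broker ids (1,2,3) are an assumption for illustration.
cat > reassignment.json <<'EOF'
{
  "version": 1,
  "partitions": [
    {"topic": "__consumer_offsets", "partition": 7, "replicas": [1, 2, 3]},
    {"topic": "__consumer_offsets", "partition": 11, "replicas": [1, 2, 3]}
  ]
}
EOF

# Sanity-check that the file parses as JSON before handing it to the tool.
python3 -m json.tool reassignment.json > /dev/null && echo "reassignment.json OK"

# Then, against the real cluster:
#   kafka-reassign-partitions.sh --bootstrap-server localhost:9092 \
#     --reassignment-json-file reassignment.json --execute
```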
>
> This has been partially successful. All but two of the partitions of the
> __consumer_offsets topic have reassigned their replicas to the new cluster
> and updated their leader. However, the reassignment of partitions 7 and 11
> just hangs, and the Replicas and Isr for those partitions are stuck on
> node 5, which we are looking to decommission. When running --verify on the
> reassignment task, the status of that partition reassignment is:
>
> 'Reassignment of partition __consumer_offsets-7 is still in progress'
>
> The only option is to then --cancel the reassignment.
>
> Below is the --describe output for two of the partitions of the
> __consumer_offsets topic. As you can see, partition 6 has successfully
> updated to a replication factor of 3, moved its replicas from node 5 to
> the new cluster, and assigned the first replica as the partition leader.
> However, partition 7 is still stuck with only one replica, on node 5,
> which we are looking to decommission, and it also has no leader.
>
> Topic: __consumer_offsets  Partition: 6  Leader: 3     Replicas: 3,4,1  Isr: 3,4,1
> Topic: __consumer_offsets  Partition: 7  Leader: None  Replicas: 5      Isr: 5
>
> It's worth noting that partition 7 has been in this state on the old
> cluster for a while without a noticeable performance impact. However, we
> are looking to decommission node 5, and we really need to fully migrate
> the __consumer_offsets topic, and all of its partitions, to the new
> cluster before we do so.
>
> Thank you for your ideas and input so far; they have been very much
> appreciated. If anyone else has any other ideas on how I can reassign the
> remaining replicas of this unhappy partition to our new cluster and
> re-establish a leader, I'd be grateful for any kind of steer.
>
> As some extra info, I have checked the topic_id in every
> partition.metadata file relating to the problematic topic and they are all
> the same.
>
> Is the last resort to delete the __consumer_offsets topic? Or will this
> cause data loss? This is unfortunately a production system.
>
> Many thanks,
>
> Tom
>
> On Tue, 10 Jan 2023 at 13:58, Tom Bolitho <tboli...@gmail.com> wrote:
>
>> Hi Megh
>>
>> Many thanks for taking the time to get back to me. It sounds like we've
>> had a similar issue, although I've checked the topic_id in every
>> partition.metadata file relating to the problematic __consumer_offsets
>> topic (e.g. grep -r 'topic_id' /data/*/kafka/data/__consumer_offsets-*)
>> and the topic ID is the same for every partition, on all of the nodes for
>> that topic. It sounds like some of your topic_ids were different, so
>> perhaps you had a slightly different issue and resolution?
>>
>> Kind regards,
>>
>> Tom
>>
>> On Tue, 10 Jan 2023 at 02:08, megh vidani <vidanimeg...@gmail.com> wrote:
>>
>>> Hi Tom,
>>>
>>> We faced a similar problem wherein there was an issue with the ISR and
>>> we were also getting NotLeaderOrFollowerException on the consumer end.
>>> It was also not getting fixed automatically by broker restarts.
>>>
>>> We eventually found out that the topicId for a few partitions of the
>>> topic (in the partition.metadata file) was different from the actual
>>> topicId in ZooKeeper. I'd suggest you check that as well.
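A consistency check along these lines can be scripted. The sketch below fabricates two mock partition.metadata files (the two-line version/topic_id format and the topic_id value are assumptions for illustration) and verifies that every partition reports the same id; in practice you would point LOGDIR at the real Kafka data directory and compare the result against ZooKeeper's record:

```shell
#!/bin/sh
# Sketch: verify the topic_id is identical across all partition dirs.
# Uses a mock log dir; point LOGDIR at the real Kafka data dir in practice.
LOGDIR="./mock_zk_check"
for p in 7 11; do
  mkdir -p "$LOGDIR/__consumer_offsets-$p"
  printf 'version: 0\ntopic_id: vN7mCp2lQWeGiQYNGzUJgg\n' \
    > "$LOGDIR/__consumer_offsets-$p/partition.metadata"
done

# Collect the distinct ids; more than one distinct value means a mismatch.
ids=$(grep -h '^topic_id' "$LOGDIR"/__consumer_offsets-*/partition.metadata | sort -u)
count=$(printf '%s\n' "$ids" | wc -l)
if [ "$count" -eq 1 ]; then
  echo "consistent: $ids"
else
  echo "MISMATCH across partitions:"
  printf '%s\n' "$ids"
fi

# Compare against ZooKeeper's record for the topic, e.g.:
#   zookeeper-shell.sh localhost:2181 get /brokers/topics/__consumer_offsets
```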
>>>
>>> The way we fixed it was to remove the partition.metadata file (only this
>>> file, nothing else!) from all of the partition directories of that topic
>>> and then restart the brokers. This was the safest option we found, as it
>>> doesn't incur any data loss. Before figuring this out we used to delete
>>> and re-create the topic, which resulted in data loss.
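The removal step described above can be sketched as follows. Since deleting the wrong files on a live broker would be destructive, this runs against a throwaway mock log directory that it creates itself (the directory layout and file names are assumptions mirroring Kafka's on-disk format); on a real broker you would point LOGDIR at the actual data directory and triple-check the find expression first:

```shell
#!/bin/sh
# Sketch of the fix: remove ONLY the partition.metadata files, leaving the
# log segments untouched. Demonstrated on a throwaway mock log dir so it
# is safe to run.
LOGDIR="./mock_logdir"
for p in 0 7 11; do
  mkdir -p "$LOGDIR/__consumer_offsets-$p"
  touch "$LOGDIR/__consumer_offsets-$p/partition.metadata"
  touch "$LOGDIR/__consumer_offsets-$p/00000000000000000000.log"  # data file
done

# Print each match, then delete it; nothing else in the tree is touched.
find "$LOGDIR" -name partition.metadata -print -delete

# The data files survive; restart the brokers afterwards so the metadata
# files are regenerated with a consistent topic id.
ls "$LOGDIR/__consumer_offsets-7"
```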
>>>
>>> Hope this helps.
>>>
>>> Thanks,
>>> Megh
>>>
>>> On Mon, 9 Jan 2023, 22:28 Tom Bolitho, <tboli...@gmail.com> wrote:
>>>
>>> > Dear Kafka Community,
>>> >
>>> > I'm hoping you can help with a Kafka topic partition that is missing a
>>> > leader. The topic in question is the '__consumer_offsets' topic.
>>> >
>>> > The output of a '--describe' on that topic looks like:
>>> >
>>> > Topic: __consumer_offsets  Partition: 7   Leader: none  Replicas: 5  Isr: 5
>>> > Topic: __consumer_offsets  Partition: 11  Leader: none  Replicas: 5  Isr: 5
>>> >
>>> > The other 48 partitions are all OK and have an assigned leader (some
>>> > with 5 as the leader).
>>> >
>>> > I have tried running a --reassignment-json-file against the topic, e.g.
>>> >
>>> > kafka-reassign-partitions.sh --bootstrap-server localhost:9092
>>> > --reassignment-json-file /<reassignmentfilename>.json --execute
>>> >
>>> > but the reassignment just hangs, with the two partitions that are
>>> > missing a leader reporting:
>>> > 'Reassignment of partition __consumer_offsets-7 is still in progress'
>>> >
>>> > I've since had to --cancel that reassignment.
>>> >
>>> > Can anyone advise on how I can overcome the issue of this missing
>>> > leader, please?
>>> >
>>> > My eventual goal is to reassign this __consumer_offsets topic with a
>>> > replication factor of 3, to increase resiliency now that the cluster
>>> > is in production. I realise we should have set
>>> > offsets.topic.replication.factor to a value higher than 1 before we
>>> > spun up the prod cluster, but this was missed, so we're now looking to
>>> > manually reassign __consumer_offsets with a higher replication factor.
>>> >
>>> > Any advice on how to overcome this 'Leader: none' issue would be
>>> > greatly appreciated.
>>> >
>>> > Many thanks,
>>> >
>>> > Tom
>>> >
>>>
>>
