Just an update on the 'Leader: none' issue affecting two partitions of my
__consumer_offsets topic: I have tried deleting all of the
partition.metadata files relating to the __consumer_offsets topic on all
of the nodes in the cluster and then restarting each node. Unfortunately,
the issue persists; partitions 7 and 11 of __consumer_offsets still have
no leader. The replicas and ISR also will not update when running
kafka-reassign-partitions.sh; they are stuck on a node that we are
looking to decommission but which is still running and part of the
cluster.
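
(In case it helps with diagnosis: assuming a ZooKeeper-backed cluster,
the controller's view of the stuck partition can be inspected with
something like the following, where localhost:2181 stands in for your
ZooKeeper connection string:

zookeeper-shell.sh localhost:2181 get /brokers/topics/__consumer_offsets/partitions/7/state

The returned JSON shows the leader, leader_epoch and isr the controller
currently believes for that partition.)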

I have attempted to increase the replication factor of the
__consumer_offsets topic to 3 and move the replicas to the new cluster
(1,2,3,4) by running kafka-reassign-partitions.sh on that topic, e.g.

kafka-reassign-partitions.sh --bootstrap-server localhost:9092 --reassignment-json-file /<reassignmentfilename>.json --execute
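
(For reference, the file follows the standard reassignment JSON format;
the replica lists below are only an illustration based on our target
broker set of 1,2,3,4:

{"version":1,"partitions":[
  {"topic":"__consumer_offsets","partition":7,"replicas":[1,2,3]},
  {"topic":"__consumer_offsets","partition":11,"replicas":[2,3,4]}
]}
)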

This has been partially successful. All but two of the partitions of the
__consumer_offsets topic have reassigned their replicas to the new
cluster and updated their leader. However, the reassignment of partitions
7 and 11 just hangs, and the replicas and ISR for those partitions remain
stuck on node 5, which we are looking to decommission. When running
--verify on the reassignment task, the status of those partitions is:

'Reassignment of partition __consumer_offsets-7 is still in progress'

The only option then is to --cancel the reassignment.
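
(Those two steps used the same tool, i.e.:

kafka-reassign-partitions.sh --bootstrap-server localhost:9092 --reassignment-json-file /<reassignmentfilename>.json --verify
kafka-reassign-partitions.sh --bootstrap-server localhost:9092 --reassignment-json-file /<reassignmentfilename>.json --cancel)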

Below is the --describe output for two of the partitions of the
__consumer_offsets topic. As you can see, partition 6 has successfully
updated its replication factor to 3, moved its replicas from node 5 to
the new cluster, and assigned the first replica as the leader. However,
partition 7 is still stuck with a single replica on node 5, the node we
are looking to decommission, and it also has no leader.
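
(The output below comes from the usual describe command, e.g.
kafka-topics.sh --bootstrap-server localhost:9092 --describe --topic __consumer_offsets)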

Topic: __consumer_offsets   Partition: 6   Leader: 3      Replicas: 3,4,1   Isr: 3,4,1
Topic: __consumer_offsets   Partition: 7   Leader: none   Replicas: 5       Isr: 5

It's worth noting that partition 7 has been in this state on the old
cluster for a while without a noticeable performance impact. However, we
are looking to decommission node 5, and we really need to fully migrate
the __consumer_offsets topic, and all of its partitions, to the new
cluster before we decommission that node.

Thank you for your ideas and input so far; it's been very much
appreciated. If anyone has any other ideas on how I can reassign the
remaining replicas of this unhappy partition to our new cluster and
re-establish a leader, I'd be grateful for any kind of steer.

As some extra info, I have checked all of the topic_ids in all of the
partition.metadata files relating to the problematic topic, and they are
all the same.

Is the last resort to delete the __consumer_offsets topic? Or would that
cause data loss? This is unfortunately a production system.

Many thanks,

Tom

On Tue, 10 Jan 2023 at 13:58, Tom Bolitho <tboli...@gmail.com> wrote:

> Hi Megh
>
> Many thanks for taking the time to get back to me. It sounds like we've
> had a similar issue, although I've checked all of the topic_ids in all of
> the partition.metadata files relating to the problematic topic
> __consumer_offsets (e.g. grep -r 'topic_id'
> /data/*/kafka/data/__consumer_offsets-*) and all of the topic IDs are the
> same on every partition, on all of the nodes, for that topic. It sounds
> like some of your topic_ids were different, so perhaps you had a slightly
> different issue and resolution?
>
> Kind regards,
>
> Tom
>
> On Tue, 10 Jan 2023 at 02:08, megh vidani <vidanimeg...@gmail.com> wrote:
>
>> Hi Tom,
>>
>> We faced a similar problem wherein there was an issue with the ISR and
>> we were also getting NotLeaderOrFollowerException on the consumer end.
>> Also, it was not getting fixed automatically with broker restarts.
>>
>> We eventually found out that the topicId for a few partitions of the
>> topic (in the partition.metadata file) was different from the actual
>> topicId in ZooKeeper. I'd suggest you check that as well.
>>
>> The way we fixed it was to remove the partition.metadata file (only this
>> file alone!!) from all the partition directories of that topic and then
>> restart the brokers. This was the safest option we found, as it doesn't
>> incur any data loss. Before figuring this out we used to delete and
>> re-create the topic, which resulted in data loss.
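>>
>> (As a sketch, assuming log directories under /data/*/kafka/data, that
>> removal can be done with something like:
>> find /data/*/kafka/data/__consumer_offsets-* -maxdepth 1 -name partition.metadata -delete
>> followed by a rolling restart of the brokers.)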
>>
>> Hope this helps.
>>
>> Thanks,
>> Megh
>>
>> On Mon, 9 Jan 2023, 22:28 Tom Bolitho, <tboli...@gmail.com> wrote:
>>
>> > Dear Kafka Community,
>> >
>> > I'm hoping you can help with a Kafka topic partition that is missing
>> > a leader. The topic in question is the '__consumer_offsets' topic.
>> >
>> > The output of a '--describe' on that topic looks like:
>> >
>> > Topic: __consumer_offsets   Partition: 7    Leader: none   Replicas: 5   Isr: 5
>> > Topic: __consumer_offsets   Partition: 11   Leader: none   Replicas: 5   Isr: 5
>> >
>> > The other 48 partitions are all ok and have an assigned leader (some
>> > with 5 as the leader).
>> >
>> > I have tried running a --reassignment-json-file against the topic, e.g.
>> >
>> > kafka-reassign-partitions.sh --bootstrap-server localhost:9092 --reassignment-json-file /<reassignmentfilename>.json --execute
>> >
>> > but the reassignment just hangs, with the two partitions that are
>> > missing a leader reporting:
>> > 'Reassignment of partition __consumer_offsets-7 is still in progress'
>> >
>> > I've since had to --cancel that reassignment.
>> >
>> > Can anyone advise on how I can overcome the issue of this missing leader
>> > please?
>> >
>> > My eventual goal is to reassign this __consumer_offsets topic with a
>> > replication factor of 3 to increase resiliency now that the cluster is
>> > in production. I realise we should have set
>> > offsets.topic.replication.factor to a value higher than 1 before we
>> > spun up the prod cluster, but this was missed, so we're now looking to
>> > manually reassign the __consumer_offsets topic with a higher
>> > replication factor.
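>> >
>> > (For reference, that broker setting lives in server.properties, e.g.
>> > offsets.topic.replication.factor=3, and it only takes effect when the
>> > offsets topic is first auto-created, hence the manual reassignment now.)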
>> >
>> > Any advice on how to overcome this 'Leader: none' issue would be greatly
>> > appreciated.
>> >
>> > Many thanks,
>> >
>> > Tom
>> >
>>
>
