Hi All,

Just a quick update to say that I've overcome this issue by temporarily enabling unclean leader election (unclean.leader.election.enable=true) on the __consumer_offsets topic, performing a rolling restart of the cluster, and then applying a new reassignment config to the two partitions that were missing a leader. Finally, I set the property back to unclean.leader.election.enable=false.
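For anyone finding this thread later: a topic-level override like this can be toggled with kafka-configs.sh. The following is a minimal Python sketch, not a record of the exact commands run here. The helper name unclean_election_command is made up for illustration, and localhost:9092 is borrowed from the reassignment command quoted further down. It only prints the commands so they can be reviewed and run by hand.

```python
import shlex


def unclean_election_command(topic, enabled, bootstrap="localhost:9092"):
    """Build the kafka-configs.sh argv that sets the topic-level override.

    Hypothetical helper: it only assembles the command, it does not run it.
    """
    value = "true" if enabled else "false"
    return [
        "kafka-configs.sh",
        "--bootstrap-server", bootstrap,
        "--entity-type", "topics",
        "--entity-name", topic,
        "--alter",
        "--add-config", f"unclean.leader.election.enable={value}",
    ]


if __name__ == "__main__":
    # Step 1: temporarily allow an out-of-sync replica to become leader.
    print(shlex.join(unclean_election_command("__consumer_offsets", True)))
    # Step 2 (manual): rolling restart, then reassign the leaderless partitions.
    # Step 3: switch the override off again once every partition has a leader.
    print(shlex.join(unclean_election_command("__consumer_offsets", False)))
```

Unclean leader election can lose messages that the elected replica never received, which is why it is only switched on for the duration of the recovery.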
Thank you again for your help and advice.

Kindest regards,

Tom

On Wed, 11 Jan 2023 at 10:59, Tom Bolitho <tboli...@gmail.com> wrote:

> Just an update on the 'Leader: none' issue for one of the partitions of my
> __consumer_offsets topic. I have tried deleting all of the
> partition.metadata files relating to the __consumer_offsets topic on all of
> the nodes in the cluster. I have then restarted each node in the cluster.
> Unfortunately, the issue still persists; partitions 7 and 11 of the
> __consumer_offsets topic still have no leader. The replicas and Isr will
> also not update by running kafka-reassign-partitions.sh. The replica and Isr
> are stuck on a node that we are looking to decommission but which is still
> running and part of the cluster.
>
> I have attempted to increase the ReplicationFactor of the
> __consumer_offsets topic to 3 and move the replicas to a new cluster
> (1,2,3,4) by running kafka-reassign-partitions.sh on that topic, e.g.
>
> kafka-reassign-partitions.sh --bootstrap-server localhost:9092
> --reassignment-json-file /<reassignmentfilename>.json --execute
>
> This has been partially successful. All but two of the partitions for the
> __consumer_offsets topic have reassigned their replicas to the new cluster
> and updated the leader. However, the reassignment of partitions 7 and 11
> just hangs and the Replica and Isr for those partitions are stuck on node 5
> that we are looking to decommission. When running a --verify on the
> reassignment task the status of that partition reassignment is:
>
> 'Reassignment of partition __consumer_offsets-7 is still in progress'
>
> The only option is to then --cancel the reassignment.
>
> Below is the --describe output of two of the partitions for the
> __consumer_offsets topic. As you can see, Partition 6 has successfully
> updated the replication factor to 3, assigned the replicas from node 5 to
> the new cluster and assigned the leader for that partition as the first
> replica.
> However, Partition 7 is still stuck with only one replica, on node
> 5 that we are looking to decommission, and it also has no leader.
>
> Topic: __consumer_offsets Partition: 6 Leader: 3 Replicas: 3,4,1 Isr: 3,4,1
> Topic: __consumer_offsets Partition: 7 Leader: None Replicas: 5 Isr: 5
>
> It's worth noting that this partition 7 has been in this state on the old
> cluster for a while and is not having a noticeable performance impact.
> However, we are looking to decommission node 5 and we really need to fully
> migrate the __consumer_offsets topic, and all its partitions, to the new
> cluster before we decommission it.
>
> Thank you for your ideas and input so far. It's been very much
> appreciated. If anyone else has any other ideas on how I can re-assign the
> remaining replicas of this unhappy partition to our new cluster and
> re-establish a leader I'd be grateful for any kind of steer.
>
> As some extra info, I have checked all of the topic_id's in all of the
> partition.metadata files relating to the problematic topic and they are all
> the same.
>
> Is the last resort to delete the __consumer_offsets topic? Or will this
> cause data loss? This is unfortunately a Production system.
>
> Many thanks,
>
> Tom
>
> On Tue, 10 Jan 2023 at 13:58, Tom Bolitho <tboli...@gmail.com> wrote:
>
>> Hi Megh
>>
>> Many thanks for taking the time to get back to me. It sounds like we've
>> had a similar issue, although I've checked all of the topic_id's in all of
>> the partition.metadata files relating to the problematic topic
>> __consumer_offsets
>> (e.g. grep -r 'topic_id' /data/*/kafka/data/__consumer_offsets-*) and all
>> of the topic ID's are the same on every partition, on all of the nodes for
>> that topic. It sounds like some of your topic_id's were different and
>> therefore you've got a slightly different issue and resolution?
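A rough Python equivalent of the grep above, as a sketch only: it reads the topic_id entry from every partition.metadata file under the same glob and reports how many distinct IDs it finds. The function name collect_topic_ids is hypothetical, and the parsing assumes each file contains a line of the form "topic_id: <value>", which is what the grep pattern relies on. It compares the on-disk files with each other only, not with the ID held in ZooKeeper.

```python
import glob
import os


def collect_topic_ids(partition_dir_glob):
    """Map each partition directory to the topic_id in its partition.metadata."""
    ids = {}
    for part_dir in sorted(glob.glob(partition_dir_glob)):
        meta = os.path.join(part_dir, "partition.metadata")
        if not os.path.isfile(meta):
            continue  # directory without a metadata file: nothing to compare
        with open(meta, encoding="utf-8") as fh:
            for line in fh:
                key, _, value = line.partition(":")
                if key.strip() == "topic_id":
                    ids[part_dir] = value.strip()
    return ids


if __name__ == "__main__":
    # Glob taken from the grep example in this thread; adjust per broker.
    found = collect_topic_ids("/data/*/kafka/data/__consumer_offsets-*")
    distinct = set(found.values())
    print(f"{len(found)} partition dirs, {len(distinct)} distinct topic_id(s)")
    for part_dir, topic_id in found.items():
        print(topic_id, part_dir)
```

More than one distinct ID on a broker would point towards the mismatch Megh described; a single ID everywhere matches what was seen here.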
>>
>> Kind regards,
>>
>> Tom
>>
>> On Tue, 10 Jan 2023 at 02:08, megh vidani <vidanimeg...@gmail.com> wrote:
>>
>>> Hi Tom,
>>>
>>> We faced a similar problem wherein there was an issue with the ISR and we
>>> were also getting NotLeaderOrFollowerException on the consumer end. Also,
>>> it was not getting fixed automatically with broker restarts.
>>>
>>> We eventually found out that the topicId for a few partitions in the topic
>>> (in the partition.metadata file) was different from the actual topicId in
>>> zookeeper. I'd suggest you check that as well.
>>>
>>> The way we fixed it was to remove the partition.metadata file (only this
>>> file alone!!) from all the partition directories of that topic and then
>>> restart the brokers. This was the safest option we found as it doesn't
>>> incur any data loss. Before figuring this out we used to delete and
>>> re-create the topic, which resulted in data being lost.
>>>
>>> Hope this helps.
>>>
>>> Thanks,
>>> Megh
>>>
>>> On Mon, 9 Jan 2023, 22:28 Tom Bolitho, <tboli...@gmail.com> wrote:
>>>
>>> > Dear Kafka Community,
>>> >
>>> > I'm hoping you can help with a kafka topic partition that is missing a
>>> > leader. The topic in question is the '__consumer_offsets' topic.
>>> >
>>> > The output of a '--describe' on that topic looks like:
>>> >
>>> > Topic: __consumer_offsets Partition: 7 Leader: none Replicas: 5 Isr: 5
>>> > Topic: __consumer_offsets Partition: 11 Leader: none Replicas: 5 Isr: 5
>>> >
>>> > The other 48 partitions are all ok and have an assigned leader (some
>>> > with 5 as the leader).
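To pick the leaderless partitions out of a long --describe listing, something like the Python sketch below could be used. It assumes the per-partition lines look like the two shown above ("Topic: ... Partition: N Leader: X ..."); the function name and the regular expression are illustrative only.

```python
import re

# Matches the per-partition lines of the --describe output quoted in this thread.
PARTITION_LINE = re.compile(r"Topic:\s*(\S+)\s+Partition:\s*(\d+)\s+Leader:\s*(\S+)")


def leaderless_partitions(describe_output):
    """Return (topic, partition) pairs whose Leader field is 'none'."""
    missing = []
    for line in describe_output.splitlines():
        match = PARTITION_LINE.search(line)
        # Compare case-insensitively: the thread shows both 'none' and 'None'.
        if match and match.group(3).lower() == "none":
            missing.append((match.group(1), int(match.group(2))))
    return missing


if __name__ == "__main__":
    sample = (
        "Topic: __consumer_offsets Partition: 6 Leader: 3 Replicas: 3,4,1 Isr: 3,4,1\n"
        "Topic: __consumer_offsets Partition: 7 Leader: none Replicas: 5 Isr: 5\n"
    )
    print(leaderless_partitions(sample))
```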
>>> >
>>> > I have tried running a --reassignment-json-file against the topic, e.g.
>>> >
>>> > kafka-reassign-partitions.sh --bootstrap-server localhost:9092
>>> > --reassignment-json-file /<reassignmentfilename>.json --execute
>>> >
>>> > but the reassignment just hangs, with the two partitions that are
>>> > missing a leader reporting:
>>> > 'Reassignment of partition __consumer_offsets-7 is still in progress'
>>> >
>>> > I've since had to --cancel that reassignment.
>>> >
>>> > Can anyone advise on how I can overcome the issue of this missing
>>> > leader please?
>>> >
>>> > My eventual goal is to reassign this __consumer_offsets topic with a
>>> > replication factor of 3 to increase resiliency now that the cluster is
>>> > in production. I realise we should have set
>>> > offsets.topic.replication.factor to a value higher than 1 before we
>>> > spun up the prod cluster but this was missed, so we're now looking to
>>> > manually reassign the __consumer_offsets topic with a higher
>>> > replication factor.
>>> >
>>> > Any advice on how to overcome this 'Leader: none' issue would be
>>> > greatly appreciated.
>>> >
>>> > Many thanks,
>>> >
>>> > Tom
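The file passed to --reassignment-json-file is JSON with a version number and a list of partition entries, each naming a topic, a partition and its replica list. Below is a minimal Python sketch that builds such a plan for the two stuck partitions with three replicas spread over brokers 1 to 4. The round-robin placement and the function name build_reassignment are assumptions for illustration; the replica sets actually used in this thread are not given.

```python
import json


def build_reassignment(topic, partitions, brokers, replication_factor):
    """Build a reassignment plan with replicas placed round-robin over brokers.

    The first broker in each replica list is the preferred leader.
    """
    if replication_factor > len(brokers):
        raise ValueError("replication factor exceeds the number of brokers")
    entries = []
    for partition in partitions:
        start = partition % len(brokers)  # vary the preferred leader per partition
        replicas = [
            brokers[(start + offset) % len(brokers)]
            for offset in range(replication_factor)
        ]
        entries.append({"topic": topic, "partition": partition, "replicas": replicas})
    return {"version": 1, "partitions": entries}


if __name__ == "__main__":
    plan = build_reassignment("__consumer_offsets", [7, 11], [1, 2, 3, 4], 3)
    # Redirect stdout to a file and pass that file to --reassignment-json-file.
    print(json.dumps(plan, indent=2))
```

Listing only the affected partitions keeps the reassignment small, which makes a hung or cancelled run easier to reason about than a plan covering all 50 partitions.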