Hi, I was expanding my Kafka cluster from 16 to 24 nodes and rebalancing the 
topics. One partition of a topic did not get rebalanced as expected (the 
reassignment was taking far too long, so I looked into what was happening). It 
turned out that the script that mounts the second disk partition as /data had 
not run on one of the nodes, so there simply wasn't enough disk space 
available at the time of the rebalance. The system was left with roughly 5 MB 
of free space and the Kafka brokers were essentially broken at that point.

So I had to kill the Kafka process (it wouldn't shut down cleanly), move the 
original Kafka data folder to a /tmp location, mount the data partition, and 
move the /tmp Kafka folder back to its original spot. But when I tried to 
start the Kafka instance again, I got the following message over and over, 
every few milliseconds.

[2024-06-03 06:14:01,503] ERROR Encountered metadata loading fault: Unhandled error initializing new publishers (org.apache.kafka.server.fault.LoggingFaultHandler)
org.apache.kafka.image.writer.UnwritableMetadataException: Metadata has been lost because the following could not be represented in metadata version 3.4-IV0: the directory assignment state of one or more replicas
	at org.apache.kafka.image.writer.ImageWriterOptions.handleLoss(ImageWriterOptions.java:94)
	at org.apache.kafka.metadata.PartitionRegistration.toRecord(PartitionRegistration.java:391)
	at org.apache.kafka.image.TopicImage.write(TopicImage.java:71)
	at org.apache.kafka.image.TopicsImage.write(TopicsImage.java:84)
	at org.apache.kafka.image.MetadataImage.write(MetadataImage.java:155)
	at org.apache.kafka.image.loader.MetadataLoader.initializeNewPublishers(MetadataLoader.java:295)
	at org.apache.kafka.image.loader.MetadataLoader.lambda$scheduleInitializeNewPublishers$0(MetadataLoader.java:266)
	at org.apache.kafka.queue.KafkaEventQueue$EventContext.run(KafkaEventQueue.java:127)
	at org.apache.kafka.queue.KafkaEventQueue$EventHandler.handleEvents(KafkaEventQueue.java:210)
	at org.apache.kafka.queue.KafkaEventQueue$EventHandler.run(KafkaEventQueue.java:181)
	at java.base/java.lang.Thread.run(Thread.java:1583)
[2024-06-03 06:14:01,556] INFO [BrokerLifecycleManager id=28] The broker is in RECOVERY. (kafka.server.BrokerLifecycleManager)
Scarier still, if any other node that is currently working gets restarted, it 
too starts emitting that same message.

Scariest of all, when I restarted one of the KRaft controllers, it now dies 
with the same error and never comes back up (I'm hoping things will keep 
working with 2 KRaft controllers instead of 3 until this is resolved).

I am using a KRaft setup and upgraded to Kafka 3.7 within the past month (the 
original setup over a year ago was Kafka 3.4, which was later upgraded to 3.5, 
then 3.6, before the recent upgrade to 3.7).

How do I resolve this issue? I'm not sure what the problem is or how to fix it.

Is it possible to recover from this, or do I need to start from scratch? If I 
start from scratch, how do I keep the topics and offsets? What is the best way 
to proceed from here? I haven't been able to find anything related to this 
problem via a Google search.

I'm at a loss and would appreciate any help you can provide.

Thank you.
