Hi all,

We're in the process of upgrading from 2.2 to 2.6 and of moving from an old set of nodes to a new set. We have an engineering environment where everything went smoothly:

- upgrade the software on all nodes
- restart all brokers (we didn't have the requirement to make it a rolling restart)
- make the new nodes join the cluster
- move all topic data over with kafka-reassign-partitions.sh (commands sketched below)
- stop the now "empty" brokers on the old nodes
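For reference, this is roughly how we drove the moves (from memory; the host name, broker IDs and file names are placeholders, not our real values):

    # generate a candidate plan that puts the listed topics onto the new brokers
    kafka-reassign-partitions.sh --bootstrap-server kafka1:9092 \
        --topics-to-move-json-file topics.json --broker-list "4,5,6" --generate

    # run the plan, then poll until it reports completion
    kafka-reassign-partitions.sh --bootstrap-server kafka1:9092 \
        --reassignment-json-file plan.json --execute
    kafka-reassign-partitions.sh --bootstrap-server kafka1:9092 \
        --reassignment-json-file plan.json --verify

where topics.json just lists the topics to move, e.g. {"version":1,"topics":[{"topic":"XXX_Flat"}]}, and plan.json is the proposed assignment printed by --generate.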
Now we tried to do the same in our development environment. The upgrade was OK and the new nodes joined the cluster. All looked good. But already the first kafka-reassign-partitions.sh run, for a very small topic with only a few thousand messages, took forever. It seemed to be stuck trying to move the partitions from the old nodes to the new nodes. I didn't see anything peculiar in the logs, so I decided to cancel the reassignment and restart it.

From there, everything went south. The cluster seemed to accept new reassignments, but nothing would move. The directories under data/ on the new target nodes were created, but no data would arrive and the topic remained under-replicated. We began to restart nodes, taking all of them down and up again, to no avail. Things went really bad, and soon all nodes failed to start properly (they would auto-shutdown after a few seconds) with the message

    [2020-09-18 13:56:29,242] ERROR [KafkaServer id=2] Fatal error during KafkaServer startup. Prepare to shutdown (kafka.server.KafkaServer)
    java.lang.IllegalArgumentException: requirement failed: Split operation is only permitted for segments with overflow

Just before that, there are warnings in the log like

    [2020-09-18 13:56:29,235] WARN [Log partition=XXX_Flat-0, dir=/app/PCY/dyn/pcykafka/data] Found a corrupted index file corresponding to log file /app/kafka/data/XXX_Flat-0/00000000000000004938.log due to Corrupt time index found, time index file (/app/kafka/data/XXX_Flat-0/00000000000000004938.timeindex) has non-zero size but the last timestamp is 0 which is less than the first timestamp 1599477219176}, recovering segment and rebuilding index files... (kafka.log.Log)
    [2020-09-18 13:56:29,238] INFO [Log partition=XXX_Flat-0, dir=/app/kafka/data] Loading producer state till offset 4938 with message format version 2 (kafka.log.Log)
    [2020-09-18 13:56:29,239] INFO [ProducerStateManager partition=XXX_Flat-0] Writing producer snapshot at offset 4938 (kafka.log.ProducerStateManager)

So now I'm totally stuck and can't start the brokers anymore (neither the new ones nor the old ones). The problem is not so much the data loss (I have backups and can go back, but it's probably not even worth the effort in DEV). What I need to understand is what happened here and how to avoid it in our test and prod environments. I have >6 GB of logs, but honestly I don't know what to look for. Obviously I tried to make sense of the logs from the time when the problems started, but I have not been able to identify any cause.

Anybody got some hints?

CU,
Joe
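PS: In case it matters, the cancel was done with the 2.6 tool itself. From memory, so take the exact invocation with a grain of salt:

    # cancel the in-flight reassignments listed in the plan file
    kafka-reassign-partitions.sh --bootstrap-server kafka1:9092 \
        --reassignment-json-file plan.json --cancel

It reported the reassignments as cancelled, and the second --execute with the same plan file was accepted without complaints.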