Viktor Somogyi-Vass created KAFKA-17950:
-------------------------------------------

             Summary: The leader requested truncation to below the current high watermark
                 Key: KAFKA-17950
                 URL: https://issues.apache.org/jira/browse/KAFKA-17950
             Project: Kafka
          Issue Type: Bug
    Affects Versions: 3.9.0, 3.9.1
            Reporter: Viktor Somogyi-Vass
         Attachments: broker1.log, broker2.log, broker3.log, controller-logs.zip, controller1-migration-enabled.properties, controller1.properties, controller2-migration-enabled.properties, controller2.properties, controller3-migration-enabled.properties, controller3.properties, kraft1.log, kraft2.log, kraft3.log, producer-perf.log, producer.properties, server1-migrated-to-kraft.properties, server1-migration-enabled.properties, server1.properties, server2-migrated-to-kraft.properties, server2-migration-enabled.properties, server2.properties, server3-migrated-to-kraft.properties, server3-migration-enabled.properties, server3.properties, zookeeper.log

While testing the migration from 3.9 ZK Kafka to 3.9 KRaft, I found that in the last step (finalization), where I restart the controllers in non-migration mode, the last controller restart causes a fatal failure in the cluster: every node (broker and controller) stops except the controller I restarted.

The failing nodes all throw the same exception at the same time:
{noformat}
[2024-11-06 14:02:13,498] ERROR Encountered fatal fault: Unexpected error in raft IO thread (org.apache.kafka.server.fault.ProcessTerminatingFaultHandler)
org.apache.kafka.common.KafkaException: The leader requested truncation to offset 484, which is below the current high watermark LogOffsetMetadata(offset=508, metadata=Optional.empty)
	at org.apache.kafka.raft.KafkaRaftClient.lambda$handleFetchResponse$11(KafkaRaftClient.java:1619)
	at java.base/java.util.Optional.ifPresent(Optional.java:183)
	at org.apache.kafka.raft.KafkaRaftClient.handleFetchResponse(KafkaRaftClient.java:1616)
	at org.apache.kafka.raft.KafkaRaftClient.handleResponse(KafkaRaftClient.java:2457)
	at org.apache.kafka.raft.KafkaRaftClient.handleInboundMessage(KafkaRaftClient.java:2613)
	at org.apache.kafka.raft.KafkaRaftClient.poll(KafkaRaftClient.java:3312)
	at org.apache.kafka.raft.KafkaRaftClientDriver.doWork(KafkaRaftClientDriver.java:64)
	at org.apache.kafka.server.util.ShutdownableThread.run(ShutdownableThread.java:136)
{noformat}

Setup:
* single ZooKeeper node
* 3 brokers
* 1 running producer-performance client
* 3 controllers

Repro:
# Start ZooKeeper with zookeeper.properties
{noformat}
bin/zookeeper-server-start.sh repro-conf/zookeeper.properties
{noformat}
# Start the brokers with serverX.properties
{noformat}
bin/kafka-server-start.sh repro-conf/server1.properties
bin/kafka-server-start.sh repro-conf/server2.properties
bin/kafka-server-start.sh repro-conf/server3.properties
{noformat}
# Start the producer-performance tool
{noformat}
bin/kafka-producer-perf-test.sh --topic test1 --num-records 1000000 --throughput 100 --record-size 10000 --producer.config repro-conf/producer.properties
{noformat}
# Start the controllers in migration mode
{noformat}
bin/kafka-server-start.sh repro-conf/controller1-migration-enabled.properties
bin/kafka-server-start.sh repro-conf/controller2-migration-enabled.properties
bin/kafka-server-start.sh repro-conf/controller3-migration-enabled.properties
{noformat}
# Restart the brokers in migration mode with the following configs. (My restart order was 1,2,3.)
{noformat}
bin/kafka-server-start.sh repro-conf/server1-migration-enabled.properties
bin/kafka-server-start.sh repro-conf/server2-migration-enabled.properties
bin/kafka-server-start.sh repro-conf/server3-migration-enabled.properties
{noformat}
# Restart the brokers in migrated mode with the following configs. At this point they are connected to the controllers and not to ZK. (My restart order was 1,2,3.)
{noformat}
bin/kafka-server-start.sh repro-conf/server1-migrated-to-kraft.properties
bin/kafka-server-start.sh repro-conf/server2-migrated-to-kraft.properties
bin/kafka-server-start.sh repro-conf/server3-migrated-to-kraft.properties
{noformat}
# At this point all brokers run with KRaft, so restart the controllers in non-migration mode to finalize. (My restart order was 3,2,1.)
{noformat}
bin/kafka-server-start.sh repro-conf/controller3.properties
bin/kafka-server-start.sh repro-conf/controller2.properties
bin/kafka-server-start.sh repro-conf/controller1.properties
{noformat}

At the last restart, when controller1 starts up, all other nodes crash at once. I have attached all logs and configuration files.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
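
When going through the attached logs, it can help to see how far below the high watermark each requested truncation fell. The following is only a rough sketch and not part of the original report: `truncation_gap` is a hypothetical helper name, and the pattern assumes the exception message has exactly the wording quoted in the stack trace in this report.

```shell
# Hypothetical triage helper (not a Kafka tool): scan log files, or stdin
# when no file is given, for the fatal truncation error and print the
# requested truncation offset, the high watermark, and the gap between them.
truncation_gap() {
  # Extract "<requested offset> <high watermark offset>" from matching lines.
  sed -nE 's/.*requested truncation to offset ([0-9]+), which is below the current high watermark LogOffsetMetadata\(offset=([0-9]+),.*/\1 \2/p' "$@" |
  while read -r requested hwm; do
    echo "requested=${requested} high_watermark=${hwm} gap=$((hwm - requested))"
  done
}
```

It could be run as `truncation_gap broker1.log broker2.log broker3.log`. For the exception line quoted in this report it prints `requested=484 high_watermark=508 gap=24`, meaning the leader asked the follower to discard 24 offsets that the follower already considered committed.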