[ https://issues.apache.org/jira/browse/KAFKA-17950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Viktor Somogyi-Vass resolved KAFKA-17950.
-----------------------------------------
    Resolution: Invalid

Ok, my mistake. It seems an incorrect voter list configuration in controller1.properties caused the issue.

> The leader requested truncation to below the current high watermark
> -------------------------------------------------------------------
>
>                 Key: KAFKA-17950
>                 URL: https://issues.apache.org/jira/browse/KAFKA-17950
>             Project: Kafka
>          Issue Type: Bug
>   Affects Versions: 3.9.0, 3.9.1
>            Reporter: Viktor Somogyi-Vass
>            Priority: Blocker
>         Attachments: broker1.log, broker2.log, broker3.log,
> controller-logs.zip, controller1-migration-enabled.properties,
> controller1.properties, controller2-migration-enabled.properties,
> controller2.properties, controller3-migration-enabled.properties,
> controller3.properties, kraft1.log, kraft2.log, kraft3.log,
> producer-perf.log, producer.properties, server1-migrated-to-kraft.properties,
> server1-migration-enabled.properties, server1.properties,
> server2-migrated-to-kraft.properties, server2-migration-enabled.properties,
> server2.properties, server3-migrated-to-kraft.properties,
> server3-migration-enabled.properties, server3.properties, zookeeper.log
>
>
> While testing the migration from 3.9 ZK Kafka to 3.9 KRaft, I find that in
> the last step (finalization), where I restart the controllers in non-migration
> mode, the last controller restart causes a fatal failure in the cluster:
> every node (broker and controller) stops except the controller I restarted.
> The failing nodes throw the same exception at that time:
> {noformat}
> [2024-11-06 14:02:13,498] ERROR Encountered fatal fault: Unexpected error in raft IO thread (org.apache.kafka.server.fault.ProcessTerminatingFaultHandler)
> org.apache.kafka.common.KafkaException: The leader requested truncation to offset 484, which is below the current high watermark LogOffsetMetadata(offset=508, metadata=Optional.empty)
>     at org.apache.kafka.raft.KafkaRaftClient.lambda$handleFetchResponse$11(KafkaRaftClient.java:1619)
>     at java.base/java.util.Optional.ifPresent(Optional.java:183)
>     at org.apache.kafka.raft.KafkaRaftClient.handleFetchResponse(KafkaRaftClient.java:1616)
>     at org.apache.kafka.raft.KafkaRaftClient.handleResponse(KafkaRaftClient.java:2457)
>     at org.apache.kafka.raft.KafkaRaftClient.handleInboundMessage(KafkaRaftClient.java:2613)
>     at org.apache.kafka.raft.KafkaRaftClient.poll(KafkaRaftClient.java:3312)
>     at org.apache.kafka.raft.KafkaRaftClientDriver.doWork(KafkaRaftClientDriver.java:64)
>     at org.apache.kafka.server.util.ShutdownableThread.run(ShutdownableThread.java:136)
> {noformat}
> Setup:
> * single Zookeeper node
> * 3 brokers
> * 1 running producer-performance client
> * 3 controllers
> Repro:
> # Start Zookeeper with zookeeper.properties
> {noformat}
> bin/zookeeper-server-start.sh repro-conf/zookeeper.properties
> {noformat}
> # Start brokers with serverX.properties
> {noformat}
> bin/kafka-server-start.sh repro-conf/server1.properties
> bin/kafka-server-start.sh repro-conf/server2.properties
> bin/kafka-server-start.sh repro-conf/server3.properties
> {noformat}
> # Start the producer-performance tool
> {noformat}
> bin/kafka-producer-perf-test.sh --topic test1 --num-records 1000000 --throughput 100 --record-size 10000 --producer.config repro-conf/producer.properties
> {noformat}
> # Get the cluster ID and format all controller log dirs
> # Start the controllers in migration mode
> {noformat}
> bin/kafka-server-start.sh repro-conf/controller1-migration-enabled.properties
> bin/kafka-server-start.sh repro-conf/controller2-migration-enabled.properties
> bin/kafka-server-start.sh repro-conf/controller3-migration-enabled.properties
> {noformat}
> # Restart the brokers (rolling) in migration mode with the following configs.
> (My restart order was 1,2,3.)
> {noformat}
> bin/kafka-server-start.sh repro-conf/server1-migration-enabled.properties
> bin/kafka-server-start.sh repro-conf/server2-migration-enabled.properties
> bin/kafka-server-start.sh repro-conf/server3-migration-enabled.properties
> {noformat}
> # Restart the brokers (rolling) in migrated mode with the following configs
> (at this point they are connected to the controllers and not ZK). My restart
> order was 1,2,3.
> {noformat}
> bin/kafka-server-start.sh repro-conf/server1-migrated-to-kraft.properties
> bin/kafka-server-start.sh repro-conf/server2-migrated-to-kraft.properties
> bin/kafka-server-start.sh repro-conf/server3-migrated-to-kraft.properties
> {noformat}
> # At this point all brokers run with KRaft, so do a rolling restart of the
> controllers to finalize. (The order was 3,2,1.)
> {noformat}
> bin/kafka-server-start.sh repro-conf/controller3.properties
> bin/kafka-server-start.sh repro-conf/controller2.properties
> bin/kafka-server-start.sh repro-conf/controller1.properties
> {noformat}
> At the last restart, when controller1 starts up, all other nodes crash at
> once. I attached all logs and configuration.
> I've been working from the 3.9 branch, the hash is 4a562cd.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
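
Since the resolution points to a voter list misconfiguration in controller1.properties, a quick consistency check before the final controller restart might have caught it. The following is only a minimal sketch: it assumes the voter list in question is the static `controller.quorum.voters` property (the report does not say which setting was wrong), and the helper names `voters_of` and `check_voters` are made up for illustration. It compares the raw property strings, so two lists with the same voters in a different order would be reported as a mismatch.

```shell
# Print the value of the last controller.quorum.voters entry in file $1.
# Assumption: the voter list is configured through this static property.
voters_of() {
  sed -n 's/^[[:space:]]*controller\.quorum\.voters[[:space:]]*=[[:space:]]*//p' "$1" | tail -n 1
}

# Print each file's voter list and succeed only if all of them are identical.
# The function body runs in a subshell so its variables do not leak out.
check_voters() (
  seen=0
  status=0
  for file in "$@"; do
    value=$(voters_of "$file")
    echo "$file: ${value:-<not set>}"
    if [ "$seen" -eq 0 ]; then
      first=$value
      seen=1
    elif [ "$value" != "$first" ]; then
      status=1
    fi
  done
  [ "$status" -eq 0 ]
)
```

With the attached configs this could be run as `check_voters repro-conf/controller1.properties repro-conf/controller2.properties repro-conf/controller3.properties`; a non-zero exit status means at least one file declares a different voter list.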