Viktor Somogyi-Vass created KAFKA-17950:
-------------------------------------------

             Summary: The leader requested truncation to below the current high 
watermark
                 Key: KAFKA-17950
                 URL: https://issues.apache.org/jira/browse/KAFKA-17950
             Project: Kafka
          Issue Type: Bug
    Affects Versions: 3.9.0, 3.9.1
            Reporter: Viktor Somogyi-Vass
         Attachments: broker1.log, broker2.log, broker3.log, 
controller-logs.zip, controller1-migration-enabled.properties, 
controller1.properties, controller2-migration-enabled.properties, 
controller2.properties, controller3-migration-enabled.properties, 
controller3.properties, kraft1.log, kraft2.log, kraft3.log, producer-perf.log, 
producer.properties, server1-migrated-to-kraft.properties, 
server1-migration-enabled.properties, server1.properties, 
server2-migrated-to-kraft.properties, server2-migration-enabled.properties, 
server2.properties, server3-migrated-to-kraft.properties, 
server3-migration-enabled.properties, server3.properties, zookeeper.log

While testing the migration from 3.9 ZK Kafka to 3.9 KRaft, I found that in the 
last step (finalization), where I restart the controllers in non-migration mode, 
the last controller restart causes a fatal failure in the cluster: every node 
(broker and controller) stops except the controller I restarted.
The failing nodes all throw the same exception at that time:
{noformat}
[2024-11-06 14:02:13,498] ERROR Encountered fatal fault: Unexpected error in raft IO thread (org.apache.kafka.server.fault.ProcessTerminatingFaultHandler)
org.apache.kafka.common.KafkaException: The leader requested truncation to offset 484, which is below the current high watermark LogOffsetMetadata(offset=508, metadata=Optional.empty)
        at org.apache.kafka.raft.KafkaRaftClient.lambda$handleFetchResponse$11(KafkaRaftClient.java:1619)
        at java.base/java.util.Optional.ifPresent(Optional.java:183)
        at org.apache.kafka.raft.KafkaRaftClient.handleFetchResponse(KafkaRaftClient.java:1616)
        at org.apache.kafka.raft.KafkaRaftClient.handleResponse(KafkaRaftClient.java:2457)
        at org.apache.kafka.raft.KafkaRaftClient.handleInboundMessage(KafkaRaftClient.java:2613)
        at org.apache.kafka.raft.KafkaRaftClient.poll(KafkaRaftClient.java:3312)
        at org.apache.kafka.raft.KafkaRaftClientDriver.doWork(KafkaRaftClientDriver.java:64)
        at org.apache.kafka.server.util.ShutdownableThread.run(ShutdownableThread.java:136)
{noformat}
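
To compare the two offsets across nodes, the fault message can be pulled out of each node's log. The following is only a minimal sketch: the helper name `truncation_fault_offsets` is hypothetical, and it assumes the exception message occupies a single line in the log file and has exactly the wording shown in the trace.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Kafka): for every truncation fault in a
# log file, print "<requested truncation offset> <high watermark offset>".
# Assumes the exception message is on one line, worded as in the trace.
truncation_fault_offsets() {
  sed -nE 's/.*The leader requested truncation to offset ([0-9]+), which is below the current high watermark LogOffsetMetadata\(offset=([0-9]+),.*/\1 \2/p' "$1"
}
```

It would be called once per log file, for example `truncation_fault_offsets broker1.log`. Lines that do not contain the fault message produce no output.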

Setup:
* single ZooKeeper node
* 3 brokers
* 1 running producer-performance client
* 3 controllers

Repro:
# Start ZooKeeper with zookeeper.properties
{noformat}
bin/zookeeper-server-start.sh repro-conf/zookeeper.properties
{noformat}
# Start brokers with serverX.properties
{noformat}
bin/kafka-server-start.sh repro-conf/server1.properties
bin/kafka-server-start.sh repro-conf/server2.properties
bin/kafka-server-start.sh repro-conf/server3.properties
{noformat}
# Start the producer-performance tool
{noformat}
bin/kafka-producer-perf-test.sh --topic test1 --num-records 1000000 \
  --throughput 100 --record-size 10000 \
  --producer.config repro-conf/producer.properties
{noformat}
# Start the controllers in migration mode
{noformat}
bin/kafka-server-start.sh repro-conf/controller1-migration-enabled.properties
bin/kafka-server-start.sh repro-conf/controller2-migration-enabled.properties
bin/kafka-server-start.sh repro-conf/controller3-migration-enabled.properties
{noformat}
# Restart the brokers in migration mode with the following configs. (My restart 
order was 1,2,3.)
{noformat}
bin/kafka-server-start.sh repro-conf/server1-migration-enabled.properties
bin/kafka-server-start.sh repro-conf/server2-migration-enabled.properties
bin/kafka-server-start.sh repro-conf/server3-migration-enabled.properties
{noformat}
# Restart the brokers in migrated mode with the following configs (at this 
point they are connected to the controllers and not ZK). My restart order was 
1,2,3.
{noformat}
bin/kafka-server-start.sh repro-conf/server1-migrated-to-kraft.properties
bin/kafka-server-start.sh repro-conf/server2-migrated-to-kraft.properties
bin/kafka-server-start.sh repro-conf/server3-migrated-to-kraft.properties
{noformat}
# At this point all brokers run on KRaft, so restart the controllers in 
non-migration mode to finalize. (My restart order was 3,2,1.)
{noformat}
bin/kafka-server-start.sh repro-conf/controller3.properties
bin/kafka-server-start.sh repro-conf/controller2.properties
bin/kafka-server-start.sh repro-conf/controller1.properties
{noformat}
At the last restart, when controller1 starts up, all other nodes crash at once. 
All logs and configuration files are attached.
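
For quick reference, the restart rounds above can be summarized as a dry run that only prints the start commands in the order used. This is a sketch under assumptions: `print_round` is a hypothetical helper, stopping each node before it is restarted is left out, and the ZooKeeper and producer steps are omitted.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print (do not run) the start commands for one round.
# $1 = node kind ("server" or "controller")
# $2 = config suffix (may be empty)
# remaining arguments = node ids in restart order
print_round() {
  round_kind=$1
  round_suffix=$2
  shift 2
  for round_id in "$@"; do
    echo "bin/kafka-server-start.sh repro-conf/${round_kind}${round_id}${round_suffix}.properties"
  done
}

print_round server "" 1 2 3                        # brokers in ZK mode
print_round controller "-migration-enabled" 1 2 3  # controllers, migration mode
print_round server "-migration-enabled" 1 2 3      # brokers, migration mode
print_round server "-migrated-to-kraft" 1 2 3      # brokers, migrated to KRaft
print_round controller "" 3 2 1                    # finalization; crash on the last one
```

The last printed command, the start of controller1 with `controller1.properties`, is the restart after which the other nodes crashed.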



