Juha Mynttinen created KAFKA-17752:
--------------------------------------

             Summary: Controller crashes when removed if it is an initial controller
                 Key: KAFKA-17752
                 URL: https://issues.apache.org/jira/browse/KAFKA-17752
             Project: Kafka
          Issue Type: Bug
    Affects Versions: 3.9.0
            Reporter: Juha Mynttinen


Hey, 

Tested using 3.9.0 RC0.

It seems that "kafka-metadata-quorum.sh remove-controller" causes the removed 
controller to crash if it is one of the controllers specified using 
"--initial-controllers".

Steps to reproduce:

Clean up and set up the environment

rm -rf /tmp/controllers && \
mkdir -p /tmp/controllers/c1 && \
mkdir -p /tmp/controllers/c2 && \
mkdir -p /tmp/controllers/c3

export KAFKA_HOME=<your_kafka_3_9_home>

Format the controllers

$KAFKA_HOME/bin/kafka-storage.sh format \
  --cluster-id 00000000-0000-0000-0000-000000000001 \
  --initial-controllers 1001@localhost:10001:AAAAAAAAAAEAAAAAAAAAAA,1002@localhost:10002:AAAAAAAAAAEAAAAAAAAAAA,1003@localhost:10003:AAAAAAAAAAEAAAAAAAAAAA \
  --config c1.properties
$KAFKA_HOME/bin/kafka-storage.sh format \
  --cluster-id 00000000-0000-0000-0000-000000000001 \
  --initial-controllers 1001@localhost:10001:AAAAAAAAAAEAAAAAAAAAAA,1002@localhost:10002:AAAAAAAAAAEAAAAAAAAAAA,1003@localhost:10003:AAAAAAAAAAEAAAAAAAAAAA \
  --config c2.properties
$KAFKA_HOME/bin/kafka-storage.sh format \
  --cluster-id 00000000-0000-0000-0000-000000000001 \
  --initial-controllers 1001@localhost:10001:AAAAAAAAAAEAAAAAAAAAAA,1002@localhost:10002:AAAAAAAAAAEAAAAAAAAAAA,1003@localhost:10003:AAAAAAAAAAEAAAAAAAAAAA \
  --config c3.properties
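
The three format commands above differ only in the --config argument, so they can be sketched as a loop. This is only a convenience sketch: the helper names initial_controllers and format_all are made up for illustration, and the node IDs, ports, and directory ID are copied from the commands in this report.

```shell
# Sketch only. DIR_ID, the node IDs and the ports are the values used in this report.
DIR_ID="AAAAAAAAAAEAAAAAAAAAAA"

# Hypothetical helper: prints the comma-separated --initial-controllers value.
initial_controllers() {
  ic_out=""
  for ic_n in 1 2 3; do
    ic_out="${ic_out:+$ic_out,}100${ic_n}@localhost:1000${ic_n}:${DIR_ID}"
  done
  printf '%s\n' "$ic_out"
}

# Hypothetical helper: formats c1..c3 with the same cluster ID and voter set.
# Requires KAFKA_HOME to point at a Kafka 3.9 installation.
format_all() {
  for fa_n in 1 2 3; do
    "$KAFKA_HOME/bin/kafka-storage.sh" format \
      --cluster-id 00000000-0000-0000-0000-000000000001 \
      --initial-controllers "$(initial_controllers)" \
      --config "c${fa_n}.properties" || return 1
  done
}
```

Calling format_all after sourcing the snippet should be equivalent to running the three commands by hand.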

Start the controllers, in separate terminals

$KAFKA_HOME/bin/kafka-run-class.sh -name kafkaService kafka.Kafka c1.properties
$KAFKA_HOME/bin/kafka-run-class.sh -name kafkaService kafka.Kafka c2.properties
$KAFKA_HOME/bin/kafka-run-class.sh -name kafkaService kafka.Kafka c3.properties
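
The c1.properties, c2.properties and c3.properties files are not included in this report. The following is a guess at a minimal controller-only configuration that matches the node IDs, ports and directories used above; the listener name CONTROLLER, the choice of controller.quorum.bootstrap.servers, and the helper name write_props are all assumptions and the reporter's actual files may differ.

```shell
# Hypothetical helper: writes cN.properties for controller N (1, 2 or 3)
# into the current directory. All property values are assumptions derived
# from the IDs, ports and paths that appear in this report.
write_props() {
  wp_n="$1"
  cat > "c${wp_n}.properties" <<EOF
process.roles=controller
node.id=100${wp_n}
listeners=CONTROLLER://localhost:1000${wp_n}
controller.listener.names=CONTROLLER
controller.quorum.bootstrap.servers=localhost:10001,localhost:10002,localhost:10003
log.dirs=/tmp/controllers/c${wp_n}
EOF
}
```

For example, running write_props 1, write_props 2 and write_props 3 before the format step would produce the three files.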

Remove a controller:

$KAFKA_HOME/bin/kafka-metadata-quorum.sh \
  --bootstrap-controller localhost:10001,localhost:10002,localhost:10003,localhost:10004 \
  remove-controller --controller-id 1001 --controller-directory-id AAAAAAAAAAEAAAAAAAAAAA

The process crashes with the following error:

[2024-10-09 15:19:15,574] ERROR Encountered fatal fault: exception while renouncing leadership (org.apache.kafka.server.fault.ProcessTerminatingFaultHandler)
java.lang.RuntimeException: Unable to reset to last stable offset 55. No in-memory snapshot found for this offset.
        at org.apache.kafka.controller.OffsetControlManager.deactivate(OffsetControlManager.java:268)
        at org.apache.kafka.controller.QuorumController.renounce(QuorumController.java:1281)
        at org.apache.kafka.controller.QuorumController.handleEventException(QuorumController.java:552)
        at org.apache.kafka.controller.QuorumController.access$800(QuorumController.java:180)
        at org.apache.kafka.controller.QuorumController$ControllerWriteEvent.complete(QuorumController.java:885)
        at org.apache.kafka.controller.QuorumController$ControllerWriteEvent.handleException(QuorumController.java:875)
        at org.apache.kafka.queue.KafkaEventQueue$EventContext.completeWithException(KafkaEventQueue.java:153)
        at org.apache.kafka.queue.KafkaEventQueue$EventContext.run(KafkaEventQueue.java:142)
        at org.apache.kafka.queue.KafkaEventQueue$EventHandler.handleEvents(KafkaEventQueue.java:215)
        at org.apache.kafka.queue.KafkaEventQueue$EventHandler.run(KafkaEventQueue.java:186)
        at java.base/java.lang.Thread.run(Thread.java:840)

If the process that died is restarted, it joins the cluster and becomes an 
observer, as expected.

The crash doesn't happen in a slightly different case. I don't have the exact 
steps, but the idea is this:
1. Create a 3-controller cluster as above.
2. Format and start a 4th controller.
3. Add the 4th controller as a voter.
4. Remove the 4th controller to make it an observer. It becomes an observer as 
expected.

Because this case works, I'm guessing the crash is somehow related to the 
controller being one of the initial controllers.

I didn't dig deeper on why the crash occurs.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
