[ https://issues.apache.org/jira/browse/KAFKA-14197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Colin McCabe resolved KAFKA-14197.
----------------------------------
    Resolution: Duplicate

> Kraft broker fails to startup after topic creation failure
> ----------------------------------------------------------
>
>                 Key: KAFKA-14197
>                 URL: https://issues.apache.org/jira/browse/KAFKA-14197
>             Project: Kafka
>          Issue Type: Bug
>          Components: kraft
>            Reporter: Luke Chen
>            Priority: Blocker
>             Fix For: 3.3.0
>
>
> In the KRaft ControllerWriteEvent, we first apply the records to the
> controller's in-memory state, then send them out via the Raft client. But if
> an error occurs while sending the records, there is no way to revert the
> change to the controller's in-memory state[1].
> The issue happens when creating topics: the controller state is updated with
> the topic and partition metadata (e.g. the broker-to-ISR map), but the records
> are not sent out successfully (e.g. RecordBatchTooLargeException). Then, when
> the node shuts down, the controlled shutdown tries to remove the broker from
> the ISR by[2]:
> {code:java}
> generateLeaderAndIsrUpdates("enterControlledShutdown[" + brokerId + "]",
>     brokerId, NO_LEADER, records,
>     brokersToIsrs.partitionsWithBrokerInIsr(brokerId));{code}
>
> After we append the partitionChangeRecords and send them to the metadata
> topic successfully, the brokers fail to "replay" these partition changes,
> because the topics/partitions were never created successfully.
> Even worse, after the node restarts, all the metadata records are replayed
> again and the same error happens again, so the broker cannot start up
> successfully.
>
> The error and call stack are shown below; essentially, it complains that the
> topic image can't be found:
> {code:java}
> [2022-09-02 16:29:16,334] ERROR Encountered metadata loading fault: Error replaying metadata log record at offset 81 (org.apache.kafka.server.fault.LoggingFaultHandler)
> java.lang.NullPointerException
>     at org.apache.kafka.image.TopicDelta.replay(TopicDelta.java:69)
>     at org.apache.kafka.image.TopicsDelta.replay(TopicsDelta.java:91)
>     at org.apache.kafka.image.MetadataDelta.replay(MetadataDelta.java:248)
>     at org.apache.kafka.image.MetadataDelta.replay(MetadataDelta.java:186)
>     at kafka.server.metadata.BrokerMetadataListener.$anonfun$loadBatches$3(BrokerMetadataListener.scala:239)
>     at java.base/java.util.ArrayList.forEach(ArrayList.java:1541)
>     at kafka.server.metadata.BrokerMetadataListener.kafka$server$metadata$BrokerMetadataListener$$loadBatches(BrokerMetadataListener.scala:232)
>     at kafka.server.metadata.BrokerMetadataListener$HandleCommitsEvent.run(BrokerMetadataListener.scala:113)
>     at org.apache.kafka.queue.KafkaEventQueue$EventContext.run(KafkaEventQueue.java:121)
>     at org.apache.kafka.queue.KafkaEventQueue$EventHandler.handleEvents(KafkaEventQueue.java:200)
>     at org.apache.kafka.queue.KafkaEventQueue$EventHandler.run(KafkaEventQueue.java:173)
>     at java.base/java.lang.Thread.run(Thread.java:829)
> {code}
>
> [1]
> [https://github.com/apache/kafka/blob/ef65b6e566ef69b2f9b58038c98a5993563d7a68/metadata/src/main/java/org/apache/kafka/controller/QuorumController.java#L779-L804]
>
> [2]
> [https://github.com/apache/kafka/blob/trunk/metadata/src/main/java/org/apache/kafka/controller/ReplicationControlManager.java#L1270]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
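
The failure sequence described in the issue (state applied in memory, append fails, controlled shutdown then writes change records for partitions that were never logged, replay fails on restart) can be reproduced in a toy model. This is only a minimal sketch under stated assumptions: ToyLog, ToyController, the "CREATE"/"CHANGE" string records, and the batch-size limit are all hypothetical stand-ins, not Kafka classes or record types, and createTopicAppendFirst is just one possible alternative ordering, not necessarily how the issue was actually resolved.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model only: none of these names are Kafka APIs.
class ApplyThenAppendSketch {

    /** Stand-in for the metadata log; rejects batches above a size limit. */
    static final class ToyLog {
        final int maxBatchSize;
        final List<String> records = new ArrayList<>();

        ToyLog(int maxBatchSize) { this.maxBatchSize = maxBatchSize; }

        void append(List<String> batch) {
            if (batch.size() > maxBatchSize) {
                throw new IllegalStateException("batch too large: " + batch.size());
            }
            records.addAll(batch);
        }
    }

    /** Stand-in for the controller: in-memory partition set plus a log. */
    static final class ToyController {
        final Set<String> partitions = new HashSet<>();
        final ToyLog log;

        ToyController(int maxBatchSize) { this.log = new ToyLog(maxBatchSize); }

        /** Ordering described in the issue: mutate state first, then append. */
        void createTopicApplyFirst(String topic, int numPartitions) {
            List<String> batch = buildRecords(topic, numPartitions);
            apply(batch);       // in-memory state changes here...
            log.append(batch);  // ...and stays changed if this throws
        }

        /** Hypothetical alternative: append first, mutate only on success. */
        void createTopicAppendFirst(String topic, int numPartitions) {
            List<String> batch = buildRecords(topic, numPartitions);
            log.append(batch);
            apply(batch);
        }

        /** Emits one change record per partition currently held in memory. */
        void controlledShutdown() {
            for (String p : new ArrayList<>(partitions)) {
                log.append(Collections.singletonList("CHANGE " + p));
            }
        }

        private static List<String> buildRecords(String topic, int n) {
            List<String> batch = new ArrayList<>();
            for (int i = 0; i < n; i++) {
                batch.add("CREATE " + topic + "-" + i);
            }
            return batch;
        }

        private void apply(List<String> batch) {
            for (String r : batch) {
                partitions.add(r.substring("CREATE ".length()));
            }
        }
    }

    /** Replays the log as a restarting node would; fails on unknown partitions. */
    static Set<String> replay(List<String> records) {
        Set<String> image = new HashSet<>();
        for (String r : records) {
            if (r.startsWith("CREATE ")) {
                image.add(r.substring("CREATE ".length()));
            } else if (r.startsWith("CHANGE ")) {
                String p = r.substring("CHANGE ".length());
                if (!image.contains(p)) {
                    throw new IllegalStateException("replay: unknown partition " + p);
                }
            }
        }
        return image;
    }

    public static void main(String[] args) {
        ToyController c = new ToyController(2);
        try {
            c.createTopicApplyFirst("foo", 3);  // 3 records exceed the limit of 2
        } catch (IllegalStateException e) {
            System.out.println("append failed: " + e.getMessage());
        }
        System.out.println("phantom partitions in memory: " + c.partitions.size());
        c.controlledShutdown();  // writes CHANGE records for the phantom partitions
        try {
            replay(c.log.records);
        } catch (IllegalStateException e) {
            System.out.println("restart fails: " + e.getMessage());
        }
    }
}
```

In this toy model the apply-first ordering leaves three partitions in memory with nothing in the log, so the change records written at shutdown can never be replayed. With the append-first ordering the failed creation leaves no in-memory state, and the shutdown writes nothing.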