NanerLee created KAFKA-9267:
-------------------------------

             Summary: ZkSecurityMigrator should not create /controller node
                 Key: KAFKA-9267
                 URL: https://issues.apache.org/jira/browse/KAFKA-9267
             Project: Kafka
          Issue Type: Bug
          Components: admin
            Reporter: NanerLee

As the source shows ([ZkSecurityMigrator.scala#L226|https://github.com/apache/kafka/blob/2accf14ccf9b1f96c9dd8cfb94530c56378fae80/core/src/main/scala/kafka/admin/ZkSecurityMigrator.scala#L226]), _ZkSecurityMigrator_ checks and sets ACLs recursively for each path in _SecureRootPaths_, and _/controller_ is one of those paths.

As a consequence, _zkClient.makeSurePersistentPathExists()_ creates the _/controller_ node if it does not already exist.

_/controller_ is meant to be an *EPHEMERAL* node used for controller election, but _makeSurePersistentPathExists()_ creates a *PERSISTENT* node with *null* data. When that happens, the null data causes an *NPE* during controller election, no controller can be elected, and the Kafka cluster becomes unavailable.

In addition, a *PERSISTENT* node does not disappear automatically, so it has to be deleted manually to fix the problem.

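The failure can be reproduced in a small model. The sketch below is not Kafka or ZooKeeper code: {{FakeZk}}, {{Znode}}, {{secureAll}}, {{secureSkippingAbsentController}} and {{decodeControllerId}} are hypothetical stand-ins built on an in-memory map, and the null check only imitates the NPE that the JSON parser raises on null data. It shows how an unconditional "make sure the path exists" step leaves behind a persistent, data-less _/controller_ node, and how skipping that path when it is absent avoids the problem.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical in-memory model. None of these types are Kafka or ZooKeeper classes.
class ControllerZnodeSketch {

    static final String CONTROLLER_PATH = "/controller";

    // Matches the brokerid field of the controller JSON shown in this report.
    private static final Pattern BROKER_ID = Pattern.compile("\"brokerid\"\\s*:\\s*(\\d+)");

    /** A znode reduced to the two properties that matter here. */
    static final class Znode {
        final byte[] data;       // null models a node created without data
        final boolean ephemeral; // false models a PERSISTENT node

        Znode(byte[] data, boolean ephemeral) {
            this.data = data;
            this.ephemeral = ephemeral;
        }
    }

    /** Stand-in for a ZooKeeper tree: path -> znode. */
    static final class FakeZk {
        final Map<String, Znode> nodes = new HashMap<>();

        boolean exists(String path) {
            return nodes.containsKey(path);
        }

        // Assumed behaviour of makeSurePersistentPathExists(), as described in this report:
        // create a PERSISTENT node with null data when the path is absent.
        void makeSurePersistentPathExists(String path) {
            nodes.putIfAbsent(path, new Znode(null, false));
        }
    }

    /** Models the reported behaviour: every secure root path is created if missing. */
    static void secureAll(FakeZk zk, List<String> secureRootPaths) {
        for (String path : secureRootPaths) {
            zk.makeSurePersistentPathExists(path);
            // Setting the ACL is omitted in this model.
        }
    }

    /** Models the proposed guard: leave /controller alone when it does not exist. */
    static void secureSkippingAbsentController(FakeZk zk, List<String> secureRootPaths) {
        for (String path : secureRootPaths) {
            if (path.equals(CONTROLLER_PATH) && !zk.exists(path)) {
                continue;
            }
            zk.makeSurePersistentPathExists(path);
        }
    }

    /** Imitates decoding the controller znode: null data fails with an NPE. */
    static int decodeControllerId(byte[] data) {
        Objects.requireNonNull(data, "znode data is null");
        Matcher m = BROKER_ID.matcher(new String(data, StandardCharsets.UTF_8));
        if (!m.find()) {
            throw new IllegalArgumentException("no brokerid field");
        }
        return Integer.parseInt(m.group(1));
    }

    public static void main(String[] args) {
        List<String> paths = Arrays.asList("/brokers", CONTROLLER_PATH);

        FakeZk unguarded = new FakeZk();
        secureAll(unguarded, paths);
        Znode created = unguarded.nodes.get(CONTROLLER_PATH);
        System.out.println("unguarded: ephemeral=" + created.ephemeral
                + ", data=" + Arrays.toString(created.data));

        FakeZk guarded = new FakeZk();
        secureSkippingAbsentController(guarded, paths);
        System.out.println("guarded: /controller exists=" + guarded.exists(CONTROLLER_PATH));
    }
}
```

In the unguarded run the model ends up with a non-ephemeral _/controller_ entry whose data is null, which is the state that makes {{decodeControllerId}} throw. In the guarded run the path stays absent, so a real broker would still be free to create its own ephemeral node.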
*PERSISTENT* _/controller_ node with *null* data in zk:
{code:java}
[zk: localhost:2181(CONNECTED) 16] get /kafka/controller
null
cZxid = 0x1100002284
ctime = Tue Dec 03 18:37:26 CST 2019
mZxid = 0x1100002284
mtime = Tue Dec 03 18:37:26 CST 2019
pZxid = 0x1100002284
cversion = 0
dataVersion = 0
aclVersion = 1
ephemeralOwner = 0x0
dataLength = 0
numChildren = 0{code}

*Normal* _/controller_ node in zk:
{code:java}
[zk: localhost:2181(CONNECTED) 21] get /kafka/controller
{"version":1,"brokerid":1001,"timestamp":"1575370170528"}
cZxid = 0x11000023e1
ctime = Tue Dec 03 18:49:30 CST 2019
mZxid = 0x11000023e1
mtime = Tue Dec 03 18:49:30 CST 2019
pZxid = 0x11000023e1
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x16ecb572df50021
dataLength = 57
numChildren = 0{code}

*NPE* in controller.log:
{code:java}
[2019-11-21 15:02:41,276] INFO [ControllerEventThread controllerId=1002] Starting (kafka.controller.ControllerEventManager$ControllerEventThread)
[2019-11-21 15:02:41,282] ERROR [ControllerEventThread controllerId=1002] Error processing event Startup (kafka.controller.ControllerEventManager$ControllerEventThread)
java.lang.NullPointerException
	at com.fasterxml.jackson.core.JsonFactory.createParser(JsonFactory.java:857)
	at com.fasterxml.jackson.databind.ObjectMapper.readTree(ObjectMapper.java:2572)
	at kafka.utils.Json$.parseBytes(Json.scala:62)
	at kafka.zk.ControllerZNode$.decode(ZkData.scala:56)
	at kafka.zk.KafkaZkClient.getControllerId(KafkaZkClient.scala:902)
	at kafka.controller.KafkaController.kafka$controller$KafkaController$$elect(KafkaController.scala:1199)
	at kafka.controller.KafkaController$Startup$.process(KafkaController.scala:1148)
	at kafka.controller.ControllerEventManager$ControllerEventThread$$anonfun$doWork$1.apply$mcV$sp(ControllerEventManager.scala:86)
	at kafka.controller.ControllerEventManager$ControllerEventThread$$anonfun$doWork$1.apply(ControllerEventManager.scala:86)
	at kafka.controller.ControllerEventManager$ControllerEventThread$$anonfun$doWork$1.apply(ControllerEventManager.scala:86)
	at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:31)
	at kafka.controller.ControllerEventManager$ControllerEventThread.doWork(ControllerEventManager.scala:85)
	at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82){code}

I have therefore submitted a PR so that _ZkSecurityMigrator_ skips the _/controller_ node when it does not exist.

This bug seems to affect all versions, so please review and merge the PR as soon as possible.

Thanks!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)