[ https://issues.apache.org/jira/browse/KAFKA-9267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
NanerLee updated KAFKA-9267:
----------------------------
    Description:
As the source shows ([ZkSecurityMigrator.scala#L226|https://github.com/apache/kafka/blob/2accf14ccf9b1f96c9dd8cfb94530c56378fae80/core/src/main/scala/kafka/admin/ZkSecurityMigrator.scala#L226]), _ZkSecurityMigrator_ checks and sets the ACL recursively for each path in _SecureRootPaths_, and _/controller_ is one of those paths.
As a result, _zkClient.makeSurePersistentPathExists()_ creates the _/controller_ node if _/controller_ does not exist.
_/controller_ is an *EPHEMERAL* node used for controller election, but _makeSurePersistentPathExists()_ creates a *PERSISTENT* node with *null* data.
When that happens, the null data causes an *NPE*, the controller cannot be elected, and the Kafka cluster becomes unavailable.
In addition, a *PERSISTENT* node does not disappear automatically; it has to be deleted manually to fix the problem.

*PERSISTENT* _/controller_ node with *null* data in zk:
{code:java}
[zk: localhost:2181(CONNECTED) 16] get /kafka/controller
null
cZxid = 0x1100002284
ctime = Tue Dec 03 18:37:26 CST 2019
mZxid = 0x1100002284
mtime = Tue Dec 03 18:37:26 CST 2019
pZxid = 0x1100002284
cversion = 0
dataVersion = 0
aclVersion = 1
ephemeralOwner = 0x0
dataLength = 0
numChildren = 0{code}
*Normal* _/controller_ node in zk:
{code:java}
[zk: localhost:2181(CONNECTED) 21] get /kafka/controller
{"version":1,"brokerid":1001,"timestamp":"1575370170528"}
cZxid = 0x11000023e1
ctime = Tue Dec 03 18:49:30 CST 2019
mZxid = 0x11000023e1
mtime = Tue Dec 03 18:49:30 CST 2019
pZxid = 0x11000023e1
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x16ecb572df50021
dataLength = 57
numChildren = 0{code}
*NPE* in controller.log:
{code:java}
[2019-11-21 15:02:41,276] INFO [ControllerEventThread controllerId=1002] Starting (kafka.controller.ControllerEventManager$ControllerEventThread)
[2019-11-21 15:02:41,282] ERROR [ControllerEventThread controllerId=1002] Error processing event Startup (kafka.controller.ControllerEventManager$ControllerEventThread)
java.lang.NullPointerException
    at com.fasterxml.jackson.core.JsonFactory.createParser(JsonFactory.java:857)
    at com.fasterxml.jackson.databind.ObjectMapper.readTree(ObjectMapper.java:2572)
    at kafka.utils.Json$.parseBytes(Json.scala:62)
    at kafka.zk.ControllerZNode$.decode(ZkData.scala:56)
    at kafka.zk.KafkaZkClient.getControllerId(KafkaZkClient.scala:902)
    at kafka.controller.KafkaController.kafka$controller$KafkaController$$elect(KafkaController.scala:1199)
    at kafka.controller.KafkaController$Startup$.process(KafkaController.scala:1148)
    at kafka.controller.ControllerEventManager$ControllerEventThread$$anonfun$doWork$1.apply$mcV$sp(ControllerEventManager.scala:86)
    at kafka.controller.ControllerEventManager$ControllerEventThread$$anonfun$doWork$1.apply(ControllerEventManager.scala:86)
    at kafka.controller.ControllerEventManager$ControllerEventThread$$anonfun$doWork$1.apply(ControllerEventManager.scala:86)
    at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:31)
    at kafka.controller.ControllerEventManager$ControllerEventThread.doWork(ControllerEventManager.scala:85)
    at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82){code}

I have submitted a PR so that _ZkSecurityMigrator_ skips the _/controller_ node when _/controller_ does not exist.
This bug seems to affect all versions. Please review and merge the PR as soon as possible.
Thanks!

> ZkSecurityMigrator should not create /controller node
> -----------------------------------------------------
>
>                 Key: KAFKA-9267
>                 URL: https://issues.apache.org/jira/browse/KAFKA-9267
>             Project: Kafka
>          Issue Type: Bug
>          Components: admin
>            Reporter: NanerLee
>            Priority: Major
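
For illustration, the failure mode and the proposed guard can be sketched roughly as follows. This is a minimal model that uses an in-memory map in place of ZooKeeper, not the actual Kafka code or the submitted PR. The class _ControllerPathSketch_, its _Node_ type, and the _migrateUnguarded_ / _migrateGuarded_ methods are hypothetical names; only _makeSurePersistentPathExists_ and the _/controller_ path come from the report, and the real fix may look different.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// In-memory stand-in for ZooKeeper. This is NOT the Kafka or ZooKeeper API;
// every name here except makeSurePersistentPathExists is hypothetical.
final class ControllerPathSketch {

    // Minimal znode model: a payload plus an ephemeral/persistent flag.
    static final class Node {
        final byte[] data;
        final boolean ephemeral;

        Node(byte[] data, boolean ephemeral) {
            this.data = data;
            this.ephemeral = ephemeral;
        }
    }

    final Map<String, Node> nodes = new HashMap<>();

    // Models the behaviour described in the report: a missing path is
    // created as a PERSISTENT node with null data; existing nodes are kept.
    void makeSurePersistentPathExists(String path) {
        nodes.putIfAbsent(path, new Node(null, false));
    }

    // Behaviour as reported: every secure root path is created if missing,
    // including /controller.
    void migrateUnguarded(List<String> secureRootPaths) {
        for (String path : secureRootPaths) {
            makeSurePersistentPathExists(path);
            // The recursive ACL update would happen here.
        }
    }

    // Assumed shape of the fix: skip /controller when it does not exist,
    // so that only a broker ever creates it (as an ephemeral node).
    void migrateGuarded(List<String> secureRootPaths) {
        for (String path : secureRootPaths) {
            if (path.equals("/controller") && !nodes.containsKey(path)) {
                continue;
            }
            makeSurePersistentPathExists(path);
            // The recursive ACL update would happen here.
        }
    }

    public static void main(String[] args) {
        List<String> paths = List.of("/brokers", "/controller");

        ControllerPathSketch unguarded = new ControllerPathSketch();
        unguarded.migrateUnguarded(paths);
        // prints true: a persistent, null-data /controller now blocks election
        System.out.println(unguarded.nodes.containsKey("/controller"));

        ControllerPathSketch guarded = new ControllerPathSketch();
        guarded.migrateGuarded(paths);
        // prints false: /controller is left for the brokers to create
        System.out.println(guarded.nodes.containsKey("/controller"));
    }
}
```

In this model the guard only applies when _/controller_ is absent. If a controller has already been elected, the existing ephemeral node is left untouched and the ACL step can still run on it.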