Hi, We are running Apache Kafka 2.7.1 on AWS EC2 with ZooKeeper running in the same VPC (private network).
Since version 2.7.0 we observe the subj. issue with the first start of the Kafka process when we rotate the EC2 instances (for the sake of software upgrade). Our supervisor script notices the failure and tries to start it again in a few seconds, which has always been successful so far. Sample logs from the first start: 2021-07-08 15:52:26.414 INFO starting (kafka.server.KafkaServer) 2021-07-08 15:52:26.415 INFO Connecting to zookeeper on 10.154.206.111:2181 ,10.154.192.139:2181,10.154.215.44:2181/staging-fra-2 (kafka.server.KafkaServer) 2021-07-08 15:52:26.432 INFO [ZooKeeperClient Kafka server] Initializing a new session to 10.154.206.111:2181,10.154.192.139:2181,10.154.215.44:2181. (kafka.zookeeper.ZooKeeperClient) 2021-07-08 15:52:26.438 INFO Client environment:zookeeper.version=3.5.9-83df9301aa5c2a5d284a9940177808c01bc35cef, built on 01/06/2021 20:03 GMT (org.apache.zookeeper.ZooKeeper) 2021-07-08 15:52:26.438 INFO Client environment:host.name=ip-10-154-192-11.eu-central-1.compute.internal (org.apache.zookeeper.ZooKeeper) 2021-07-08 15:52:26.438 INFO Client environment:java.version=1.8.0_292 (org.apache.zookeeper.ZooKeeper) 2021-07-08 15:52:26.438 INFO Client environment:java.vendor=Oracle Corporation (org.apache.zookeeper.ZooKeeper) 2021-07-08 15:52:26.438 INFO Client environment:java.home=/usr/local/openjdk-8/jre (org.apache.zookeeper.ZooKeeper) 2021-07-08 15:52:26.438 INFO Client environment:java.class.path=/opt/kafka/bin/../libs/activation-1.1.1.jar:... 2021-07-08 15:52:26.438 INFO Client environment:java.library.path=/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib (org.apache.zookeeper.ZooKeeper) 2021-07-08 15:52:26.438 INFO Client environment:java.io.tmpdir=/tmp (org.apache.zookeeper.ZooKeeper) 2021-07-08 15:52:26.438 INFO Client environment:java.compiler=<NA> (org.apache.zookeeper.ZooKeeper) 2021-07-08 15:52:26.438 INFO Client environment:os.name=Linux (org.apache.zookeeper.ZooKeeper) 2021-07-08 15:52:26.438 INFO Client environment:os.arch=amd64 (org.apache.zookeeper.ZooKeeper) 2021-07-08 15:52:26.438 INFO Client environment:os.version=4.15.0-1099-aws (org.apache.zookeeper.ZooKeeper) 2021-07-08 15:52:26.438 INFO Client environment:user.name=? (org.apache.zookeeper.ZooKeeper) 2021-07-08 15:52:26.439 INFO Client environment:user.home=? (org.apache.zookeeper.ZooKeeper) 2021-07-08 15:52:26.439 INFO Client environment:user.dir=/ (org.apache.zookeeper.ZooKeeper) 2021-07-08 15:52:26.439 INFO Client environment:os.memory.free=6074MB (org.apache.zookeeper.ZooKeeper) 2021-07-08 15:52:26.439 INFO Client environment:os.memory.max=6144MB (org.apache.zookeeper.ZooKeeper) 2021-07-08 15:52:26.442 INFO Client environment:os.memory.total=6144MB (org.apache.zookeeper.ZooKeeper) 2021-07-08 15:52:26.444 INFO Initiating client connection, connectString= 10.154.206.111:2181,10.154.192.139:2181,10.154.215.44:2181 sessionTimeout=18000 watcher=kafka.zookeeper.ZooKeeperClient$ZooKeeperClientWatcher$@5158b42f (org.apache.zookeeper.ZooKeeper) 2021-07-08 15:52:26.447 INFO jute.maxbuffer value is 4194304 Bytes (org.apache.zookeeper.ClientCnxnSocket) 2021-07-08 15:52:26.453 INFO zookeeper.request.timeout value is 0. feature enabled= (org.apache.zookeeper.ClientCnxn) 2021-07-08 15:52:26.455 INFO [ZooKeeperClient Kafka server] Waiting until connected. (kafka.zookeeper.ZooKeeperClient) * 2021-07-08 15:52:26.493 INFO Opening socket connection to server ip-10-154-192-139.eu-central-1.compute.internal/10.154.192.139:2181 <http://10.154.192.139:2181>. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn) 2021-07-08 15:52:32.456 INFO [ZooKeeperClient Kafka server] Closing. (kafka.zookeeper.ZooKeeperClient) 2021-07-08 15:52:32.459 WARN Client session timed out, have not heard from server in 6006ms for sessionid 0x0 (org.apache.zookeeper.ClientCnxn)* 2021-07-08 15:52:32.564 INFO Session: 0x0 closed (org.apache.zookeeper.ZooKeeper) 2021-07-08 15:52:32.564 INFO EventThread shut down for session: 0x0 (org.apache.zookeeper.ClientCnxn) 2021-07-08 15:52:32.566 INFO [ZooKeeperClient Kafka server] Closed. (kafka.zookeeper.ZooKeeperClient) 2021-07-08 15:52:32.569 ERROR Fatal error during KafkaServer startup. Prepare to shutdown (kafka.server.KafkaServer) kafka.zookeeper.ZooKeeperClientTimeoutException: Timed out waiting for connection while in state: CONNECTING #011at kafka.zookeeper.ZooKeeperClient.waitUntilConnected(ZooKeeperClient.scala:262) #011at kafka.zookeeper.ZooKeeperClient.<init>(ZooKeeperClient.scala:119) #011at kafka.zk.KafkaZkClient$.apply(KafkaZkClient.scala:1881) #011at kafka.server.KafkaServer.createZkClient$1(KafkaServer.scala:441) #011at kafka.server.KafkaServer.$anonfun$initZkClient$2(KafkaServer.scala:460) #011at kafka.server.KafkaServer.$anonfun$initZkClient$2$adapted(KafkaServer.scala:458) #011at scala.Option.foreach(Option.scala:437) #011at kafka.server.KafkaServer.initZkClient(KafkaServer.scala:458) #011at kafka.server.KafkaServer.startup(KafkaServer.scala:233) #011at kafka.server.KafkaServerStartable.startup(KafkaServerStartable.scala:44) #011at kafka.Kafka$.main(Kafka.scala:82) #011at kafka.Kafka.main(Kafka.scala) 2021-07-08 15:52:32.570 INFO shutting down (kafka.server.KafkaServer) 2021-07-08 15:52:32.574 INFO App info kafka.server for -1 unregistered (org.apache.kafka.common.utils.AppInfoParser) 2021-07-08 15:52:32.574 INFO shut down completed (kafka.server.KafkaServer) 2021-07-08 15:52:32.575 ERROR Exiting Kafka. (kafka.server.KafkaServerStartable) 2021-07-08 15:52:32.578 INFO shutting down (kafka.server.KafkaServer) Logs from the same EC2 instance after 2nd attempt to start (successful): 2021-07-08 15:52:46.624 INFO Initiating client connection, connectString= 10.154.206.111:2181,10.154.192.139:2181,10.154.215.44:2181 sessionTimeout=18000 watcher=kafka.zookeeper.ZooKeeperClient$ZooKeeper ClientWatcher$@5158b42f (org.apache.zookeeper.ZooKeeper) 2021-07-08 15:52:46.627 INFO jute.maxbuffer value is 4194304 Bytes (org.apache.zookeeper.ClientCnxnSocket) 2021-07-08 15:52:46.633 INFO zookeeper.request.timeout value is 0. feature enabled= (org.apache.zookeeper.ClientCnxn) 2021-07-08 15:52:46.635 INFO [ZooKeeperClient Kafka server] Waiting until connected. (kafka.zookeeper.ZooKeeperClient) 2021-07-08 15:52:46.638 INFO Opening socket connection to server ip-10-154-192-139.eu-central-1.compute.internal/10.154.192.139:2181. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn) 2021-07-08 15:52:46.641 INFO Socket connection established, initiating session, client: /10.154.192.11:50816, server: ip-10-154-192-139.eu-central-1.compute.internal/10.154.192.139:2181 (org.apache.zookeeper.ClientCnxn) 2021-07-08 15:52:46.645 INFO Session establishment complete on server ip-10-154-192-139.eu-central-1.compute.internal/10.154.192.139:2181, sessionid = 0x800000b004865c7, negotiated timeout = 18000 (org.apache.zookeeper.ClientCnxn) 2021-07-08 15:52:46.648 INFO [ZooKeeperClient Kafka server] Connected. (kafka.zookeeper.ZooKeeperClient) 2021-07-08 15:52:46.691 INFO Created zookeeper path /staging-fra-2 (kafka.server.KafkaServer) 2021-07-08 15:52:46.692 INFO [ZooKeeperClient Kafka server] Closing. (kafka.zookeeper.ZooKeeperClient) 2021-07-08 15:52:46.795 INFO Session: 0x800000b004865c7 closed (org.apache.zookeeper.ZooKeeper) 2021-07-08 15:52:46.795 INFO EventThread shut down for session: 0x800000b004865c7 (org.apache.zookeeper.ClientCnxn) 2021-07-08 15:52:46.797 INFO [ZooKeeperClient Kafka server] Closed. (kafka.zookeeper.ZooKeeperClient) 2021-07-08 15:52:46.798 INFO [ZooKeeperClient Kafka server] Initializing a new session to 10.154.206.111:2181,10.154.192.139:2181, 10.154.215.44:2181/staging-fra-2. (kafka.zookeeper.ZooKeeperClient) 2021-07-08 15:52:46.798 INFO Initiating client connection, connectString= 10.154.206.111:2181,10.154.192.139:2181,10.154.215.44:2181/staging-fra-2 sessionTimeout=18000 watcher=kafka.zookeeper.ZooKeeperClient$ZooKeeperClientWatcher$@3a52dba3 (org.apache.zookeeper.ZooKeeper) 2021-07-08 15:52:46.799 INFO jute.maxbuffer value is 4194304 Bytes (org.apache.zookeeper.ClientCnxnSocket) 2021-07-08 15:52:46.799 INFO zookeeper.request.timeout value is 0. feature enabled= (org.apache.zookeeper.ClientCnxn) 2021-07-08 15:52:46.800 INFO [ZooKeeperClient Kafka server] Waiting until connected. (kafka.zookeeper.ZooKeeperClient) 2021-07-08 15:52:46.800 INFO Opening socket connection to server ip-10-154-192-139.eu-central-1.compute.internal/10.154.192.139:2181. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn) 2021-07-08 15:52:46.801 INFO Socket connection established, initiating session, client: /10.154.192.11:50818, server: ip-10-154-192-139.eu-central-1.compute.internal/10.154.192.139:2181 (org.apache.zookeeper.ClientCnxn) 2021-07-08 15:52:46.803 INFO Session establishment complete on server ip-10-154-192-139.eu-central-1.compute.internal/10.154.192.139:2181, sessionid = 0x800000b004865c8, negotiated timeout = 18000 (org.apache.zookeeper.ClientCnxn) 2021-07-08 15:52:46.803 INFO [ZooKeeperClient Kafka server] Connected. (kafka.zookeeper.ZooKeeperClient) 2021-07-08 15:52:46.824 INFO [feature-zk-node-event-process-thread]: Starting (kafka.server.FinalizedFeatureChangeListener$ChangeNotificationProcessorThread) Now my questions are: 1. What could be the reason for a 6-second timeout on the first attempt to connect? Has anything changed significantly in this regard since version 2.6? On a quick glance over the release notes[1] I can't spot anything relevant. 2. Why does it "just work" within a few milliseconds on the second try? 3. We do not see anything relevant logged from ZooKeeper around the time of failed connection attempts. Is there a chance that DEBUG logging on ZooKeeper will reveal something? [1] https://downloads.apache.org/kafka/2.7.0/RELEASE_NOTES.html Regards, -- Alex