Scott Kidder created FLINK-7021: ----------------------------------- Summary: Flink Task Manager hangs on startup if one Zookeeper node is unresolvable Key: FLINK-7021 URL: https://issues.apache.org/jira/browse/FLINK-7021 Project: Flink Issue Type: Bug Components: Core Affects Versions: 1.2.1, 1.3.0, 1.2.0 Environment: Kubernetes cluster running: * Flink 1.3.0 Job Manager & Task Manager on Java 8u131 * Zookeeper 3.4.10 cluster with 3 nodes Reporter: Scott Kidder
h2. Problem Flink Task Manager will hang during startup if one of the Zookeeper nodes in the Zookeeper connection string is unresolvable. h2. Expected Behavior Flink should retry name resolution & connection to Zookeeper nodes with exponential back-off. h2. Environment Details We're running Flink and Zookeeper in Kubernetes on CoreOS. CoreOS can run in a configuration that automatically detects and applies operating system updates. We have a Zookeeper node running on the same CoreOS instance as Flink. It's possible that the Zookeeper node will not yet be started when the Flink components are started. This could cause hostname resolution of the Zookeeper nodes to fail. h3. Flink Task Manager Logs {noformat} 2017-06-27 15:38:51,713 INFO org.apache.flink.runtime.taskmanager.TaskManager - Using configured hostname/address for TaskManager: 10.2.45.11 2017-06-27 15:38:51,714 INFO org.apache.flink.runtime.taskmanager.TaskManager - Starting TaskManager 2017-06-27 15:38:51,714 INFO org.apache.flink.runtime.taskmanager.TaskManager - Starting TaskManager actor system at 10.2.45.11:6122. 2017-06-27 15:38:52,950 INFO akka.event.slf4j.Slf4jLogger - Slf4jLogger started 2017-06-27 15:38:53,079 INFO Remoting - Starting remoting 2017-06-27 15:38:53,573 INFO Remoting - Remoting started; listening on addresses :[akka.tcp://flink@10.2.45.11:6122] 2017-06-27 15:38:53,576 INFO org.apache.flink.runtime.taskmanager.TaskManager - Starting TaskManager actor 2017-06-27 15:38:53,660 INFO org.apache.flink.runtime.io.network.netty.NettyConfig - NettyConfig [server address: /10.2.45.11, server port: 6121, ssl enabled: false, memory segment size (bytes): 32768, transport type: NIO, number of server threads: 2 (manual), number of client threads: 2 (manual), server connect backlog: 0 (use Netty's default), client connect timeout (sec): 120, send/receive buffer size (bytes): 0 (use Netty's default)] 2017-06-27 15:38:53,682 INFO org.apache.flink.runtime.taskexecutor.TaskManagerConfiguration - Messages have a max timeout of 10000 ms 2017-06-27 15:38:53,688 INFO org.apache.flink.runtime.taskexecutor.TaskManagerServices - Temporary file directory '/tmp': total 49 GB, usable 42 GB (85.71% usable) 2017-06-27 15:38:54,071 INFO org.apache.flink.runtime.io.network.buffer.NetworkBufferPool - Allocated 96 MB for network buffer pool (number of memory segments: 3095, bytes per segment: 32768). 2017-06-27 15:38:54,564 INFO org.apache.flink.runtime.io.network.NetworkEnvironment - Starting the network environment and its components. 2017-06-27 15:38:54,576 INFO org.apache.flink.runtime.io.network.netty.NettyClient - Successful initialization (took 4 ms). 2017-06-27 15:38:54,677 INFO org.apache.flink.runtime.io.network.netty.NettyServer - Successful initialization (took 101 ms). Listening on SocketAddress /10.2.45.11:6121. 2017-06-27 15:38:54,981 INFO org.apache.flink.runtime.taskexecutor.TaskManagerServices - Limiting managed memory to 0.7 of the currently free heap space (612 MB), memory will be allocated lazily. 2017-06-27 15:38:55,050 INFO org.apache.flink.runtime.io.disk.iomanager.IOManager - I/O manager uses directory /tmp/flink-io-ca01554d-f25e-4c17-a828-96d82b43d4a7 for spill files. 2017-06-27 15:38:55,061 INFO org.apache.flink.runtime.metrics.MetricRegistry - Configuring StatsDReporter with {interval=10 SECONDS, port=8125, host=localhost, class=org.apache.flink.metrics.statsd.StatsDReporter}. 2017-06-27 15:38:55,065 INFO org.apache.flink.metrics.statsd.StatsDReporter - Configured StatsDReporter with {host:localhost, port:8125} 2017-06-27 15:38:55,065 INFO org.apache.flink.runtime.metrics.MetricRegistry - Periodically reporting metrics in intervals of 10 SECONDS for reporter statsd of type org.apache.flink.metrics.statsd.StatsDReporter. 2017-06-27 15:38:55,175 INFO org.apache.flink.runtime.filecache.FileCache - User file cache uses directory /tmp/flink-dist-cache-e4c5bcc5-7513-40d9-a665-0d33c80a36ba 2017-06-27 15:38:55,187 INFO org.apache.flink.runtime.filecache.FileCache - User file cache uses directory /tmp/flink-dist-cache-310ba2f8-f96a-4c3f-b1db-35ac26b83f7e 2017-06-27 15:38:55,273 INFO org.apache.flink.runtime.taskmanager.TaskManager - Starting TaskManager actor at akka://flink/user/taskmanager#207081801. 2017-06-27 15:38:55,273 INFO org.apache.flink.runtime.taskmanager.TaskManager - TaskManager data connection information: 7f86855dac2af4cca9eb2ae4c046630e @ flink-taskmanager-3116622558-sqggc (dataPort=6121) 2017-06-27 15:38:55,273 INFO org.apache.flink.runtime.taskmanager.TaskManager - TaskManager has 2 task slot(s). 2017-06-27 15:38:55,276 INFO org.apache.flink.runtime.taskmanager.TaskManager - Memory usage stats: [HEAP: 124/981/981 MB, NON HEAP: 43/44/-1 MB (used/committed/max)] 2017-06-27 15:38:55,276 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Starting ZooKeeperLeaderRetrievalService. 2017-06-27 15:39:10,289 ERROR org.apache.flink.shaded.org.apache.curator.ConnectionState - Connection timed out for connection string (zookeeper-0.zookeeper:2181,zookeeper-1.zookeeper:2181,zookeeper-2.zookeeper:2181) and timeout (15000) / elapsed (18617) org.apache.flink.shaded.org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss at org.apache.flink.shaded.org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:225) at org.apache.flink.shaded.org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:94) at org.apache.flink.shaded.org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:117) at org.apache.flink.shaded.org.apache.curator.framework.recipes.cache.NodeCache.reset(NodeCache.java:242) at org.apache.flink.shaded.org.apache.curator.framework.recipes.cache.NodeCache.start(NodeCache.java:175) at org.apache.flink.shaded.org.apache.curator.framework.recipes.cache.NodeCache.start(NodeCache.java:154) at org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService.start(ZooKeeperLeaderRetrievalService.java:100) at org.apache.flink.runtime.taskmanager.TaskManager.preStart(TaskManager.scala:205) at akka.actor.Actor$class.aroundPreStart(Actor.scala:472) at org.apache.flink.runtime.taskmanager.TaskManager.aroundPreStart(TaskManager.scala:120) at akka.actor.ActorCell.create(ActorCell.scala:580) at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:456) at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478) at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:263) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) 2017-06-27 15:39:30,349 INFO org.apache.zookeeper.ZooKeeper - Initiating client connection, connectString=zookeeper-0.zookeeper:2181,zookeeper-1.zookeeper:2181,zookeeper-2.zookeeper:2181 sessionTimeout=60000 watcher=org.apache.flink.shaded.org.apache.curator.ConnectionState@16f7b4af 2017-06-27 15:40:00,388 WARN org.apache.flink.shaded.org.apache.curator.ConnectionState - Connection attempt unsuccessful after 68719 (greater than max timeout of 60000). Resetting connection and trying again with a new connection. 2017-06-27 15:40:00,388 INFO org.apache.zookeeper.ZooKeeper - Initiating client connection, connectString=zookeeper-0.zookeeper:2181,zookeeper-1.zookeeper:2181,zookeeper-2.zookeeper:2181 sessionTimeout=60000 watcher=org.apache.flink.shaded.org.apache.curator.ConnectionState@16f7b4af 2017-06-27 15:40:00,450 ERROR org.apache.flink.shaded.org.apache.curator.framework.imps.CuratorFrameworkImpl - Ensure path threw exception java.net.UnknownHostException: zookeeper-1.zookeeper: Name or service not known at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method) at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:928) at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1323) at java.net.InetAddress.getAllByName0(InetAddress.java:1276) at java.net.InetAddress.getAllByName(InetAddress.java:1192) at java.net.InetAddress.getAllByName(InetAddress.java:1126) at org.apache.zookeeper.client.StaticHostProvider.<init>(StaticHostProvider.java:61) at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:445) at org.apache.flink.shaded.org.apache.curator.utils.DefaultZookeeperFactory.newZooKeeper(DefaultZookeeperFactory.java:29) at org.apache.flink.shaded.org.apache.curator.framework.imps.CuratorFrameworkImpl$2.newZooKeeper(CuratorFrameworkImpl.java:150) at org.apache.flink.shaded.org.apache.curator.HandleHolder$1.getZooKeeper(HandleHolder.java:94) at org.apache.flink.shaded.org.apache.curator.HandleHolder.internalClose(HandleHolder.java:128) at org.apache.flink.shaded.org.apache.curator.HandleHolder.closeAndReset(HandleHolder.java:77) at org.apache.flink.shaded.org.apache.curator.ConnectionState.reset(ConnectionState.java:261) at org.apache.flink.shaded.org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:221) at org.apache.flink.shaded.org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:94) at org.apache.flink.shaded.org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:117) at org.apache.flink.shaded.org.apache.curator.framework.imps.NamespaceImpl$1.call(NamespaceImpl.java:90) at org.apache.flink.shaded.org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:109) at org.apache.flink.shaded.org.apache.curator.framework.imps.NamespaceImpl.fixForNamespace(NamespaceImpl.java:83) at org.apache.flink.shaded.org.apache.curator.framework.imps.CuratorFrameworkImpl.fixForNamespace(CuratorFrameworkImpl.java:594) at org.apache.flink.shaded.org.apache.curator.framework.imps.ExistsBuilderImpl.forPath(ExistsBuilderImpl.java:158) at org.apache.flink.shaded.org.apache.curator.framework.imps.ExistsBuilderImpl.forPath(ExistsBuilderImpl.java:32) at org.apache.flink.shaded.org.apache.curator.framework.recipes.cache.NodeCache.reset(NodeCache.java:242) at org.apache.flink.shaded.org.apache.curator.framework.recipes.cache.NodeCache.start(NodeCache.java:175) at org.apache.flink.shaded.org.apache.curator.framework.recipes.cache.NodeCache.start(NodeCache.java:154) at org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService.start(ZooKeeperLeaderRetrievalService.java:100) at org.apache.flink.runtime.taskmanager.TaskManager.preStart(TaskManager.scala:205) at akka.actor.Actor$class.aroundPreStart(Actor.scala:472) at org.apache.flink.runtime.taskmanager.TaskManager.aroundPreStart(TaskManager.scala:120) at akka.actor.ActorCell.create(ActorCell.scala:580) at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:456) at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478) at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:263) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029)