Oh that's fine. I was just wondering why it happened. It seems to have gone away since the reboot.
On Fri, 18 Oct 2019 at 10:43, Till Rohrmann <trohrm...@apache.org> wrote: > Hi John, > > the reason why you are seeing these warnings is because Akka tries to > re-establish the connection to a lost endpoint (here a dead TaskExecutor). > This should continue until the connection is either quarantined or if the > underlying ActorRef to the remote endpoint has been garbage collected. The > former should not really happen and the latter should happen after Flink > has realized that the TaskExecutor has died. Flink uses its own heartbeats > to detect this. Depending on the configuration (default value is 50s), this > can take a bit. However, the warnings should eventually stop to be > displayed. > > I admit that this is not ideal in a scenario where TaskExecutors die > regularly but it helps to debug problematic scenarios. One way to suppress > these statements is to set the logger for akka.remote to ERROR. But then > one would not see if Akka has lost the connection and tries to reconnect. > > Cheers, > Till > > On Thu, Oct 10, 2019 at 5:31 PM John Smith <java.dev....@gmail.com> wrote: > >> Ok so it seems there was some sort of network issue. Then leader >> election. But it seems it had some old state and kept trying to connect to >> the same task machine over and over...? >> >> 2019-09-19 22:26:14,841 INFO >> org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - >> Unable to read additional data from server sessionid 0xXXXXXX, likely >> server has closed socket, closing socket connection and attempting reconnect >> 2019-09-19 22:26:14,946 INFO >> >> org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager >> - State change: SUSPENDED >> 2019-09-19 22:26:14,947 WARN >> org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - >> Connection to ZooKeeper suspended. The contender http://XXXXXX-2:8081 no >> longer participates in the leader election. >> 2019-09-19 22:26:14,947 WARN >> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService >> - Connection to ZooKeeper suspended. Can no longer retrieve the leader >> from ZooKeeper. >> 2019-09-19 22:26:14,947 WARN >> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService >> - Connection to ZooKeeper suspended. Can no longer retrieve the leader >> from ZooKeeper. >> 2019-09-19 22:26:14,948 WARN >> org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - >> Connection to ZooKeeper suspended. The contender >> akka.tcp://flink@XXXXXX-2:37697/user/resourcemanager >> no longer participates in the leader election. >> 2019-09-19 22:26:14,948 WARN >> org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - >> Connection to ZooKeeper suspended. The contender >> akka.tcp://flink@fXXXXXX-2:37697/user/dispatcher >> no longer participates in the leader election. >> 2019-09-19 22:26:14,949 WARN >> org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore - >> ZooKeeper connection SUSPENDING. Changes to the submitted job graphs are >> not monitored (temporarily). >> 2019-09-19 22:26:25,185 WARN >> org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - SASL >> configuration failed: javax.security.auth.login.LoginException: No JAAS >> configuration section named 'Client' was found in specified JAAS >> configuration file: '/tmp/jaas-2423287132287811787.conf'. Will continue >> connection to Zookeeper server without SASL authentication, if Zookeeper >> server allows it. >> 2019-09-19 22:26:25,186 INFO >> org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - >> Opening socket connection to server XXXXXX.71/XXXXXX.71:2181 >> 2019-09-19 22:26:25,186 ERROR >> org.apache.flink.shaded.curator.org.apache.curator.ConnectionState - >> Authentication failed >> 2019-09-19 22:26:25,192 INFO >> org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - >> Socket connection established to XXXXXX.71/XXXXXX.71:2181, initiating >> session >> 2019-09-19 22:26:25,199 WARN >> org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - >> Unable to reconnect to ZooKeeper service, session 0x3017fc1a6660000 has >> expired >> 2019-09-19 22:26:25,199 INFO >> >> org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager >> - State change: LOST >> 2019-09-19 22:26:25,199 WARN >> org.apache.flink.shaded.curator.org.apache.curator.ConnectionState - >> Session expired event received >> 2019-09-19 22:26:25,199 WARN >> org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - >> Connection to ZooKeeper lost. The contender http://XXXXXX-2:8081 no >> longer participates in the leader election. >> 2019-09-19 22:26:25,199 WARN >> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService >> - Connection to ZooKeeper lost. Can no longer retrieve the leader from >> ZooKeeper. >> 2019-09-19 22:26:25,200 WARN >> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService >> - Connection to ZooKeeper lost. Can no longer retrieve the leader from >> ZooKeeper. >> 2019-09-19 22:26:25,199 INFO >> org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - >> Initiating client connection, >> connectString=XXXXXX-1.XXXXXX:2181,XXXXXX-2.XXXXXX:2181,XXXXXX-3.XXXXXX:2181 >> sessionTimeout=60000 >> watcher=org.apache.flink.shaded.curator.org.apache.curator.ConnectionState@2bec854f >> 2019-09-19 22:26:25,200 WARN >> org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - >> Connection to ZooKeeper lost. The contender >> akka.tcp://flink@XXXXXX-2:37697/user/resourcemanager >> no longer participates in the leader election. >> 2019-09-19 22:26:25,200 WARN >> org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - >> Connection to ZooKeeper lost. The contender >> akka.tcp://flink@XXXXXX-2:37697/user/dispatcher >> no longer participates in the leader election. >> 2019-09-19 22:26:25,201 INFO >> org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - >> Unable to reconnect to ZooKeeper service, session 0x3017fc1a6660000 has >> expired, closing socket connection >> 2019-09-19 22:26:25,201 WARN >> org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore - >> ZooKeeper connection LOST. Changes to the submitted job graphs are not >> monitored (permanently). >> 2019-09-19 22:26:25,220 INFO >> org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - >> EventThread shut down for session: 0x3017fc1a6660000 >> 2019-09-19 22:26:25,231 WARN >> org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - SASL >> configuration failed: javax.security.auth.login.LoginException: No JAAS >> configuration section named 'Client' was found in specified JAAS >> configuration file: '/tmp/jaas-2423287132287811787.conf'. Will continue >> connection to Zookeeper server without SASL authentication, if Zookeeper >> server allows it. >> 2019-09-19 22:26:25,232 INFO >> org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - >> Opening socket connection to server XXXXXX.33/XXXXXX.33:2181 >> 2019-09-19 22:26:25,232 ERROR >> org.apache.flink.shaded.curator.org.apache.curator.ConnectionState - >> Authentication failed >> 2019-09-19 22:26:25,233 INFO >> org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - >> Socket connection established to XXXXXX.33/XXXXXX.33:2181, initiating >> session >> 2019-09-19 22:26:25,247 INFO >> org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - >> Session establishment complete on server XXXXXX.33/XXXXXX.33:2181, >> sessionid = 0x301db1787060000, negotiated timeout = 40000 >> 2019-09-19 22:26:25,247 INFO >> >> org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager >> - State change: RECONNECTED >> 2019-09-19 22:26:25,248 INFO >> org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - >> Connection to ZooKeeper was reconnected. Leader election can be restarted. >> 2019-09-19 22:26:25,253 INFO >> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService >> - Connection to ZooKeeper was reconnected. Leader retrieval can be >> restarted. >> 2019-09-19 22:26:25,253 INFO >> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService >> - Connection to ZooKeeper was reconnected. Leader retrieval can be >> restarted. >> 2019-09-19 22:26:25,253 INFO >> org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - >> Connection to ZooKeeper was reconnected. Leader election can be restarted. >> 2019-09-19 22:26:25,253 INFO >> org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - >> Connection to ZooKeeper was reconnected. Leader election can be restarted. >> 2019-09-19 22:26:25,253 INFO >> org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore - >> ZooKeeper connection RECONNECTED. Changes to the submitted job graphs are >> monitored again. >> 2019-09-19 22:26:34,376 WARN akka.remote.ReliableDeliverySupervisor >> - Association with remote system >> [akka.tcp://flink@XXXXXX.11:46167] has failed, address is now gated for >> [50] ms. Reason: [Disassociated] >> 2019-09-19 22:26:34,376 WARN akka.remote.ReliableDeliverySupervisor >> - Association with remote system >> [akka.tcp://flink-metrics@XXXXXX.11:38091] has failed, address is now >> gated for [50] ms. Reason: [Disassociated] >> 2019-09-19 22:26:35,147 WARN akka.remote.transport.netty.NettyTransport >> - Remote connection to [null] failed with >> java.net.ConnectException: Connection refused: /XXXXXX.11:46167 >> 2019-09-19 22:26:35,149 WARN akka.remote.ReliableDeliverySupervisor >> - Association with remote system >> [akka.tcp://flink@XXXXXX.11:46167] has failed, address is now gated for >> [50] ms. Reason: [Association failed with [akka.tcp://flink@XXXXXX.11:46167]] >> Caused by: [Connection refused: /XXXXXX.11:46167] >> 2019-09-19 22:26:45,167 WARN akka.remote.transport.netty.NettyTransport >> - Remote connection to [null] failed with >> java.net.ConnectException: Connection refused: /XXXXXX.11:46167 >> 2019-09-19 22:26:45,168 WARN akka.remote.ReliableDeliverySupervisor >> - Association with remote system >> [akka.tcp://flink@XXXXXX.11:46167] has failed, address is now gated for >> [50] ms. Reason: [Association failed with [akka.tcp://flink@XXXXXX.11:46167]] >> Caused by: [Connection refused: /XXXXXX.11:46167] >> 2019-09-19 22:26:55,151 WARN akka.remote.transport.netty.NettyTransport >> - Remote connection to [null] failed with >> java.net.ConnectException: Connection refused: /XXXXXX.11:46167 >> 2019-09-19 22:26:55,153 WARN akka.remote.ReliableDeliverySupervisor >> - Association with remote system >> [akka.tcp://flink@XXXXXX.11:46167] has failed, address is now gated for >> [50] ms. Reason: [Association failed with [akka.tcp://flink@XXXXXX.11:46167]] >> Caused by: [Connection refused: /XXXXXX.11:46167] >> 2019-09-19 22:27:05,159 WARN akka.remote.transport.netty.NettyTransport >> - Remote connection to [null] failed with >> java.net.ConnectException: Connection refused: /XXXXXX.11:46167 >> 2019-09-19 22:27:05,160 WARN akka.remote.ReliableDeliverySupervisor >> - Association with remote system >> [akka.tcp://flink@XXXXXX.11:46167] has failed, address is now gated for >> [50] ms. Reason: [Association failed with [akka.tcp://flink@XXXXXX.11:46167]] >> Caused by: [Connection refused: /XXXXXX.11:46167] >> 2019-09-19 22:27:15,157 WARN akka.remote.transport.netty.NettyTransport >> - Remote connection to [null] failed with >> java.net.ConnectException: Connection refused: /XXXXXX.11:46167 >> 2019-09-19 22:27:15,161 WARN akka.remote.ReliableDeliverySupervisor >> - Association with remote system >> [akka.tcp://flink@XXXXXX.11:46167] has failed, address is now gated for >> [50] ms. Reason: [Association failed with [akka.tcp://flink@XXXXXX.11:46167]] >> Caused by: [Connection refused: /XXXXXX.11:46167] >> 2019-09-19 22:27:25,152 WARN akka.remote.transport.netty.NettyTransport >> - Remote connection to [null] failed with >> java.net.ConnectException: Connection refused: /XXXXXX.11:46167 >> 2019-09-19 22:27:25,160 WARN akka.remote.ReliableDeliverySupervisor >> - Association with remote system >> [akka.tcp://flink@XXXXXX.11:46167] has failed, address is now gated for >> [50] ms. Reason: [Association failed with [akka.tcp://flink@XXXXXX.11:46167]] >> Caused by: [Connection refused: /XXXXXX.11:46167] >> 2019-09-19 22:27:35,161 WARN akka.remote.transport.netty.NettyTransport >> - Remote connection to [null] failed with >> java.net.ConnectException: Connection refused: /XXXXXX.11:46167 >> 2019-09-19 22:27:35,165 WARN akka.remote.ReliableDeliverySupervisor >> - Association with remote system >> [akka.tcp://flink@XXXXXX.11:46167] has failed, address is now gated for >> [50] ms. Reason: [Association failed with [akka.tcp://flink@XXXXXX.11:46167]] >> Caused by: [Connection refused: /XXXXXX.11:46167] >> >> >> >> On Wed, 9 Oct 2019 at 19:44, Timothy Victor <vict...@gmail.com> wrote: >> >>> We see a very similar (if not the same) error running version 1.9 on >>> Kubernetes. So far what we have discovered is that a taskmanager gets >>> killed and a new one is created, but JM still thinks it needs to connect to >>> the old (now dead TM). I was even able to see the a taskmanager on the >>> same host and port but with different TM instance ids in the Flink UI. The >>> issue seems to be persistent (i.e. doesn't clear after a few minutes). >>> >>> FWIW...TM was dying due to livenessprobe in K8s. We have increased >>> that, but still the above issue is a concern. >>> >>> Any ideas? >>> >>> Tim >>> >>> On Wed, Oct 9, 2019, 3:15 PM John Smith <java.dev....@gmail.com> wrote: >>> >>>> Sorry been away on leave. I'll check ASAP. >>>> >>>> On Thu, 3 Oct 2019 at 20:52, Zili Chen <wander4...@gmail.com> wrote: >>>> >>>>> Does the log you attached above come from a TaskManager Node? If so, >>>>> what state is the Job node it tried to connect to? Did it crash? >>>>> >>>>> BTW, it would be helpful if you can attach more logs of TM and JM >>>>> except >>>>> two lines said akka connection refused. >>>>> >>>>> >>>>> John Smith <java.dev....@gmail.com> 于2019年10月4日周五 上午2:08写道: >>>>> >>>>>> So I guess it had some older state? >>>>>> >>>>>> On Thu., Oct. 3, 2019, 11:29 a.m. John Smith, <java.dev....@gmail.com> >>>>>> wrote: >>>>>> >>>>>>> I'm running standalone cluster with Zookeeper. It seems it was >>>>>>> trying to connect to an older node. I rebooted the Job node tha was >>>>>>> complaining. It seems to be ok now... >>>>>>> >>>>>>> I have 3 Zookeepers, 3 Job Nodes and 3 Tasks Nodes >>>>>>> >>>>>>> On Thu, 3 Oct 2019 at 11:15, Zili Chen <wander4...@gmail.com> wrote: >>>>>>> >>>>>>>> Hi John, >>>>>>>> >>>>>>>> could you provide some details such as which mode you runs >>>>>>>> on(standalone/YARN) >>>>>>>> and related configuration(jobmanager.address jobmanager.port and so >>>>>>>> on)? >>>>>>>> >>>>>>>> Best, >>>>>>>> tison. >>>>>>>> >>>>>>>> >>>>>>>> John Smith <java.dev....@gmail.com> 于2019年10月3日周四 下午11:02写道: >>>>>>>> >>>>>>>>> Hi running 1.8 the cluster seems to be OK but I see these warnings >>>>>>>>> in the logs... >>>>>>>>> >>>>>>>>> 2019-10-03 14:57:25,152 WARN >>>>>>>>> akka.remote.transport.netty.NettyTransport - >>>>>>>>> Remote >>>>>>>>> connection to [null] failed with java.net.ConnectException: Connection >>>>>>>>> refused: /xxx.xxx.xxx.65:46167 >>>>>>>>> 2019-10-03 14:57:25,156 WARN >>>>>>>>> akka.remote.ReliableDeliverySupervisor - >>>>>>>>> Association with remote system [akka.tcp://fl...@xxx.xxx.xxx.65:46167] >>>>>>>>> has failed, address is now gated for [50] ms. Reason: [Association >>>>>>>>> failed >>>>>>>>> with [akka.tcp://fl...@xxx.xxx.xxx.65:46167]] Caused by: >>>>>>>>> [Connection refused: /xxx.xxx.xxx.65:46167] >>>>>>>>> >>>>>>>>> >>>>>>>>>