Oh that's fine. I was just wondering why it happened. It seems to have gone
away since the reboot.

On Fri, 18 Oct 2019 at 10:43, Till Rohrmann <trohrm...@apache.org> wrote:

> Hi John,
>
> the reason why you are seeing these warnings is because Akka tries to
> re-establish the connection to a lost endpoint (here a dead TaskExecutor).
> This should continue until the connection is either quarantined or if the
> underlying ActorRef to the remote endpoint has been garbage collected. The
> former should not really happen and the latter should happen after Flink
> has realized that the TaskExecutor has died. Flink uses its own heartbeats
> to detect this. Depending on the configuration (default value is 50s), this
> can take a bit. However, the warnings should eventually stop to be
> displayed.
>
> I admit that this is not ideal in a scenario where TaskExecutors die
> regularly but it helps to debug problematic scenarios. One way to suppress
> these statements is to set the logger for akka.remote to ERROR. But then
> one would not see if Akka has lost the connection and tries to reconnect.
>
> Cheers,
> Till
>
> On Thu, Oct 10, 2019 at 5:31 PM John Smith <java.dev....@gmail.com> wrote:
>
>> Ok so it seems there was some sort of network issue. Then leader
>> election. But it seems it had some old state and kept trying to connect to
>> the same task machine over and over...?
>>
>> 2019-09-19 22:26:14,841 INFO
>>  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  -
>> Unable to read additional data from server sessionid 0xXXXXXX, likely
>> server has closed socket, closing socket connection and attempting reconnect
>> 2019-09-19 22:26:14,946 INFO
>>  
>> org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager
>>  - State change: SUSPENDED
>> 2019-09-19 22:26:14,947 WARN
>>  org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  -
>> Connection to ZooKeeper suspended. The contender http://XXXXXX-2:8081 no
>> longer participates in the leader election.
>> 2019-09-19 22:26:14,947 WARN
>>  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService
>>  - Connection to ZooKeeper suspended. Can no longer retrieve the leader
>> from ZooKeeper.
>> 2019-09-19 22:26:14,947 WARN
>>  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService
>>  - Connection to ZooKeeper suspended. Can no longer retrieve the leader
>> from ZooKeeper.
>> 2019-09-19 22:26:14,948 WARN
>>  org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  -
>> Connection to ZooKeeper suspended. The contender 
>> akka.tcp://flink@XXXXXX-2:37697/user/resourcemanager
>> no longer participates in the leader election.
>> 2019-09-19 22:26:14,948 WARN
>>  org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  -
>> Connection to ZooKeeper suspended. The contender 
>> akka.tcp://flink@fXXXXXX-2:37697/user/dispatcher
>> no longer participates in the leader election.
>> 2019-09-19 22:26:14,949 WARN
>>  org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  -
>> ZooKeeper connection SUSPENDING. Changes to the submitted job graphs are
>> not monitored (temporarily).
>> 2019-09-19 22:26:25,185 WARN
>>  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - SASL
>> configuration failed: javax.security.auth.login.LoginException: No JAAS
>> configuration section named 'Client' was found in specified JAAS
>> configuration file: '/tmp/jaas-2423287132287811787.conf'. Will continue
>> connection to Zookeeper server without SASL authentication, if Zookeeper
>> server allows it.
>> 2019-09-19 22:26:25,186 INFO
>>  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  -
>> Opening socket connection to server XXXXXX.71/XXXXXX.71:2181
>> 2019-09-19 22:26:25,186 ERROR
>> org.apache.flink.shaded.curator.org.apache.curator.ConnectionState  -
>> Authentication failed
>> 2019-09-19 22:26:25,192 INFO
>>  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  -
>> Socket connection established to XXXXXX.71/XXXXXX.71:2181, initiating
>> session
>> 2019-09-19 22:26:25,199 WARN
>>  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  -
>> Unable to reconnect to ZooKeeper service, session 0x3017fc1a6660000 has
>> expired
>> 2019-09-19 22:26:25,199 INFO
>>  
>> org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager
>>  - State change: LOST
>> 2019-09-19 22:26:25,199 WARN
>>  org.apache.flink.shaded.curator.org.apache.curator.ConnectionState  -
>> Session expired event received
>> 2019-09-19 22:26:25,199 WARN
>>  org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  -
>> Connection to ZooKeeper lost. The contender http://XXXXXX-2:8081 no
>> longer participates in the leader election.
>> 2019-09-19 22:26:25,199 WARN
>>  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService
>>  - Connection to ZooKeeper lost. Can no longer retrieve the leader from
>> ZooKeeper.
>> 2019-09-19 22:26:25,200 WARN
>>  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService
>>  - Connection to ZooKeeper lost. Can no longer retrieve the leader from
>> ZooKeeper.
>> 2019-09-19 22:26:25,199 INFO
>>  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  -
>> Initiating client connection,
>> connectString=XXXXXX-1.XXXXXX:2181,XXXXXX-2.XXXXXX:2181,XXXXXX-3.XXXXXX:2181
>> sessionTimeout=60000
>> watcher=org.apache.flink.shaded.curator.org.apache.curator.ConnectionState@2bec854f
>> 2019-09-19 22:26:25,200 WARN
>>  org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  -
>> Connection to ZooKeeper lost. The contender 
>> akka.tcp://flink@XXXXXX-2:37697/user/resourcemanager
>> no longer participates in the leader election.
>> 2019-09-19 22:26:25,200 WARN
>>  org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  -
>> Connection to ZooKeeper lost. The contender 
>> akka.tcp://flink@XXXXXX-2:37697/user/dispatcher
>> no longer participates in the leader election.
>> 2019-09-19 22:26:25,201 INFO
>>  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  -
>> Unable to reconnect to ZooKeeper service, session 0x3017fc1a6660000 has
>> expired, closing socket connection
>> 2019-09-19 22:26:25,201 WARN
>>  org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  -
>> ZooKeeper connection LOST. Changes to the submitted job graphs are not
>> monitored (permanently).
>> 2019-09-19 22:26:25,220 INFO
>>  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  -
>> EventThread shut down for session: 0x3017fc1a6660000
>> 2019-09-19 22:26:25,231 WARN
>>  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - SASL
>> configuration failed: javax.security.auth.login.LoginException: No JAAS
>> configuration section named 'Client' was found in specified JAAS
>> configuration file: '/tmp/jaas-2423287132287811787.conf'. Will continue
>> connection to Zookeeper server without SASL authentication, if Zookeeper
>> server allows it.
>> 2019-09-19 22:26:25,232 INFO
>>  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  -
>> Opening socket connection to server XXXXXX.33/XXXXXX.33:2181
>> 2019-09-19 22:26:25,232 ERROR
>> org.apache.flink.shaded.curator.org.apache.curator.ConnectionState  -
>> Authentication failed
>> 2019-09-19 22:26:25,233 INFO
>>  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  -
>> Socket connection established to XXXXXX.33/XXXXXX.33:2181, initiating
>> session
>> 2019-09-19 22:26:25,247 INFO
>>  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  -
>> Session establishment complete on server XXXXXX.33/XXXXXX.33:2181,
>> sessionid = 0x301db1787060000, negotiated timeout = 40000
>> 2019-09-19 22:26:25,247 INFO
>>  
>> org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager
>>  - State change: RECONNECTED
>> 2019-09-19 22:26:25,248 INFO
>>  org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  -
>> Connection to ZooKeeper was reconnected. Leader election can be restarted.
>> 2019-09-19 22:26:25,253 INFO
>>  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService
>>  - Connection to ZooKeeper was reconnected. Leader retrieval can be
>> restarted.
>> 2019-09-19 22:26:25,253 INFO
>>  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService
>>  - Connection to ZooKeeper was reconnected. Leader retrieval can be
>> restarted.
>> 2019-09-19 22:26:25,253 INFO
>>  org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  -
>> Connection to ZooKeeper was reconnected. Leader election can be restarted.
>> 2019-09-19 22:26:25,253 INFO
>>  org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  -
>> Connection to ZooKeeper was reconnected. Leader election can be restarted.
>> 2019-09-19 22:26:25,253 INFO
>>  org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  -
>> ZooKeeper connection RECONNECTED. Changes to the submitted job graphs are
>> monitored again.
>> 2019-09-19 22:26:34,376 WARN  akka.remote.ReliableDeliverySupervisor
>>                    - Association with remote system
>> [akka.tcp://flink@XXXXXX.11:46167] has failed, address is now gated for
>> [50] ms. Reason: [Disassociated]
>> 2019-09-19 22:26:34,376 WARN  akka.remote.ReliableDeliverySupervisor
>>                    - Association with remote system
>> [akka.tcp://flink-metrics@XXXXXX.11:38091] has failed, address is now
>> gated for [50] ms. Reason: [Disassociated]
>> 2019-09-19 22:26:35,147 WARN  akka.remote.transport.netty.NettyTransport
>>                    - Remote connection to [null] failed with
>> java.net.ConnectException: Connection refused: /XXXXXX.11:46167
>> 2019-09-19 22:26:35,149 WARN  akka.remote.ReliableDeliverySupervisor
>>                    - Association with remote system
>> [akka.tcp://flink@XXXXXX.11:46167] has failed, address is now gated for
>> [50] ms. Reason: [Association failed with [akka.tcp://flink@XXXXXX.11:46167]]
>> Caused by: [Connection refused: /XXXXXX.11:46167]
>> 2019-09-19 22:26:45,167 WARN  akka.remote.transport.netty.NettyTransport
>>                    - Remote connection to [null] failed with
>> java.net.ConnectException: Connection refused: /XXXXXX.11:46167
>> 2019-09-19 22:26:45,168 WARN  akka.remote.ReliableDeliverySupervisor
>>                    - Association with remote system
>> [akka.tcp://flink@XXXXXX.11:46167] has failed, address is now gated for
>> [50] ms. Reason: [Association failed with [akka.tcp://flink@XXXXXX.11:46167]]
>> Caused by: [Connection refused: /XXXXXX.11:46167]
>> 2019-09-19 22:26:55,151 WARN  akka.remote.transport.netty.NettyTransport
>>                    - Remote connection to [null] failed with
>> java.net.ConnectException: Connection refused: /XXXXXX.11:46167
>> 2019-09-19 22:26:55,153 WARN  akka.remote.ReliableDeliverySupervisor
>>                    - Association with remote system
>> [akka.tcp://flink@XXXXXX.11:46167] has failed, address is now gated for
>> [50] ms. Reason: [Association failed with [akka.tcp://flink@XXXXXX.11:46167]]
>> Caused by: [Connection refused: /XXXXXX.11:46167]
>> 2019-09-19 22:27:05,159 WARN  akka.remote.transport.netty.NettyTransport
>>                    - Remote connection to [null] failed with
>> java.net.ConnectException: Connection refused: /XXXXXX.11:46167
>> 2019-09-19 22:27:05,160 WARN  akka.remote.ReliableDeliverySupervisor
>>                    - Association with remote system
>> [akka.tcp://flink@XXXXXX.11:46167] has failed, address is now gated for
>> [50] ms. Reason: [Association failed with [akka.tcp://flink@XXXXXX.11:46167]]
>> Caused by: [Connection refused: /XXXXXX.11:46167]
>> 2019-09-19 22:27:15,157 WARN  akka.remote.transport.netty.NettyTransport
>>                    - Remote connection to [null] failed with
>> java.net.ConnectException: Connection refused: /XXXXXX.11:46167
>> 2019-09-19 22:27:15,161 WARN  akka.remote.ReliableDeliverySupervisor
>>                    - Association with remote system
>> [akka.tcp://flink@XXXXXX.11:46167] has failed, address is now gated for
>> [50] ms. Reason: [Association failed with [akka.tcp://flink@XXXXXX.11:46167]]
>> Caused by: [Connection refused: /XXXXXX.11:46167]
>> 2019-09-19 22:27:25,152 WARN  akka.remote.transport.netty.NettyTransport
>>                    - Remote connection to [null] failed with
>> java.net.ConnectException: Connection refused: /XXXXXX.11:46167
>> 2019-09-19 22:27:25,160 WARN  akka.remote.ReliableDeliverySupervisor
>>                    - Association with remote system
>> [akka.tcp://flink@XXXXXX.11:46167] has failed, address is now gated for
>> [50] ms. Reason: [Association failed with [akka.tcp://flink@XXXXXX.11:46167]]
>> Caused by: [Connection refused: /XXXXXX.11:46167]
>> 2019-09-19 22:27:35,161 WARN  akka.remote.transport.netty.NettyTransport
>>                    - Remote connection to [null] failed with
>> java.net.ConnectException: Connection refused: /XXXXXX.11:46167
>> 2019-09-19 22:27:35,165 WARN  akka.remote.ReliableDeliverySupervisor
>>                    - Association with remote system
>> [akka.tcp://flink@XXXXXX.11:46167] has failed, address is now gated for
>> [50] ms. Reason: [Association failed with [akka.tcp://flink@XXXXXX.11:46167]]
>> Caused by: [Connection refused: /XXXXXX.11:46167]
>>
>>
>>
>> On Wed, 9 Oct 2019 at 19:44, Timothy Victor <vict...@gmail.com> wrote:
>>
>>> We see a very similar (if not the same) error running version 1.9 on
>>> Kubernetes.   So far what we have discovered is that a taskmanager gets
>>> killed and a new one is created, but JM still thinks it needs to connect to
>>> the old (now dead TM).  I was even able to see the a taskmanager on the
>>> same host and port but with different TM instance ids in the Flink UI.  The
>>> issue seems to be persistent (i.e. doesn't clear after a few minutes).
>>>
>>> FWIW...TM was dying due to livenessprobe in K8s.   We have increased
>>> that, but still the above issue is a concern.
>>>
>>> Any ideas?
>>>
>>> Tim
>>>
>>> On Wed, Oct 9, 2019, 3:15 PM John Smith <java.dev....@gmail.com> wrote:
>>>
>>>> Sorry been away on leave. I'll check ASAP.
>>>>
>>>> On Thu, 3 Oct 2019 at 20:52, Zili Chen <wander4...@gmail.com> wrote:
>>>>
>>>>> Does the log you attached above come from a TaskManager Node? If so,
>>>>> what state is the Job node it tried to connect to? Did it crash?
>>>>>
>>>>> BTW, it would be helpful if you can attach more logs of TM and JM
>>>>> except
>>>>> two lines said akka connection refused.
>>>>>
>>>>>
>>>>> John Smith <java.dev....@gmail.com> 于2019年10月4日周五 上午2:08写道:
>>>>>
>>>>>> So I guess it had some older state?
>>>>>>
>>>>>> On Thu., Oct. 3, 2019, 11:29 a.m. John Smith, <java.dev....@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> I'm running standalone cluster with Zookeeper. It seems it was
>>>>>>> trying to connect to an older node. I rebooted the Job node tha was
>>>>>>> complaining. It seems to be ok now...
>>>>>>>
>>>>>>> I have 3 Zookeepers, 3 Job Nodes and 3 Tasks Nodes
>>>>>>>
>>>>>>> On Thu, 3 Oct 2019 at 11:15, Zili Chen <wander4...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi John,
>>>>>>>>
>>>>>>>> could you provide some details such as which mode you runs
>>>>>>>> on(standalone/YARN)
>>>>>>>> and related configuration(jobmanager.address jobmanager.port and so
>>>>>>>> on)?
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> tison.
>>>>>>>>
>>>>>>>>
>>>>>>>> John Smith <java.dev....@gmail.com> 于2019年10月3日周四 下午11:02写道:
>>>>>>>>
>>>>>>>>> Hi running 1.8 the cluster seems to be OK but I see these warnings
>>>>>>>>> in the logs...
>>>>>>>>>
>>>>>>>>> 2019-10-03 14:57:25,152 WARN
>>>>>>>>>  akka.remote.transport.netty.NettyTransport                    - 
>>>>>>>>> Remote
>>>>>>>>> connection to [null] failed with java.net.ConnectException: Connection
>>>>>>>>> refused: /xxx.xxx.xxx.65:46167
>>>>>>>>> 2019-10-03 14:57:25,156 WARN
>>>>>>>>>  akka.remote.ReliableDeliverySupervisor                        -
>>>>>>>>> Association with remote system [akka.tcp://fl...@xxx.xxx.xxx.65:46167]
>>>>>>>>> has failed, address is now gated for [50] ms. Reason: [Association 
>>>>>>>>> failed
>>>>>>>>> with [akka.tcp://fl...@xxx.xxx.xxx.65:46167]] Caused by:
>>>>>>>>> [Connection refused: /xxx.xxx.xxx.65:46167]
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>

Reply via email to