We see a very similar (if not the same) error running version 1.9 on Kubernetes. So far we have found that a taskmanager gets killed and a new one is created, but the JM still thinks it needs to connect to the old (now dead) TM. I was even able to see a taskmanager on the same host and port but with different TM instance IDs in the Flink UI. The issue seems to be persistent (i.e. it doesn't clear after a few minutes).
FWIW, the TM was dying due to the liveness probe in K8s. We have increased that, but the issue above is still a concern.

Any ideas?

Tim

On Wed, Oct 9, 2019, 3:15 PM John Smith <java.dev....@gmail.com> wrote:

> Sorry, been away on leave. I'll check ASAP.
>
> On Thu, 3 Oct 2019 at 20:52, Zili Chen <wander4...@gmail.com> wrote:
>
>> Does the log you attached above come from a TaskManager node? If so,
>> what state is the Job node it tried to connect to? Did it crash?
>>
>> BTW, it would be helpful if you could attach more of the TM and JM logs
>> beyond the two lines that say Akka connection refused.
>>
>> On Fri, Oct 4, 2019 at 2:08 AM, John Smith <java.dev....@gmail.com> wrote:
>>
>>> So I guess it had some older state?
>>>
>>> On Thu., Oct. 3, 2019, 11:29 a.m. John Smith, <java.dev....@gmail.com>
>>> wrote:
>>>
>>>> I'm running a standalone cluster with Zookeeper. It seems it was trying
>>>> to connect to an older node. I rebooted the Job node that was complaining.
>>>> It seems to be OK now...
>>>>
>>>> I have 3 Zookeepers, 3 Job nodes and 3 Task nodes.
>>>>
>>>> On Thu, 3 Oct 2019 at 11:15, Zili Chen <wander4...@gmail.com> wrote:
>>>>
>>>>> Hi John,
>>>>>
>>>>> Could you provide some details, such as which mode you run on
>>>>> (standalone/YARN) and the related configuration (jobmanager.address,
>>>>> jobmanager.port and so on)?
>>>>>
>>>>> Best,
>>>>> tison.
>>>>>
>>>>> On Thu, Oct 3, 2019 at 11:02 PM, John Smith <java.dev....@gmail.com> wrote:
>>>>>
>>>>>> Hi, running 1.8. The cluster seems to be OK, but I see these warnings in
>>>>>> the logs...
>>>>>>
>>>>>> 2019-10-03 14:57:25,152 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: /xxx.xxx.xxx.65:46167
>>>>>> 2019-10-03 14:57:25,156 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://fl...@xxx.xxx.xxx.65:46167] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://fl...@xxx.xxx.xxx.65:46167]] Caused by: [Connection refused: /xxx.xxx.xxx.65:46167]
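
For anyone trying to confirm which stale address is being retried, a rough sketch (not part of Flink) that tallies the `Connection refused: /host:port` fragments from warnings like the ones quoted above. The function name `count_refused` and the regex are assumptions made for illustration; the pattern is based only on the two log lines in this thread, and it will not match IPv6 literals.

```python
import re
from collections import Counter

# Matches the "Connection refused: /host:port" fragment seen in the Akka
# warnings quoted in this thread. \s+ tolerates the line wrapping that
# mail clients introduce. Assumption: host contains no ':' (no IPv6).
REFUSED = re.compile(r"Connection\s+refused:\s+/([^\s:\]]+):(\d+)")


def count_refused(log_text: str) -> Counter:
    """Count refused connections per (host, port) in a block of log text."""
    return Counter(
        (host, int(port)) for host, port in REFUSED.findall(log_text)
    )
```

Feeding it the full contents of a JM log and printing `count_refused(text).most_common()` should show whether the refusals all point at one old TM address (as described above) or at several.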