We see a very similar (if not the same) error running version 1.9 on
Kubernetes. So far we have found that a TaskManager gets killed and a new
one is created, but the JM still thinks it needs to connect to the old (now
dead) TM. I was even able to see a TaskManager on the same host and port
but with different TM instance IDs in the Flink UI. The issue seems to be
persistent (i.e. it doesn't clear after a few minutes).
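
For anyone who wants to confirm that the address the JM keeps retrying is
really dead (rather than just slow), here is a minimal Python sketch. It is
not part of Flink; the function name `probe` and the host/port arguments
are placeholders I made up for illustration, and it only checks plain TCP
reachability, nothing Akka-specific.

```python
import socket
import sys


def probe(host, port, timeout=2.0):
    """Classify a TCP endpoint as 'open', 'refused', or 'unreachable'."""
    try:
        # create_connection performs a full TCP connect to (host, port)
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # Host answered but nothing listens on the port (e.g. the TM is gone)
        return "refused"
    except OSError:
        # Timeouts, DNS failures, no route to host, and similar errors
        return "unreachable"


if __name__ == "__main__" and len(sys.argv) == 3:
    # Hypothetical usage: python probe.py <tm-host> <tm-port>
    print(probe(sys.argv[1], int(sys.argv[2])))
```

A result of "refused" for the old TM's host and port would match the
"Connection refused" warnings quoted further down; "open" would suggest
something is listening there again, possibly the replacement TM.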

FWIW, the TM was dying due to the liveness probe in K8s. We have increased
that, but the above issue is still a concern.

Any ideas?

Tim

On Wed, Oct 9, 2019, 3:15 PM John Smith <java.dev....@gmail.com> wrote:

> Sorry, I've been away on leave. I'll check ASAP.
>
> On Thu, 3 Oct 2019 at 20:52, Zili Chen <wander4...@gmail.com> wrote:
>
>> Does the log you attached above come from a TaskManager node? If so,
>> what state is the Job node it tried to connect to in? Did it crash?
>>
>> BTW, it would be helpful if you could attach more of the TM and JM logs,
>> beyond the two lines saying the akka connection was refused.
>>
>>
>> On Fri, Oct 4, 2019 at 2:08 AM John Smith <java.dev....@gmail.com> wrote:
>>
>>> So I guess it had some older state?
>>>
>>> On Thu., Oct. 3, 2019, 11:29 a.m. John Smith, <java.dev....@gmail.com>
>>> wrote:
>>>
>>>> I'm running a standalone cluster with ZooKeeper. It seems it was trying
>>>> to connect to an older node. I rebooted the Job node that was complaining.
>>>> It seems to be OK now...
>>>>
>>>> I have 3 ZooKeepers, 3 Job nodes and 3 Task nodes.
>>>>
>>>> On Thu, 3 Oct 2019 at 11:15, Zili Chen <wander4...@gmail.com> wrote:
>>>>
>>>>> Hi John,
>>>>>
>>>>> could you provide some details, such as which mode you run in
>>>>> (standalone/YARN) and the related configuration (jobmanager.address,
>>>>> jobmanager.port and so on)?
>>>>>
>>>>> Best,
>>>>> tison.
>>>>>
>>>>>
>>>>> On Thu, Oct 3, 2019 at 11:02 PM John Smith <java.dev....@gmail.com> wrote:
>>>>>
>>>>>> Hi, I'm running 1.8. The cluster seems to be OK, but I see these
>>>>>> warnings in the logs...
>>>>>>
>>>>>> 2019-10-03 14:57:25,152 WARN  akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: /xxx.xxx.xxx.65:46167
>>>>>> 2019-10-03 14:57:25,156 WARN  akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://fl...@xxx.xxx.xxx.65:46167] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://fl...@xxx.xxx.xxx.65:46167]] Caused by: [Connection refused: /xxx.xxx.xxx.65:46167]
>>>>>>
>>>>>>
>>>>>>
