We see the same in 1.4. I don't think we saw this in 1.3. I started a thread on
this a while back, and Till asked for more details, but I haven't had a chance
to get back to him. If you can reproduce this easily, perhaps you can get to it
faster. I will find the thread and resend it.
Thanks,

-- Ashish 
 
On Fri, Feb 23, 2018 at 9:56 AM, jelmer <jkupe...@gmail.com> wrote:
We found out there's a taskmanager.exit-on-fatal-akka-error property that will
restart Flink in this situation, but it is not enabled by default and that feels
like a rather blunt tool. I expect systems like this to be more resilient to
this.
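For reference, a minimal sketch of what enabling that property could look like,
assuming it is set in the TaskManager's flink-conf.yaml. The property only makes
the TaskManager process exit once it has been quarantined; bringing it back up
is assumed to be handled by an external supervisor (for example a container
restart policy), which is not shown here.

```yaml
# flink-conf.yaml on the TaskManager (sketch, not a tested setup)
# Terminate the TaskManager process on a fatal Akka error such as being
# quarantined by the JobManager. Not enabled by default.
taskmanager.exit-on-fatal-akka-error: true
```

With this set, recovery depends entirely on whatever restarts the exited
process, which is why it is a blunt tool rather than a real reconnect.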
On 23 February 2018 at 14:42, Aljoscha Krettek <aljos...@apache.org> wrote:

@Till Is this the expected behaviour or do you suspect something could be going 
wrong?


On 23. Feb 2018, at 08:59, jelmer <jkupe...@gmail.com> wrote:
We've observed on our Flink 1.4.0 setup that if for some reason the networking
between the task manager and the job manager gets disrupted, the task manager
is never able to reconnect.
You'll end up with messages like this getting printed to the log repeatedly:
Trying to register at JobManager akka.tcp://flink@jobmanager:6123/user/jobmanager
(attempt 17, timeout: 30000 milliseconds)
Quarantined address [akka.tcp://flink@jobmanager:6123] is still unreachable or
has not been restarted. Keeping it quarantined.
Or alternatively

Tried to associate with unreachable remote address
[akka.tcp://flink@jobmanager:6123]. Address is now gated for 5000 ms, all
messages to this address will be delivered to dead letters. Reason: [The remote
system has quarantined this system. No further associations to the remote
system are possible until this system is restarted.]
But it never recovers until you restart either the job manager or the task
manager.
I was able to reproduce this behaviour in two Docker containers here:
https://github.com/jelmerk/flink-worker-not-rejoining
Has anyone else seen this problem?










  
