We've observed on our Flink 1.4.0 setup that if the network connection
between the task manager and the job manager is disrupted for any reason,
the task manager is never able to reconnect.

You'll end up with messages like these being printed to the log repeatedly:

Trying to register at JobManager
akka.tcp://flink@jobmanager:6123/user/jobmanager (attempt 17, timeout:
30000 milliseconds)
Quarantined address [akka.tcp://flink@jobmanager:6123] is still
unreachable or has not been restarted. Keeping it quarantined.


Or, alternatively:


Tried to associate with unreachable remote address
[akka.tcp://flink@jobmanager:6123]. Address is now gated for 5000 ms,
all messages to this address will be delivered to dead letters.
Reason: [The remote system has quarantined this system. No further
associations to the remote system are possible until this system is
restarted.


But it never recovers until you restart either the job manager or the task
manager.

I was able to reproduce this behaviour with two Docker containers; the
setup is here:

https://github.com/jelmerk/flink-worker-not-rejoining

Has anyone else seen this problem?
