Re: Task Manager detached under load

Kien Truong Sat, 20 Jan 2018 03:57:49 -0800

Hi,

You should enable and check your garbage collection log.

We've encountered cases where a Task Manager was disassociated due to a long GC pause.
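
As a rough sketch of what to look for: assuming a JDK 8 style GC log written with -XX:+PrintGCApplicationStoppedTime (for example passed through env.java.opts in flink-conf.yaml together with -Xloggc:<path>), each stop-the-world pause shows up as a "Total time for which application threads were stopped" line. The helper below, whose name long_pauses and 10 second default threshold are my own placeholders and not anything from Flink, lists the pauses long enough to be worth comparing against the timestamps of the "Detected unreachable" warnings.

```python
import re

# Matches the safepoint line that JDK 8 emits with -XX:+PrintGCApplicationStoppedTime.
# Assumption: the log uses this JDK 8 wording; other JVM versions format it differently.
STOPPED = re.compile(
    r"Total time for which application threads were stopped: "
    r"(\d+(?:[.,]\d+)?) seconds"
)


def long_pauses(lines, threshold_seconds=10.0):
    """Return (line_number, pause_seconds) for each pause at or above the threshold.

    threshold_seconds is a hypothetical default; pick a value relative to the
    heartbeat/timeout settings of your own cluster.
    """
    hits = []
    for number, line in enumerate(lines, start=1):
        match = STOPPED.search(line)
        if match is None:
            continue  # not a safepoint line
        # Some locales print a decimal comma, so normalise before converting.
        pause = float(match.group(1).replace(",", "."))
        if pause >= threshold_seconds:
            hits.append((number, pause))
    return hits
```

To use it, open the GC log of the Task Manager (and of the Job Manager) and pass the file object to long_pauses; any hit shortly before 12:53:34 or 12:54:48 in your excerpt would point at GC rather than the network.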



Regards,

Kien

On 1/20/2018 1:27 AM, ashish pok wrote:

Hi All,
We have hit some load-related issues and were wondering if anyone has suggestions. We are noticing task managers and job managers being detached from each other under load and never really syncing up again. As a result, the Flink session shows 0 slots available for processing. Even though the apps are configured to restart, it isn't really helping, as there are no slots available to run the apps.
Here are excerpts from the logs that seemed relevant. (I am trimming out the rest of the logs for brevity.)
*Job Manager:*
2018-01-19 12:38:00,423 INFO org.apache.flink.runtime.jobmanager.JobManager - Starting JobManager (Version: 1.4.0, Rev:3a9d9f2, Date:06.12.2017 @ 11:08:40 UTC)
2018-01-19 12:38:00,792 INFO org.apache.flink.runtime.jobmanager.JobManager - Maximum heap size: 16384 MiBytes
2018-01-19 12:38:00,794 INFO org.apache.flink.runtime.jobmanager.JobManager - Hadoop version: 2.6.5
2018-01-19 12:38:00,794 INFO org.apache.flink.runtime.jobmanager.JobManager - JVM Options:
2018-01-19 12:38:00,794 INFO org.apache.flink.runtime.jobmanager.JobManager - -Xms16384m
2018-01-19 12:38:00,794 INFO org.apache.flink.runtime.jobmanager.JobManager - -Xmx16384m
2018-01-19 12:38:00,795 INFO org.apache.flink.runtime.jobmanager.JobManager - -XX:+UseG1GC
2018-01-19 12:38:00,908 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.rpc.port, 6123
2018-01-19 12:38:00,908 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.heap.mb, 16384
2018-01-19 12:53:34,671 WARN akka.remote.RemoteWatcher - Detected unreachable: [akka.tcp://flink@<jm-host>:37840]
2018-01-19 12:53:34,676 INFO org.apache.flink.runtime.jobmanager.JobManager - Task manager akka.tcp://flink@<jm-host>:37840/user/taskmanager terminated.
-- So once the Flink session boots up, we are hitting it with pretty heavy load, which typically results in the WARN above.
*Task Manager:*
2018-01-19 12:38:01,002 INFO org.apache.flink.runtime.taskmanager.TaskManager - Starting TaskManager (Version: 1.4.0, Rev:3a9d9f2, Date:06.12.2017 @ 11:08:40 UTC)
2018-01-19 12:38:01,367 INFO org.apache.flink.runtime.taskmanager.TaskManager - Hadoop version: 2.6.5
2018-01-19 12:38:01,367 INFO org.apache.flink.runtime.taskmanager.TaskManager - JVM Options:
2018-01-19 12:38:01,367 INFO org.apache.flink.runtime.taskmanager.TaskManager - -Xms16384M
2018-01-19 12:38:01,367 INFO org.apache.flink.runtime.taskmanager.TaskManager - -Xmx16384M
2018-01-19 12:38:01,367 INFO org.apache.flink.runtime.taskmanager.TaskManager - -XX:MaxDirectMemorySize=8388607T
2018-01-19 12:38:01,367 INFO org.apache.flink.runtime.taskmanager.TaskManager - -XX:+UseG1GC
2018-01-19 12:38:01,392 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.rpc.port, 6123
2018-01-19 12:38:01,392 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.heap.mb, 16384
2018-01-19 12:54:48,626 WARN akka.remote.RemoteWatcher - Detected unreachable: [akka.tcp://flink@<jm-host>:6123]
2018-01-19 12:54:48,690 INFO akka.remote.Remoting - Quarantined address [akka.tcp://flink@<jm-host>:6123] is still unreachable or has not been restarted. Keeping it quarantined.
2018-01-19 12:54:48,774 WARN akka.remote.Remoting - Tried to associate with unreachable remote address [akka.tcp://flink@<tm-host>:6123]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: [The remote system has a UID that has been quarantined. Association aborted.]
2018-01-19 12:54:48,833 WARN akka.remote.Remoting - Tried to associate with unreachable remote address [akka.tcp://flink@<tm-host>:6123]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: [The remote system has quarantined this system. No further associations to the remote system are possible until this system is restarted.]
<bunch of ERRORs on operations not shut down properly - assuming because JM is unreachable>
2018-01-19 12:56:51,244 INFO org.apache.flink.runtime.taskmanager.TaskManager - Trying to register at JobManager akka.tcp://flink@<jm-host>:6123/user/jobmanager (attempt 10, timeout: 30000 milliseconds)
2018-01-19 12:56:51,253 WARN akka.remote.Remoting - Tried to associate with unreachable remote address [akka.tcp://flink@<jm-host>:6123]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: [The remote system has quarantined this system. No further associations to the remote system are possible until this system is restarted.]
So the bottom line is that the JM and TM couldn't communicate under load, which is obviously not good. I tried bumping up akka.tcp.timeout as well, but it didn't help either. So my question is: after all processing is halted and no new data is being picked up, shouldn't this environment self-heal? Is there anything else I can look at other than extending timeouts?
Thanks,

Ashish
