Hi,

You should enable and check your garbage collection log.

We've encountered case where Task Manager disassociated due to long GC pause.


Regards,

Kien

On 1/20/2018 1:27 AM, ashish pok wrote:
Hi All,

We have hit some load related issues and was wondering if any one has some suggestions. We are noticing task managers and job managers being detached from each other under load and never really sync up again. As a result, Flink session shows 0 slots available for processing. Even though, apps are configured to restart it isn't really helping as there are no slots available to run the apps.


Here are excerpt from logs that seemed relevant. (I am trimming out rest of the logs for brevity)

*Job Manager:*
2018-01-19 12:38:00,423 INFO org.apache.flink.runtime.jobmanager.JobManager   -  Starting JobManager (Version: 1.4.0, Rev:3a9d9f2, Date:06.12.2017 @ 11:08:40 UTC)

2018-01-19 12:38:00,792 INFO org.apache.flink.runtime.jobmanager.JobManager   -  Maximum heap size: 16384 MiBytes 2018-01-19 12:38:00,794 INFO org.apache.flink.runtime.jobmanager.JobManager     -  Hadoop version: 2.6.5 2018-01-19 12:38:00,794 INFO org.apache.flink.runtime.jobmanager.JobManager     -  JVM Options: 2018-01-19 12:38:00,794 INFO org.apache.flink.runtime.jobmanager.JobManager     -     -Xms16384m 2018-01-19 12:38:00,794 INFO org.apache.flink.runtime.jobmanager.JobManager     -     -Xmx16384m 2018-01-19 12:38:00,795 INFO org.apache.flink.runtime.jobmanager.JobManager     -     -XX:+UseG1GC

2018-01-19 12:38:00,908 INFO org.apache.flink.configuration.GlobalConfiguration     - Loading configuration property: jobmanager.rpc.port, 6123 2018-01-19 12:38:00,908 INFO org.apache.flink.configuration.GlobalConfiguration     - Loading configuration property: jobmanager.heap.mb, 16384


2018-01-19 12:53:34,671 WARN  akka.remote.RemoteWatcher                                    - Detected unreachable: [akka.tcp://flink@<jm-host>:37840] 2018-01-19 12:53:34,676 INFO org.apache.flink.runtime.jobmanager.JobManager   - Task manager akka.tcp://flink@<jm-host>:37840/user/taskmanager terminated.

-- So once Flink session boots up, we are hitting it with pretty heavy load, which typically results in the WARN above

*Task Manager:*
2018-01-19 12:38:01,002 INFO org.apache.flink.runtime.taskmanager.TaskManager -  Starting TaskManager (Version: 1.4.0, Rev:3a9d9f2, Date:06.12.2017 @ 11:08:40 UTC) 2018-01-19 12:38:01,367 INFO org.apache.flink.runtime.taskmanager.TaskManager   -  Hadoop version: 2.6.5 2018-01-19 12:38:01,367 INFO org.apache.flink.runtime.taskmanager.TaskManager   -  JVM Options: 2018-01-19 12:38:01,367 INFO org.apache.flink.runtime.taskmanager.TaskManager   -     -Xms16384M 2018-01-19 12:38:01,367 INFO org.apache.flink.runtime.taskmanager.TaskManager   -     -Xmx16384M 2018-01-19 12:38:01,367 INFO org.apache.flink.runtime.taskmanager.TaskManager   -     -XX:MaxDirectMemorySize=8388607T 2018-01-19 12:38:01,367 INFO org.apache.flink.runtime.taskmanager.TaskManager   -     -XX:+UseG1GC

2018-01-19 12:38:01,392 INFO org.apache.flink.configuration.GlobalConfiguration     - Loading configuration property: jobmanager.rpc.port, 6123 2018-01-19 12:38:01,392 INFO org.apache.flink.configuration.GlobalConfiguration     - Loading configuration property: jobmanager.heap.mb, 16384


2018-01-19 12:54:48,626 WARN  akka.remote.RemoteWatcher                                    - Detected unreachable: [akka.tcp://flink@<jm-host>:6123] 2018-01-19 12:54:48,690 INFO  akka.remote.Remoting                                   - Quarantined address [akka.tcp://flink@<jm-host>:6123] is still unreachable or has not been restarted. Keeping it quarantined. 018-01-19 12:54:48,774 WARN  akka.remote.Remoting                                     - Tried to associate with unreachable remote address [akka.tcp://flink@<tm-host>:6123]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: [The remote system has a UID that has been quarantined. Association aborted.] 2018-01-19 12:54:48,833 WARN  akka.remote.Remoting                                     - Tried to associate with unreachable remote address [akka.tcp://flink@<tm-host>:6123]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: [The remote system has quarantined this system. No further associations to the remote system are possible until this system is restarted.] <bunch of ERRORs on operations not shutdown properly - assuming because JM is unreachable>

2018-01-19 12:56:51,244 INFO org.apache.flink.runtime.taskmanager.TaskManager       - Trying to register at JobManager akka.tcp://flink@<jm-host>:6123/user/jobmanager (attempt 10, timeout: 30000 milliseconds) 2018-01-19 12:56:51,253 WARN  akka.remote.Remoting                                       - Tried to associate with unreachable remote address [akka.tcp://flink@<jm-host>:6123]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: [The remote system has quarantined this system. No further associations to the remote system are possible until this system is restarted.]

So bottom line is, JM and TM couldn't communicate under load, which is obviously not good. I tried to bump up akka.tcp.timeout as well but it didnt help either. So my question here is after all processing is halted and there is no new data being picked up, shouldn't this environment self-heal? Any other things I can be looking at other than extending timeouts?

Thanks,

Ashish






Reply via email to