Task Manager detached under load

ashish pok Fri, 19 Jan 2018 10:28:41 -0800

Hi All,
We have hit some load-related issues and were wondering if anyone has 
suggestions. We are noticing task managers and job managers being detached from 
each other under load and never syncing up again. As a result, the Flink 
session shows 0 slots available for processing. Even though the apps are 
configured to restart, that doesn't help, as there are no slots available 
to run them.


Here are excerpts from the logs that seemed relevant. (I am trimming out the 
rest of the logs for brevity.)
Job Manager:
2018-01-19 12:38:00,423 INFO  org.apache.flink.runtime.jobmanager.JobManager                -  Starting JobManager (Version: 1.4.0, Rev:3a9d9f2, Date:06.12.2017 @ 11:08:40 UTC)
2018-01-19 12:38:00,792 INFO  org.apache.flink.runtime.jobmanager.JobManager                -  Maximum heap size: 16384 MiBytes
2018-01-19 12:38:00,794 INFO  org.apache.flink.runtime.jobmanager.JobManager                -  Hadoop version: 2.6.5
2018-01-19 12:38:00,794 INFO  org.apache.flink.runtime.jobmanager.JobManager                -  JVM Options:
2018-01-19 12:38:00,794 INFO  org.apache.flink.runtime.jobmanager.JobManager                -     -Xms16384m
2018-01-19 12:38:00,794 INFO  org.apache.flink.runtime.jobmanager.JobManager                -     -Xmx16384m
2018-01-19 12:38:00,795 INFO  org.apache.flink.runtime.jobmanager.JobManager                -     -XX:+UseG1GC
2018-01-19 12:38:00,908 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: jobmanager.rpc.port, 6123
2018-01-19 12:38:00,908 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: jobmanager.heap.mb, 16384


2018-01-19 12:53:34,671 WARN  akka.remote.RemoteWatcher                                     - Detected unreachable: [akka.tcp://flink@<jm-host>:37840]
2018-01-19 12:53:34,676 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Task manager akka.tcp://flink@<jm-host>:37840/user/taskmanager terminated.
-- Once the Flink session boots up, we hit it with a pretty heavy load, 
which typically results in the WARN above.

Task Manager:
2018-01-19 12:38:01,002 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  Starting TaskManager (Version: 1.4.0, Rev:3a9d9f2, Date:06.12.2017 @ 11:08:40 UTC)
2018-01-19 12:38:01,367 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  Hadoop version: 2.6.5
2018-01-19 12:38:01,367 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  JVM Options:
2018-01-19 12:38:01,367 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     -Xms16384M
2018-01-19 12:38:01,367 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     -Xmx16384M
2018-01-19 12:38:01,367 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     -XX:MaxDirectMemorySize=8388607T
2018-01-19 12:38:01,367 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     -XX:+UseG1GC
2018-01-19 12:38:01,392 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: jobmanager.rpc.port, 6123
2018-01-19 12:38:01,392 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: jobmanager.heap.mb, 16384

2018-01-19 12:54:48,626 WARN  akka.remote.RemoteWatcher                                     - Detected unreachable: [akka.tcp://flink@<jm-host>:6123]
2018-01-19 12:54:48,690 INFO  akka.remote.Remoting                                          - Quarantined address [akka.tcp://flink@<jm-host>:6123] is still unreachable or has not been restarted. Keeping it quarantined.
2018-01-19 12:54:48,774 WARN  akka.remote.Remoting                                          - Tried to associate with unreachable remote address [akka.tcp://flink@<tm-host>:6123]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: [The remote system has a UID that has been quarantined. Association aborted.]
2018-01-19 12:54:48,833 WARN  akka.remote.Remoting                                          - Tried to associate with unreachable remote address [akka.tcp://flink@<tm-host>:6123]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: [The remote system has quarantined this system. No further associations to the remote system are possible until this system is restarted.]
<bunch of ERRORs on operations not shutdown properly - assuming because JM is unreachable>
2018-01-19 12:56:51,244 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Trying to register at JobManager akka.tcp://flink@<jm-host>:6123/user/jobmanager (attempt 10, timeout: 30000 milliseconds)
2018-01-19 12:56:51,253 WARN  akka.remote.Remoting                                          - Tried to associate with unreachable remote address [akka.tcp://flink@<jm-host>:6123]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: [The remote system has quarantined this system. No further associations to the remote system are possible until this system is restarted.]

So the bottom line is that the JM and TM couldn't communicate under load, which 
is obviously not good. I tried bumping up akka.tcp.timeout as well, but it 
didn't help. So my question is: once all processing has halted and no new data 
is being picked up, shouldn't this environment self-heal? Is there anything 
else I can look at other than extending timeouts?
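
For reference, a rough sketch of where the timeouts in question live in 
flink-conf.yaml. Only akka.tcp.timeout is the setting mentioned above; the 
other keys are Flink's documented Akka ask-timeout and death-watch heartbeat 
options, listed here on the assumption that they are the relevant ones for 
"Detected unreachable" warnings. All values are illustrative placeholders, not 
the values used on this cluster and not a recommendation.

```yaml
# flink-conf.yaml (sketch; every value below is an illustrative assumption)

# The setting already tried above: timeout for outbound TCP connections.
akka.tcp.timeout: 60 s

# Timeout for ask-style RPC calls between JobManager and TaskManager.
akka.ask.timeout: 60 s

# Death-watch heartbeats: how often they are sent, and how long a pause is
# tolerated before the remote side is marked unreachable (and quarantined).
akka.watch.heartbeat.interval: 10 s
akka.watch.heartbeat.pause: 120 s
```

Note that these only make detection more tolerant of long pauses; per the 
Remoting messages above, a system that has already been quarantined is not 
re-associated until it is restarted.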
Thanks,
Ashish
