Thanks for this message. We experience a very similar issue under heavy load. In the job manager logs we see AskTimeoutExceptions, which typically correlate with almost 100% CPU on a task manager. Even after the job is stopped, that task manager stays busy for minutes or even an hour, as if it were in a `saturation` mode. We run two task managers, and while one is running at 100% CPU the other runs at about 20%, so skew that overloads one task manager might be the cause.
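For reference, the AskTimeoutExceptions seem to come from Flink's Akka ask timeout, which can be raised in flink-conf.yaml. The values below are only ones we are experimenting with, not a recommendation (the defaults are 10 s for both keys):

    akka.ask.timeout: 60 s      # timeout for JM<->TM RPC asks; these fail with AskTimeoutException while a TM is CPU-saturated
    akka.lookup.timeout: 30 s   # timeout for actor lookups, e.g. while a TM tries to (re)register at the JM

Raising them only hides the symptom while a task manager is saturated, but it at least avoids spurious timeouts during shorter load spikes.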
Pawel

On 19 Jan 2018 18:28, "ashish pok" <ashish...@yahoo.com> wrote:
> Hi All,
>
> We have hit some load related issues and was wondering if anyone has some
> suggestions. We are noticing task managers and job managers being detached
> from each other under load and never really syncing up again. As a result,
> the Flink session shows 0 slots available for processing. Even though the
> apps are configured to restart, it isn't really helping as there are no
> slots available to run them.
>
> Here are excerpts from the logs that seemed relevant. (I am trimming out
> the rest of the logs for brevity.)
>
> *Job Manager:*
> 2018-01-19 12:38:00,423 INFO  org.apache.flink.runtime.jobmanager.JobManager - Starting JobManager (Version: 1.4.0, Rev:3a9d9f2, Date:06.12.2017 @ 11:08:40 UTC)
> 2018-01-19 12:38:00,792 INFO  org.apache.flink.runtime.jobmanager.JobManager - Maximum heap size: 16384 MiBytes
> 2018-01-19 12:38:00,794 INFO  org.apache.flink.runtime.jobmanager.JobManager - Hadoop version: 2.6.5
> 2018-01-19 12:38:00,794 INFO  org.apache.flink.runtime.jobmanager.JobManager - JVM Options:
> 2018-01-19 12:38:00,794 INFO  org.apache.flink.runtime.jobmanager.JobManager -     -Xms16384m
> 2018-01-19 12:38:00,794 INFO  org.apache.flink.runtime.jobmanager.JobManager -     -Xmx16384m
> 2018-01-19 12:38:00,795 INFO  org.apache.flink.runtime.jobmanager.JobManager -     -XX:+UseG1GC
>
> 2018-01-19 12:38:00,908 INFO  org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.rpc.port, 6123
> 2018-01-19 12:38:00,908 INFO  org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.heap.mb, 16384
>
> 2018-01-19 12:53:34,671 WARN  akka.remote.RemoteWatcher - Detected unreachable: [akka.tcp://flink@<jm-host>:37840]
> 2018-01-19 12:53:34,676 INFO  org.apache.flink.runtime.jobmanager.JobManager - Task manager akka.tcp://flink@<jm-host>:37840/user/taskmanager terminated.
>
> -- So once the Flink session boots up, we are hitting it with pretty heavy
> load, which typically results in the WARN above.
>
> *Task Manager:*
> 2018-01-19 12:38:01,002 INFO  org.apache.flink.runtime.taskmanager.TaskManager - Starting TaskManager (Version: 1.4.0, Rev:3a9d9f2, Date:06.12.2017 @ 11:08:40 UTC)
> 2018-01-19 12:38:01,367 INFO  org.apache.flink.runtime.taskmanager.TaskManager - Hadoop version: 2.6.5
> 2018-01-19 12:38:01,367 INFO  org.apache.flink.runtime.taskmanager.TaskManager - JVM Options:
> 2018-01-19 12:38:01,367 INFO  org.apache.flink.runtime.taskmanager.TaskManager -     -Xms16384M
> 2018-01-19 12:38:01,367 INFO  org.apache.flink.runtime.taskmanager.TaskManager -     -Xmx16384M
> 2018-01-19 12:38:01,367 INFO  org.apache.flink.runtime.taskmanager.TaskManager -     -XX:MaxDirectMemorySize=8388607T
> 2018-01-19 12:38:01,367 INFO  org.apache.flink.runtime.taskmanager.TaskManager -     -XX:+UseG1GC
>
> 2018-01-19 12:38:01,392 INFO  org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.rpc.port, 6123
> 2018-01-19 12:38:01,392 INFO  org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.heap.mb, 16384
>
> 2018-01-19 12:54:48,626 WARN  akka.remote.RemoteWatcher - Detected unreachable: [akka.tcp://flink@<jm-host>:6123]
> 2018-01-19 12:54:48,690 INFO  akka.remote.Remoting - Quarantined address [akka.tcp://flink@<jm-host>:6123] is still unreachable or has not been restarted. Keeping it quarantined.
> 2018-01-19 12:54:48,774 WARN  akka.remote.Remoting - Tried to associate with unreachable remote address [akka.tcp://flink@<tm-host>:6123]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: [The remote system has a UID that has been quarantined. Association aborted.]
> 2018-01-19 12:54:48,833 WARN  akka.remote.Remoting - Tried to associate with unreachable remote address [akka.tcp://flink@<tm-host>:6123]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: [The remote system has quarantined this system. No further associations to the remote system are possible until this system is restarted.]
> <bunch of ERRORs on operations not shut down properly - assuming because JM is unreachable>
>
> 2018-01-19 12:56:51,244 INFO  org.apache.flink.runtime.taskmanager.TaskManager - Trying to register at JobManager akka.tcp://flink@<jm-host>:6123/user/jobmanager (attempt 10, timeout: 30000 milliseconds)
> 2018-01-19 12:56:51,253 WARN  akka.remote.Remoting - Tried to associate with unreachable remote address [akka.tcp://flink@<jm-host>:6123]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: [The remote system has quarantined this system. No further associations to the remote system are possible until this system is restarted.]
>
> So the bottom line is, the JM and TM couldn't communicate under load, which
> is obviously not good. I tried to bump up akka.tcp.timeout as well but it
> didn't help either. So my question here is: after all processing is halted
> and there is no new data being picked up, shouldn't this environment
> self-heal? Any other things I can be looking at other than extending
> timeouts?
>
> Thanks,
>
> Ashish
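For what it's worth, the "Detected unreachable" and quarantine messages in the quoted logs come from Akka's DeathWatch failure detector, which has its own knobs in flink-conf.yaml. The values below are purely illustrative, and extending them is a workaround rather than a fix:

    akka.tcp.timeout: 60 s              # outbound connection timeout; the quoted mail already tried raising this (default 20 s)
    akka.watch.heartbeat.interval: 10 s # heartbeat interval of the DeathWatch between JM and TM
    akka.watch.heartbeat.pause: 120 s   # acceptable pause without heartbeats before a peer is marked unreachable (default 60 s)

Note also that, per the quoted TM log, once a remote system is quarantined no further associations are possible until that system is restarted, which would explain why the session does not self-heal after the load stops.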