Hi,
You should enable garbage collection logging and check the log.
We've encountered cases where a Task Manager was disassociated due to a long GC
pause.
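As a rough sketch of how to check for this: assuming a Java 8 HotSpot JVM
started with -XX:+PrintGCApplicationStoppedTime and -Xloggc:<path> (for
example via env.java.opts in flink-conf.yaml), the GC log contains "Total
time for which application threads were stopped" lines. Something like the
following could flag the long ones. The function name find_long_pauses and
the 10-second default threshold are made up for illustration; pick a
threshold that matches your own heartbeat settings.

```python
import re

# Assumed log format (Java 8 HotSpot with -XX:+PrintGCApplicationStoppedTime):
# "<timestamp>: Total time for which application threads were stopped:
#  12.3456789 seconds, Stopping threads took: 0.0001234 seconds"
STOPPED_RE = re.compile(
    r"Total time for which application threads were stopped: ([\d.]+) seconds"
)


def find_long_pauses(lines, threshold_secs=10.0):
    """Return (line_number, pause_seconds) for each pause above the threshold.

    `lines` is any iterable of strings, e.g. an open GC log file.
    The name and the default threshold are hypothetical, not from Flink.
    """
    hits = []
    for line_number, line in enumerate(lines, start=1):
        match = STOPPED_RE.search(line)
        if match is None:
            continue  # not a "threads were stopped" line
        pause_secs = float(match.group(1))
        if pause_secs > threshold_secs:
            hits.append((line_number, pause_secs))
    return hits
```

For example, find_long_pauses(open("gc.log")) would list every stop longer
than 10 seconds in a log file named gc.log (the file name is a placeholder).
If any of those pauses line up with the "Detected unreachable" timestamps,
GC is a likely suspect.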
Regards,
Kien
On 1/20/2018 1:27 AM, ashish pok wrote:
Hi All,
We have hit some load-related issues and were wondering if anyone has
some suggestions. We are noticing task managers and job managers being
detached from each other under load and never really syncing up again. As
a result, the Flink session shows 0 slots available for processing. Even
though the apps are configured to restart, that doesn't really help, as
there are no slots available to run them.
Here are excerpts from the logs that seemed relevant. (I am trimming out
the rest of the logs for brevity.)
*Job Manager:*
2018-01-19 12:38:00,423 INFO
org.apache.flink.runtime.jobmanager.JobManager - Starting
JobManager (Version: 1.4.0, Rev:3a9d9f2, Date:06.12.2017 @ 11:08:40 UTC)
2018-01-19 12:38:00,792 INFO
org.apache.flink.runtime.jobmanager.JobManager - Maximum heap size:
16384 MiBytes
2018-01-19 12:38:00,794 INFO
org.apache.flink.runtime.jobmanager.JobManager - Hadoop version:
2.6.5
2018-01-19 12:38:00,794 INFO
org.apache.flink.runtime.jobmanager.JobManager - JVM Options:
2018-01-19 12:38:00,794 INFO
org.apache.flink.runtime.jobmanager.JobManager - -Xms16384m
2018-01-19 12:38:00,794 INFO
org.apache.flink.runtime.jobmanager.JobManager - -Xmx16384m
2018-01-19 12:38:00,795 INFO
org.apache.flink.runtime.jobmanager.JobManager - -XX:+UseG1GC
2018-01-19 12:38:00,908 INFO
org.apache.flink.configuration.GlobalConfiguration - Loading
configuration property: jobmanager.rpc.port, 6123
2018-01-19 12:38:00,908 INFO
org.apache.flink.configuration.GlobalConfiguration - Loading
configuration property: jobmanager.heap.mb, 16384
2018-01-19 12:53:34,671 WARN akka.remote.RemoteWatcher
- Detected unreachable:
[akka.tcp://flink@<jm-host>:37840]
2018-01-19 12:53:34,676 INFO
org.apache.flink.runtime.jobmanager.JobManager - Task manager
akka.tcp://flink@<jm-host>:37840/user/taskmanager terminated.
-- So once the Flink session boots up, we hit it with a pretty heavy
load, which typically results in the WARN above.
*Task Manager:*
2018-01-19 12:38:01,002 INFO
org.apache.flink.runtime.taskmanager.TaskManager - Starting
TaskManager (Version: 1.4.0, Rev:3a9d9f2, Date:06.12.2017 @ 11:08:40 UTC)
2018-01-19 12:38:01,367 INFO
org.apache.flink.runtime.taskmanager.TaskManager - Hadoop version:
2.6.5
2018-01-19 12:38:01,367 INFO
org.apache.flink.runtime.taskmanager.TaskManager - JVM Options:
2018-01-19 12:38:01,367 INFO
org.apache.flink.runtime.taskmanager.TaskManager - -Xms16384M
2018-01-19 12:38:01,367 INFO
org.apache.flink.runtime.taskmanager.TaskManager - -Xmx16384M
2018-01-19 12:38:01,367 INFO
org.apache.flink.runtime.taskmanager.TaskManager -
-XX:MaxDirectMemorySize=8388607T
2018-01-19 12:38:01,367 INFO
org.apache.flink.runtime.taskmanager.TaskManager - -XX:+UseG1GC
2018-01-19 12:38:01,392 INFO
org.apache.flink.configuration.GlobalConfiguration - Loading
configuration property: jobmanager.rpc.port, 6123
2018-01-19 12:38:01,392 INFO
org.apache.flink.configuration.GlobalConfiguration - Loading
configuration property: jobmanager.heap.mb, 16384
2018-01-19 12:54:48,626 WARN akka.remote.RemoteWatcher
- Detected unreachable:
[akka.tcp://flink@<jm-host>:6123]
2018-01-19 12:54:48,690 INFO akka.remote.Remoting
- Quarantined address [akka.tcp://flink@<jm-host>:6123]
is still unreachable or has not been restarted. Keeping it quarantined.
2018-01-19 12:54:48,774 WARN akka.remote.Remoting
- Tried to associate with unreachable remote address
[akka.tcp://flink@<tm-host>:6123]. Address is now gated for 5000 ms,
all messages to this address will be delivered to dead letters.
Reason: [The remote system has a UID that has been quarantined.
Association aborted.]
2018-01-19 12:54:48,833 WARN akka.remote.Remoting
- Tried to associate with unreachable remote address
[akka.tcp://flink@<tm-host>:6123]. Address is now gated for 5000 ms,
all messages to this address will be delivered to dead letters.
Reason: [The remote system has quarantined this system. No further
associations to the remote system are possible until this system is
restarted.]
<bunch of ERRORs on operations not shut down properly - assuming
because JM is unreachable>
2018-01-19 12:56:51,244 INFO
org.apache.flink.runtime.taskmanager.TaskManager - Trying to
register at JobManager akka.tcp://flink@<jm-host>:6123/user/jobmanager
(attempt 10, timeout: 30000 milliseconds)
2018-01-19 12:56:51,253 WARN akka.remote.Remoting
- Tried to associate with unreachable remote address
[akka.tcp://flink@<jm-host>:6123]. Address is now gated for 5000 ms,
all messages to this address will be delivered to dead letters.
Reason: [The remote system has quarantined this system. No further
associations to the remote system are possible until this system is
restarted.]
So the bottom line is that the JM and TM couldn't communicate under load,
which is obviously not good. I tried bumping up akka.tcp.timeout as well,
but it didn't help either. So my question is: after all processing has
halted and no new data is being picked up, shouldn't this
environment self-heal? Is there anything else I can look at other than
extending timeouts?
Thanks,
Ashish