The TaskManagers were nixed by the OOM killer.

[63896.699500] Out of memory: Kill process 12892 (java) score 910 or sacrifice child
[63896.702018] Killed process 12892 (java) total-vm:47398740kB, anon-rss:28487812kB, file-rss:8kB
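For reference, a rough Python sketch of the memory arithmetic in this message. It parses the kernel's kill line and adds up the TaskManager settings quoted in this message (heap size, segment size, number of network buffers). Assumptions that are not from the thread: the kernel's "kB" counts 1024 bytes, taskmanager.heap.mb counts MiB, GB means 10^9 bytes, and the helper names are made up for illustration.

```python
import re

# Kernel message quoted in this thread.
KILL_LINE = ("Killed process 12892 (java) total-vm:47398740kB, "
             "anon-rss:28487812kB, file-rss:8kB")

# TaskManager settings quoted in this thread.
HEAP_MB = 18000               # taskmanager.heap.mb
SEGMENT_SIZE = 16384          # taskmanager.memory.segment-size, in bytes
NUMBER_OF_BUFFERS = 414720    # taskmanager.network.numberOfBuffers

GIB = 1024 ** 3
NUMA_NODE_BYTES = 60 * GIB // 2   # 60 GiB split across two NUMA nodes


def parse_kill_line(line):
    """Return the kB counters (total-vm, anon-rss, file-rss) as integers."""
    return {name: int(kb) for name, kb in re.findall(r"([\w-]+):(\d+)kB", line)}


def network_buffer_bytes(number_of_buffers, segment_size):
    """Memory held by the network buffers when every buffer is allocated."""
    return number_of_buffers * segment_size


counters = parse_kill_line(KILL_LINE)
rss_bytes = counters["anon-rss"] * 1024      # assumption: kB means 1024 bytes

heap_bytes = HEAP_MB * 1024 * 1024           # assumption: heap.mb is in MiB
network_bytes = network_buffer_bytes(NUMBER_OF_BUFFERS, SEGMENT_SIZE)
# Bound quoted from FLINK-2865: at most 2 * (network memory).
worst_case_bytes = heap_bytes + 2 * network_bytes

for label, value in [
    ("anon-rss at kill", rss_bytes),
    ("heap", heap_bytes),
    ("network buffers", network_bytes),
    ("heap + 2 * network", worst_case_bytes),
    ("one NUMA node", NUMA_NODE_BYTES),
]:
    print(f"{label:>20}: {value / 1e9:6.2f} GB")
```

Under these assumptions the heap plus twice the network buffers comes out slightly larger than one NUMA node's share of the instance memory, which is consistent with the "exceeds the node memory" remark in this message. If the heap were counted as 18 * 10^9 bytes instead, the sum would land just under that share, so the margin is thin either way.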
The cluster consists of AWS c4.8xlarge instances, which have 60 GiB of memory across two NUMA nodes (~32.2 GB each). Pertinent TaskManager configuration:

taskmanager.memory.off-heap: true
taskmanager.memory.segment-size: 16384
taskmanager.heap.mb: 18000
taskmanager.network.numberOfBuffers: 414720

This was allocating 18 GB plus up to 6.8 GB for network buffers. As Max noted in FLINK-2865, "I think the maximum number of network memory can never exceed 2 * (network memory). In this case all network buffers would be inside the Netty buffer pool." Doubling the 6.8 GB exceeds the node memory.

Would disabling off-heap memory cause the network buffers to be re-used by Netty and save half of the network buffer memory?

I created FLINK-3164 which would reduce the number of necessary network buffers.

Greg Hogan

On Fri, Oct 30, 2015 at 12:33 PM, Till Rohrmann <trohrm...@apache.org> wrote:

> The logging of the TaskManager stops 3 seconds before the JobManager
> detects that the connection to the TaskManager has failed. If the clocks are
> remotely in sync and the TaskManager is still running, then we should also
> see logging statements for the time after the connection has failed.
> Therefore, I would also suspect that something happened to the TaskManager
> JVM.
>
> Cheers,
> Till
>
> On Fri, Oct 30, 2015 at 3:43 AM, Robert Metzger <rmetz...@apache.org>
> wrote:
>
> > So is the TaskManager JVM still running after the JM detected that the TM
> > has gone?
> >
> > If not, can you check the kernel log (dmesg) to see whether Linux OOM
> > killer stopped the process? (if it's a kill, the JVM might not be able to
> > log anything anymore)
> >
> > On Thu, Oct 29, 2015 at 9:27 PM, Stephan Ewen <se...@apache.org> wrote:
> >
> > > Thanks for sharing the logs, Greg!
> > >
> > > Okay, so the TaskManager does not crash, but the Remote Failure Detector of
> > > Akka marks the connection between JobManager and TaskManager as broken.
> > >
> > > The TaskManager is not doing much GC, so it is not a long JVM freeze that
> > > causes heartbeats to time out...
> > >
> > > I am wondering at this point whether this is an issue in Akka, specifically
> > > the remote death watch that we use to let the JobManager recognize
> > > disconnected TaskManagers.
> > >
> > > One thing you could try is actually to comment out the line where the
> > > JobManager starts the death watch for the TaskManager and see if they can
> > > still successfully exchange messages (tasks finished, find inputs,
> > > schedule) and the program completes. That would indicate that the Akka
> > > Death Watch is flawed and that we should probably do our own heartbeats
> > > instead.
> > >
> > > Greetings,
> > > Stephan
> > >
> > >
> > > On Thu, Oct 29, 2015 at 11:44 AM, Aljoscha Krettek <aljos...@apache.org>
> > > wrote:
> > >
> > > > Could it be a problem that there are two TaskManagers running per
> > > > machine?
> > > >
> > > > > On 29 Oct 2015, at 19:04, Greg Hogan <c...@greghogan.com> wrote:
> > > > >
> > > > > I have memory logging enabled.
> > > > > Tail of TaskManager log on 10.0.88.140:
> > > > >
> > > > > 17:35:26,415 INFO org.apache.flink.runtime.taskmanager.TaskManager - Garbage collector stats: [PS Scavenge, GC TIME (ms): 341, GC COUNT: 3], [PS MarkSweep, GC TIME (ms): 974, GC COUNT: 1]
> > > > > 17:35:27,415 INFO org.apache.flink.runtime.taskmanager.TaskManager - Memory usage stats: [HEAP: 576/1917/1917 MB, NON HEAP: 56/58/-1 MB (used/committed/max)]
> > > > > 17:35:27,415 INFO org.apache.flink.runtime.taskmanager.TaskManager - Off-heap pool stats: [Code Cache: 18/19/240 MB (used/committed/max)], [Metaspace: 34/34/-1 MB (used/committed/max)], [Compressed Class Space: 4/4/1024 MB (used/committed/max)]
> > > > > 17:35:27,415 INFO org.apache.flink.runtime.taskmanager.TaskManager - Garbage collector stats: [PS Scavenge, GC TIME (ms): 341, GC COUNT: 3], [PS MarkSweep, GC TIME (ms): 974, GC COUNT: 1]
> > > > > 17:35:28,012 INFO org.apache.flink.runtime.taskmanager.TaskManager - Received task CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (938/2322)
> > > > > 17:35:28,015 INFO org.apache.flink.runtime.taskmanager.Task - Loading JAR files for task CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (938/2322)
> > > > > 17:35:28,016 INFO org.apache.flink.runtime.taskmanager.Task - Registering task at network: CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (938/2322) [DEPLOYING]
> > > > > 17:35:28,065 INFO org.apache.flink.runtime.taskmanager.Task - CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (938/2322) switched to RUNNING
> > > > > 17:35:28,100 INFO org.apache.flink.runtime.taskmanager.TaskManager - Received task CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (2304/2322)
> > > > > 17:35:28,116 INFO org.apache.flink.runtime.taskmanager.Task - Loading JAR files for task CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (2304/2322)
> > > > > 17:35:28,116 INFO org.apache.flink.runtime.taskmanager.Task - Registering task at network: CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (2304/2322) [DEPLOYING]
> > > > > 17:35:28,132 INFO org.apache.flink.runtime.taskmanager.Task - CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (2304/2322) switched to RUNNING
> > > > > 17:35:28,255 INFO org.apache.flink.runtime.taskmanager.TaskManager - Received task CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (939/2322)
> > > > > 17:35:28,263 INFO org.apache.flink.runtime.taskmanager.Task - Loading JAR files for task CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (939/2322)
> > > > > 17:35:28,263 INFO org.apache.flink.runtime.taskmanager.Task - Registering task at network: CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (939/2322) [DEPLOYING]
> > > > > 17:35:28,304 INFO org.apache.flink.runtime.taskmanager.TaskManager - Received task CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (2062/2322)
> > > > > 17:35:28,311 INFO org.apache.flink.runtime.taskmanager.Task - Loading JAR files for task CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (2062/2322)
> > > > > 17:35:28,311 INFO org.apache.flink.runtime.taskmanager.Task - Registering task at network: CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (2062/2322) [DEPLOYING]
> > > > > 17:35:28,323 INFO org.apache.flink.runtime.taskmanager.Task - CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (939/2322) switched to RUNNING
> > > > > 17:35:28,386 INFO org.apache.flink.runtime.taskmanager.Task - CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (2062/2322) switched to RUNNING
> > > > > 17:35:28,396 INFO org.apache.flink.runtime.taskmanager.TaskManager - Received task CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (1775/2322)
> > > > > 17:35:28,401 INFO org.apache.flink.runtime.taskmanager.Task - Loading JAR files for task CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (1775/2322)
> > > > > 17:35:28,402 INFO org.apache.flink.runtime.taskmanager.Task - Registering task at network: CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (1775/2322) [DEPLOYING]
> > > > > 17:35:28,416 INFO org.apache.flink.runtime.taskmanager.TaskManager - Memory usage stats: [HEAP: 747/1917/1917 MB, NON HEAP: 56/58/-1 MB (used/committed/max)]
> > > > > 17:35:28,416 INFO org.apache.flink.runtime.taskmanager.TaskManager - Off-heap pool stats: [Code Cache: 18/19/240 MB (used/committed/max)], [Metaspace: 34/34/-1 MB (used/committed/max)], [Compressed Class Space: 4/4/1024 MB (used/committed/max)]
> > > > > 17:35:28,416 INFO org.apache.flink.runtime.taskmanager.TaskManager - Garbage collector stats: [PS Scavenge, GC TIME (ms): 341, GC COUNT: 3], [PS MarkSweep, GC TIME (ms): 974, GC COUNT: 1]
> > > > > 17:35:28,419 INFO org.apache.flink.runtime.taskmanager.Task - CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (1775/2322) switched to RUNNING
> > > > > 17:35:28,475 INFO org.apache.flink.runtime.taskmanager.TaskManager - Received task CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (2158/2322)
> > > > > 17:35:28,475 INFO org.apache.flink.runtime.taskmanager.Task - Loading JAR files for task CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (2158/2322)
> > > > > 17:35:28,476 INFO org.apache.flink.runtime.taskmanager.Task - Registering task at network: CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (2158/2322) [DEPLOYING]
> > > > > 17:35:28,509 INFO org.apache.flink.runtime.taskmanager.TaskManager - Received task CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (1463/2322)
> > > > > 17:35:28,860 INFO org.apache.flink.runtime.taskmanager.Task - Loading JAR files for task CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (1463/2322)
> > > > > 17:35:28,861 INFO org.apache.flink.runtime.taskmanager.Task - Registering task at network: CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (1463/2322) [DEPLOYING]
> > > > > 17:35:28,862 INFO org.apache.flink.runtime.taskmanager.Task - CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (2158/2322) switched to RUNNING
> > > > > 17:35:28,878 INFO org.apache.flink.runtime.taskmanager.Task - CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (1463/2322) switched to RUNNING
> > > > > 17:35:28,892 INFO org.apache.flink.runtime.taskmanager.TaskManager - Received task CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (1154/2322)
> > > > > 17:35:28,893 INFO org.apache.flink.runtime.taskmanager.Task - Loading JAR files for task CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (1154/2322)
> > > > > 17:35:28,893 INFO org.apache.flink.runtime.taskmanager.Task - Registering task at network: CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (1154/2322) [DEPLOYING]
> > > > > 17:35:28,914 INFO org.apache.flink.runtime.taskmanager.Task - CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (1154/2322) switched to RUNNING
> > > > > 17:35:28,916 INFO org.apache.flink.runtime.taskmanager.TaskManager - Received task CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (1429/2322)
> > > > > 17:35:28,917 INFO org.apache.flink.runtime.taskmanager.Task - Loading JAR files for task CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (1429/2322)
> > > > > 17:35:28,917 INFO org.apache.flink.runtime.taskmanager.Task - Registering task at network: CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (1429/2322) [DEPLOYING]
> > > > > 17:35:28,942 INFO org.apache.flink.runtime.taskmanager.TaskManager - Received task CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (1078/2322)
> > > > > 17:35:28,942 INFO org.apache.flink.runtime.taskmanager.Task - Loading JAR files for task CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (1078/2322)
> > > > > 17:35:28,942 INFO org.apache.flink.runtime.taskmanager.Task - Registering task at network: CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (1078/2322) [DEPLOYING]
> > > > > 17:35:28,943 INFO org.apache.flink.runtime.taskmanager.Task - CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (1429/2322) switched to RUNNING
> > > > > 17:35:28,955 INFO org.apache.flink.runtime.taskmanager.Task - CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (1078/2322) switched to RUNNING
> > > > > 17:35:28,959 INFO org.apache.flink.runtime.taskmanager.TaskManager - Received task CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (524/2322)
> > > > > 17:35:28,995 INFO org.apache.flink.runtime.taskmanager.Task - Loading JAR files for task CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (524/2322)
> > > > > 17:35:28,995 INFO org.apache.flink.runtime.taskmanager.Task - Registering task at network: CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (524/2322) [DEPLOYING]
> > > > > 17:35:29,000 INFO org.apache.flink.runtime.taskmanager.TaskManager - Received task CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (2021/2322)
> > > > > 17:35:29,000 INFO org.apache.flink.runtime.taskmanager.Task - Loading JAR files for task CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (2021/2322)
> > > > > 17:35:29,000 INFO org.apache.flink.runtime.taskmanager.Task - Registering task at network: CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (2021/2322) [DEPLOYING]
> > > > > 17:35:29,012 INFO org.apache.flink.runtime.taskmanager.Task - CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (524/2322) switched to RUNNING
> > > > > 17:35:29,039 INFO org.apache.flink.runtime.taskmanager.TaskManager - Received task CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (2022/2322)
> > > > > 17:35:29,039 INFO org.apache.flink.runtime.taskmanager.Task - CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (2021/2322) switched to RUNNING
> > > > > 17:35:29,043 INFO org.apache.flink.runtime.taskmanager.Task - Loading JAR files for task CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (2022/2322)
> > > > > 17:35:29,043 INFO org.apache.flink.runtime.taskmanager.Task - Registering task at network: CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (2022/2322) [DEPLOYING]
> > > > > 17:35:29,076 INFO org.apache.flink.runtime.taskmanager.TaskManager - Received task CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (1464/2322)
> > > > > 17:35:29,081 INFO org.apache.flink.runtime.taskmanager.Task - Loading JAR files for task CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (1464/2322)
> > > > > 17:35:29,081 INFO org.apache.flink.runtime.taskmanager.Task - Registering task at network: CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (1464/2322) [DEPLOYING]
> > > > > 17:35:29,095 INFO org.apache.flink.runtime.taskmanager.Task - CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (2022/2322) switched to RUNNING
> > > > > 17:35:29,108 INFO org.apache.flink.runtime.taskmanager.TaskManager - Received task CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (1095/2322)
> > > > > 17:35:29,110 INFO org.apache.flink.runtime.taskmanager.Task - CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (1464/2322) switched to RUNNING
> > > > > 17:35:29,112 INFO org.apache.flink.runtime.taskmanager.Task - Loading JAR files for task CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (1095/2322)
> > > > > 17:35:29,112 INFO org.apache.flink.runtime.taskmanager.Task - Registering task at network: CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (1095/2322) [DEPLOYING]
> > > > > 17:35:29,140 INFO org.apache.flink.runtime.taskmanager.TaskManager - Received task CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (2306/2322)
> > > > > 17:35:29,142 INFO org.apache.flink.runtime.taskmanager.Task - Loading JAR files for task CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (2306/2322)
> > > > > 17:35:29,142 INFO org.apache.flink.runtime.taskmanager.Task - Registering task at network: CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (2306/2322) [DEPLOYING]
> > > > > 17:35:29,147 INFO org.apache.flink.runtime.taskmanager.Task - CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (1095/2322) switched to RUNNING
> > > > > 17:35:29,152 INFO org.apache.flink.runtime.taskmanager.TaskManager - Received task CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (974/2322)
> > > > > 17:35:29,153 INFO org.apache.flink.runtime.taskmanager.Task - CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (2306/2322) switched to RUNNING
> > > > > 17:35:29,155 INFO org.apache.flink.runtime.taskmanager.Task - Loading JAR files for task CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (974/2322)
> > > > > 17:35:29,155 INFO org.apache.flink.runtime.taskmanager.Task - Registering task at network: CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (974/2322) [DEPLOYING]
> > > > > 17:35:29,166 INFO org.apache.flink.runtime.taskmanager.TaskManager - Received task CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (2305/2322)
> > > > > 17:35:29,167 INFO org.apache.flink.runtime.taskmanager.Task - Loading JAR files for task CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (2305/2322)
> > > > > 17:35:29,167 INFO org.apache.flink.runtime.taskmanager.Task - Registering task at network: CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (2305/2322) [DEPLOYING]
> > > > > 17:35:29,176 INFO org.apache.flink.runtime.taskmanager.Task - CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (974/2322) switched to RUNNING
> > > > > 17:35:29,205 INFO org.apache.flink.runtime.taskmanager.Task - CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (2305/2322) switched to RUNNING
> > > > > 17:35:29,417 INFO org.apache.flink.runtime.taskmanager.TaskManager - Memory usage stats: [HEAP: 590/1917/1917 MB, NON HEAP: 57/59/-1 MB (used/committed/max)]
> > > > > 17:35:29,417 INFO org.apache.flink.runtime.taskmanager.TaskManager - Off-heap pool stats: [Code Cache: 18/19/240 MB (used/committed/max)], [Metaspace: 34/34/-1 MB (used/committed/max)], [Compressed Class Space: 4/4/1024 MB (used/committed/max)]
> > > > > 17:35:29,417 INFO org.apache.flink.runtime.taskmanager.TaskManager - Garbage collector stats: [PS Scavenge, GC TIME (ms): 689, GC COUNT: 4], [PS MarkSweep, GC TIME (ms): 974, GC COUNT: 1]
> > > > > 17:35:30,418 INFO org.apache.flink.runtime.taskmanager.TaskManager - Memory usage stats: [HEAP: 614/1917/1917 MB, NON HEAP: 57/59/-1 MB (used/committed/max)]
> > > > > 17:35:30,418 INFO org.apache.flink.runtime.taskmanager.TaskManager - Off-heap pool stats: [Code Cache: 18/19/240 MB (used/committed/max)], [Metaspace: 34/34/-1 MB (used/committed/max)], [Compressed Class Space: 4/4/1024 MB (used/committed/max)]
> > > > > 17:35:30,418 INFO org.apache.flink.runtime.taskmanager.TaskManager - Garbage collector stats: [PS Scavenge, GC TIME (ms): 689, GC COUNT: 4], [PS MarkSweep, GC TIME (ms): 974, GC COUNT: 1]
> > > > > 17:35:31,418 INFO org.apache.flink.runtime.taskmanager.TaskManager - Memory usage stats: [HEAP: 634/1917/1917 MB, NON HEAP: 57/59/-1 MB (used/committed/max)]
> > > > > 17:35:31,418 INFO org.apache.flink.runtime.taskmanager.TaskManager - Off-heap pool stats: [Code Cache: 19/19/240 MB (used/committed/max)], [Metaspace: 34/34/-1 MB (used/committed/max)], [Compressed Class Space: 4/4/1024 MB (used/committed/max)]
> > > > > 17:35:31,419 INFO org.apache.flink.runtime.taskmanager.TaskManager - Garbage collector stats: [PS Scavenge, GC TIME (ms): 689, GC COUNT: 4], [PS MarkSweep, GC TIME (ms): 974, GC COUNT: 1]
> > > > > 17:35:32,419 INFO org.apache.flink.runtime.taskmanager.TaskManager - Memory usage stats: [HEAP: 638/1917/1917 MB, NON HEAP: 57/59/-1 MB (used/committed/max)]
> > > > > 17:35:32,419 INFO org.apache.flink.runtime.taskmanager.TaskManager - Off-heap pool stats: [Code Cache: 19/19/240 MB (used/committed/max)], [Metaspace: 34/34/-1 MB (used/committed/max)], [Compressed Class Space: 4/4/1024 MB (used/committed/max)]
> > > > > 17:35:32,419 INFO org.apache.flink.runtime.taskmanager.TaskManager - Garbage collector stats: [PS Scavenge, GC TIME (ms): 689, GC COUNT: 4], [PS MarkSweep, GC TIME (ms): 974, GC COUNT: 1]
> > > > > 17:35:33,487 INFO org.apache.flink.runtime.taskmanager.TaskManager - Memory usage stats: [HEAP: 648/1917/1917 MB, NON HEAP: 58/59/-1 MB (used/committed/max)]
> > > > > 17:35:33,494 INFO org.apache.flink.runtime.taskmanager.TaskManager - Off-heap pool stats: [Code Cache: 19/19/240 MB (used/committed/max)], [Metaspace: 34/34/-1 MB (used/committed/max)], [Compressed Class Space: 4/4/1024 MB (used/committed/max)]
> > > > > 17:35:33,522 INFO org.apache.flink.runtime.taskmanager.TaskManager - Garbage collector stats: [PS Scavenge, GC TIME (ms): 689, GC COUNT: 4], [PS MarkSweep, GC TIME (ms): 974, GC COUNT: 1]
> > > > > 17:35:34,523 INFO org.apache.flink.runtime.taskmanager.TaskManager - Memory usage stats: [HEAP: 662/1917/1917 MB, NON HEAP: 58/59/-1 MB (used/committed/max)]
> > > > > 17:35:34,523 INFO org.apache.flink.runtime.taskmanager.TaskManager - Off-heap pool stats: [Code Cache: 19/19/240 MB (used/committed/max)], [Metaspace: 34/34/-1 MB (used/committed/max)], [Compressed Class Space: 4/4/1024 MB (used/committed/max)]
> > > > > 17:35:34,523 INFO org.apache.flink.runtime.taskmanager.TaskManager - Garbage collector stats: [PS Scavenge, GC TIME (ms): 689, GC COUNT: 4], [PS MarkSweep, GC TIME (ms): 974, GC COUNT: 1]
> > > > > 17:35:35,523 INFO org.apache.flink.runtime.taskmanager.TaskManager - Memory usage stats: [HEAP: 670/1917/1917 MB, NON HEAP: 58/59/-1 MB (used/committed/max)]
> > > > > 17:35:35,524 INFO org.apache.flink.runtime.taskmanager.TaskManager - Off-heap pool stats: [Code Cache: 19/19/240 MB (used/committed/max)], [Metaspace: 34/34/-1 MB (used/committed/max)], [Compressed Class Space: 4/4/1024 MB (used/committed/max)]
> > > > > 17:35:35,524 INFO org.apache.flink.runtime.taskmanager.TaskManager - Garbage collector stats: [PS Scavenge, GC TIME (ms): 689, GC COUNT: 4], [PS MarkSweep, GC TIME (ms): 974, GC COUNT: 1]
> > > > > 17:35:36,525 INFO org.apache.flink.runtime.taskmanager.TaskManager - Memory usage stats: [HEAP: 717/1917/1917 MB, NON HEAP: 58/59/-1 MB (used/committed/max)]
> > > > > 17:35:36,525 INFO org.apache.flink.runtime.taskmanager.TaskManager - Off-heap pool stats: [Code Cache: 19/19/240 MB (used/committed/max)], [Metaspace: 34/34/-1 MB (used/committed/max)], [Compressed Class Space: 4/4/1024 MB (used/committed/max)]
> > > > > 17:35:36,525 INFO org.apache.flink.runtime.taskmanager.TaskManager - Garbage collector stats: [PS Scavenge, GC TIME (ms): 689, GC COUNT: 4], [PS MarkSweep, GC TIME (ms): 974, GC COUNT: 1]
> > > > > 17:35:37,525 INFO org.apache.flink.runtime.taskmanager.TaskManager - Memory usage stats: [HEAP: 737/1917/1917 MB, NON HEAP: 58/59/-1 MB (used/committed/max)]
> > > > > 17:35:37,525 INFO org.apache.flink.runtime.taskmanager.TaskManager - Off-heap pool stats: [Code Cache: 19/19/240 MB (used/committed/max)], [Metaspace: 34/34/-1 MB (used/committed/max)], [Compressed Class Space: 4/4/1024 MB (used/committed/max)]
> > > > > 17:35:37,525 INFO org.apache.flink.runtime.taskmanager.TaskManager - Garbage collector stats: [PS Scavenge, GC TIME (ms): 689, GC COUNT: 4], [PS MarkSweep, GC TIME (ms): 974, GC COUNT: 1]
> > > > > 17:35:38,525 INFO org.apache.flink.runtime.taskmanager.TaskManager - Memory usage stats: [HEAP: 747/1917/1917 MB, NON HEAP: 58/59/-1 MB (used/committed/max)]
> > > > > 17:35:38,525 INFO org.apache.flink.runtime.taskmanager.TaskManager - Off-heap pool stats: [Code Cache: 19/20/240 MB (used/committed/max)], [Metaspace: 34/34/-1 MB (used/committed/max)], [Compressed Class Space: 4/4/1024 MB (used/committed/max)]
> > > > > 17:35:38,525 INFO org.apache.flink.runtime.taskmanager.TaskManager - Garbage collector stats: [PS Scavenge, GC TIME (ms): 689, GC COUNT: 4], [PS MarkSweep, GC TIME (ms): 974, GC COUNT: 1]
> > > > > 17:35:39,526 INFO org.apache.flink.runtime.taskmanager.TaskManager - Memory usage stats: [HEAP: 817/1917/1917 MB, NON HEAP: 58/59/-1 MB (used/committed/max)]
> > > > > 17:35:39,526 INFO org.apache.flink.runtime.taskmanager.TaskManager - Off-heap pool stats: [Code Cache: 19/20/240 MB (used/committed/max)], [Metaspace: 34/34/-1 MB (used/committed/max)], [Compressed Class Space: 4/4/1024 MB (used/committed/max)]
> > > > > 17:35:39,526 INFO org.apache.flink.runtime.taskmanager.TaskManager - Garbage collector stats: [PS Scavenge, GC TIME (ms): 689, GC COUNT: 4], [PS MarkSweep, GC TIME (ms): 974, GC COUNT: 1]
> > > > > 17:35:40,526 INFO org.apache.flink.runtime.taskmanager.TaskManager - Memory usage stats: [HEAP: 832/1917/1917 MB, NON HEAP: 58/59/-1 MB (used/committed/max)]
> > > > > 17:35:40,526 INFO org.apache.flink.runtime.taskmanager.TaskManager - Off-heap pool stats: [Code Cache: 19/20/240 MB (used/committed/max)], [Metaspace: 34/34/-1 MB (used/committed/max)], [Compressed Class Space: 4/4/1024 MB (used/committed/max)]
> > > > > 17:35:40,526 INFO org.apache.flink.runtime.taskmanager.TaskManager - Garbage collector stats: [PS Scavenge, GC TIME (ms): 689, GC COUNT: 4], [PS MarkSweep, GC TIME (ms): 974, GC COUNT: 1]
> > > > > 17:35:41,527 INFO org.apache.flink.runtime.taskmanager.TaskManager - Memory usage stats: [HEAP: 840/1917/1917 MB, NON HEAP: 58/59/-1 MB (used/committed/max)]
> > > > > 17:35:41,527 INFO org.apache.flink.runtime.taskmanager.TaskManager - Off-heap pool stats: [Code Cache: 19/20/240 MB (used/committed/max)], [Metaspace: 34/34/-1 MB (used/committed/max)], [Compressed Class Space: 4/4/1024 MB (used/committed/max)]
> > > > > 17:35:41,527 INFO org.apache.flink.runtime.taskmanager.TaskManager - Garbage collector stats: [PS Scavenge, GC TIME (ms): 689, GC COUNT: 4], [PS MarkSweep, GC TIME (ms): 974, GC COUNT: 1]
> > > > > 17:35:42,527 INFO org.apache.flink.runtime.taskmanager.TaskManager - Memory usage stats: [HEAP: 847/1917/1917 MB, NON HEAP: 58/59/-1 MB (used/committed/max)]
> > > > > 17:35:42,527 INFO org.apache.flink.runtime.taskmanager.TaskManager - Off-heap pool stats: [Code Cache: 19/20/240 MB (used/committed/max)], [Metaspace: 34/34/-1 MB (used/committed/max)], [Compressed Class Space: 4/4/1024 MB (used/committed/max)]
> > > > > 17:35:42,527 INFO org.apache.flink.runtime.taskmanager.TaskManager - Garbage collector stats: [PS Scavenge, GC TIME (ms): 689, GC COUNT: 4], [PS MarkSweep, GC TIME (ms): 974, GC COUNT: 1]
> > > > > 17:35:43,599 INFO org.apache.flink.runtime.taskmanager.TaskManager - Memory usage stats: [HEAP: 450/1917/1917 MB, NON HEAP: 58/59/-1 MB (used/committed/max)]
> > > > > 17:35:43,599 INFO org.apache.flink.runtime.taskmanager.TaskManager - Off-heap pool stats: [Code Cache: 19/20/240 MB (used/committed/max)], [Metaspace: 34/34/-1 MB (used/committed/max)], [Compressed Class Space: 4/4/1024 MB (used/committed/max)]
> > > > > 17:35:43,599 INFO org.apache.flink.runtime.taskmanager.TaskManager - Garbage collector stats: [PS Scavenge, GC TIME (ms): 797, GC COUNT: 5], [PS MarkSweep, GC TIME (ms): 974, GC COUNT: 1]
> > > > > 17:35:44,599 INFO org.apache.flink.runtime.taskmanager.TaskManager - Memory usage stats: [HEAP: 508/1917/1917 MB, NON HEAP: 58/59/-1 MB (used/committed/max)]
> > > > > 17:35:44,599 INFO org.apache.flink.runtime.taskmanager.TaskManager - Off-heap pool stats: [Code Cache: 19/20/240 MB (used/committed/max)], [Metaspace: 34/34/-1 MB (used/committed/max)], [Compressed Class Space: 4/4/1024 MB (used/committed/max)]
> > > > > 17:35:44,599 INFO org.apache.flink.runtime.taskmanager.TaskManager - Garbage collector stats: [PS Scavenge, GC TIME (ms): 797, GC COUNT: 5], [PS MarkSweep, GC TIME (ms): 974, GC COUNT: 1]
> > > > > 17:35:45,600 INFO org.apache.flink.runtime.taskmanager.TaskManager - Memory usage stats: [HEAP: 517/1917/1917 MB, NON HEAP: 58/59/-1 MB (used/committed/max)]
> > > > > 17:35:45,600 INFO org.apache.flink.runtime.taskmanager.TaskManager - Off-heap pool stats: [Code Cache: 19/20/240 MB (used/committed/max)], [Metaspace: 34/34/-1 MB (used/committed/max)], [Compressed Class Space: 4/4/1024 MB (used/committed/max)]
> > > > > 17:35:45,600 INFO org.apache.flink.runtime.taskmanager.TaskManager - Garbage collector stats: [PS Scavenge, GC TIME (ms): 797, GC COUNT: 5], [PS MarkSweep, GC TIME (ms): 974, GC COUNT: 1]
> > > > > 17:35:46,600 INFO org.apache.flink.runtime.taskmanager.TaskManager - Memory usage stats: [HEAP: 528/1917/1917 MB, NON HEAP: 58/59/-1 MB (used/committed/max)]
> > > > > 17:35:46,600 INFO org.apache.flink.runtime.taskmanager.TaskManager - Off-heap pool stats: [Code Cache: 19/20/240 MB (used/committed/max)], [Metaspace: 34/34/-1 MB (used/committed/max)], [Compressed Class Space: 4/4/1024 MB (used/committed/max)]
> > > > > 17:35:46,600 INFO org.apache.flink.runtime.taskmanager.TaskManager - Garbage collector stats: [PS Scavenge, GC TIME (ms): 797, GC COUNT: 5], [PS MarkSweep, GC TIME (ms): 974, GC COUNT: 1]
> > > > > 17:35:47,663 INFO org.apache.flink.runtime.taskmanager.TaskManager - Memory usage stats: [HEAP: 541/1917/1917 MB, NON HEAP: 58/59/-1 MB (used/committed/max)]
> > > > > 17:35:47,664 INFO org.apache.flink.runtime.taskmanager.TaskManager - Off-heap pool stats: [Code Cache: 19/20/240 MB (used/committed/max)], [Metaspace: 34/34/-1 MB (used/committed/max)], [Compressed Class Space: 4/4/1024 MB (used/committed/max)]
> > > > > 17:35:47,664 INFO org.apache.flink.runtime.taskmanager.TaskManager - Garbage collector stats: [PS Scavenge, GC TIME (ms): 797, GC COUNT: 5], [PS MarkSweep, GC TIME (ms): 974, GC COUNT: 1]
> > > > > 17:35:48,791 INFO org.apache.flink.runtime.taskmanager.TaskManager - Memory usage stats: [HEAP: 554/1917/1917 MB, NON HEAP: 58/59/-1 MB (used/committed/max)]
> > > > > 17:35:48,791 INFO org.apache.flink.runtime.taskmanager.TaskManager - Off-heap pool stats: [Code Cache: 19/20/240 MB (used/committed/max)], [Metaspace: 34/34/-1 MB (used/committed/max)], [Compressed Class Space: 4/4/1024 MB (used/committed/max)]
> > > > > 17:35:48,791 INFO org.apache.flink.runtime.taskmanager.TaskManager - Garbage collector stats: [PS Scavenge, GC TIME (ms): 797, GC COUNT: 5], [PS MarkSweep, GC TIME (ms): 974, GC COUNT: 1]
> > > > > 17:35:49,794 INFO org.apache.flink.runtime.taskmanager.TaskManager - Memory usage stats: [HEAP: 562/1917/1917 MB, NON HEAP: 58/59/-1 MB (used/committed/max)]
> > > > > 17:35:49,795 INFO org.apache.flink.runtime.taskmanager.TaskManager - Off-heap pool stats: [Code Cache: 19/20/240 MB (used/committed/max)], [Metaspace: 34/34/-1 MB (used/committed/max)], [Compressed Class Space: 4/4/1024 MB (used/committed/max)]
> > > > > 17:35:49,795 INFO org.apache.flink.runtime.taskmanager.TaskManager - Garbage collector stats: [PS Scavenge, GC TIME (ms): 797, GC COUNT: 5], [PS MarkSweep, GC TIME (ms): 974, GC COUNT: 1]
> > > > > 17:35:50,795 INFO org.apache.flink.runtime.taskmanager.TaskManager - Memory usage stats: [HEAP: 569/1917/1917 MB, NON HEAP: 58/59/-1 MB (used/committed/max)]
> > > > > 17:35:50,795 INFO org.apache.flink.runtime.taskmanager.TaskManager - Off-heap pool stats: [Code Cache: 19/20/240 MB (used/committed/max)], [Metaspace: 34/34/-1 MB (used/committed/max)], [Compressed Class Space: 4/4/1024 MB (used/committed/max)]
> > > > > 17:35:50,795 INFO org.apache.flink.runtime.taskmanager.TaskManager - Garbage collector stats: [PS Scavenge, GC TIME (ms): 797, GC COUNT: 5], [PS MarkSweep, GC TIME (ms): 974, GC COUNT: 1]
> > > > > 17:35:51,795 INFO org.apache.flink.runtime.taskmanager.TaskManager - Memory usage stats: [HEAP: 582/1917/1917 MB, NON HEAP: 58/59/-1 MB (used/committed/max)]
> > > > > 17:35:51,795 INFO org.apache.flink.runtime.taskmanager.TaskManager - Off-heap pool stats: [Code Cache: 19/20/240 MB (used/committed/max)], [Metaspace: 34/34/-1 MB
(used/committed/max)], [Compressed Class Space: > 4/4/1024 > > MB > > > > > (used/committed/max)] > > > > > 17:35:51,795 INFO > > > > > org.apache.flink.runtime.taskmanager.TaskManager - > > Garbage > > > > > collector stats: [PS Scavenge, GC TIME (ms): 797, GC COUNT: 5], [PS > > > > > MarkSweep, GC TIME (ms): 974, GC COUNT: 1] > > > > > 17:35:52,796 INFO > > > > > org.apache.flink.runtime.taskmanager.TaskManager - > > Memory > > > > > usage stats: [HEAP: 593/1917/1917 MB, NON HEAP: 58/59/-1 MB > > > > > (used/committed/max)] > > > > > 17:35:52,796 INFO > > > > > org.apache.flink.runtime.taskmanager.TaskManager - > > > Off-heap > > > > > pool stats: [Code Cache: 19/20/240 MB (used/committed/max)], > > > [Metaspace: > > > > > 34/34/-1 MB (used/committed/max)], [Compressed Class Space: > 4/4/1024 > > MB > > > > > (used/committed/max)] > > > > > 17:35:52,796 INFO > > > > > org.apache.flink.runtime.taskmanager.TaskManager - > > Garbage > > > > > collector stats: [PS Scavenge, GC TIME (ms): 797, GC COUNT: 5], [PS > > > > > MarkSweep, GC TIME (ms): 974, GC COUNT: 1] > > > > > 17:35:53,796 INFO > > > > > org.apache.flink.runtime.taskmanager.TaskManager - > > Memory > > > > > usage stats: [HEAP: 600/1917/1917 MB, NON HEAP: 58/59/-1 MB > > > > > (used/committed/max)] > > > > > 17:35:53,796 INFO > > > > > org.apache.flink.runtime.taskmanager.TaskManager - > > > Off-heap > > > > > pool stats: [Code Cache: 19/20/240 MB (used/committed/max)], > > > [Metaspace: > > > > > 34/34/-1 MB (used/committed/max)], [Compressed Class Space: > 4/4/1024 > > MB > > > > > (used/committed/max)] > > > > > 17:35:53,796 INFO > > > > > org.apache.flink.runtime.taskmanager.TaskManager - > > Garbage > > > > > collector stats: [PS Scavenge, GC TIME (ms): 797, GC COUNT: 5], [PS > > > > > MarkSweep, GC TIME (ms): 974, GC COUNT: 1] > > > > > 17:35:54,797 INFO > > > > > org.apache.flink.runtime.taskmanager.TaskManager - > > Memory > > > > > usage stats: [HEAP: 604/1917/1917 MB, NON HEAP: 58/59/-1 MB > > > 
> > (used/committed/max)] > > > > > 17:35:54,797 INFO > > > > > org.apache.flink.runtime.taskmanager.TaskManager - > > > Off-heap > > > > > pool stats: [Code Cache: 19/20/240 MB (used/committed/max)], > > > [Metaspace: > > > > > 34/34/-1 MB (used/committed/max)], [Compressed Class Space: > 4/4/1024 > > MB > > > > > (used/committed/max)] > > > > > 17:35:54,797 INFO > > > > > org.apache.flink.runtime.taskmanager.TaskManager - > > Garbage > > > > > collector stats: [PS Scavenge, GC TIME (ms): 797, GC COUNT: 5], [PS > > > > > MarkSweep, GC TIME (ms): 974, GC COUNT: 1] > > > > > 17:35:55,797 INFO > > > > > org.apache.flink.runtime.taskmanager.TaskManager - > > Memory > > > > > usage stats: [HEAP: 610/1917/1917 MB, NON HEAP: 58/59/-1 MB > > > > > (used/committed/max)] > > > > > 17:35:55,797 INFO > > > > > org.apache.flink.runtime.taskmanager.TaskManager - > > > Off-heap > > > > > pool stats: [Code Cache: 19/20/240 MB (used/committed/max)], > > > [Metaspace: > > > > > 34/34/-1 MB (used/committed/max)], [Compressed Class Space: > 4/4/1024 > > MB > > > > > (used/committed/max)] > > > > > 17:35:55,797 INFO > > > > > org.apache.flink.runtime.taskmanager.TaskManager - > > Garbage > > > > > collector stats: [PS Scavenge, GC TIME (ms): 797, GC COUNT: 5], [PS > > > > > MarkSweep, GC TIME (ms): 974, GC COUNT: 1] > > > > > 17:35:56,797 INFO > > > > > org.apache.flink.runtime.taskmanager.TaskManager - > > Memory > > > > > usage stats: [HEAP: 615/1917/1917 MB, NON HEAP: 58/59/-1 MB > > > > > (used/committed/max)] > > > > > 17:35:56,798 INFO > > > > > org.apache.flink.runtime.taskmanager.TaskManager - > > > Off-heap > > > > > pool stats: [Code Cache: 19/20/240 MB (used/committed/max)], > > > [Metaspace: > > > > > 34/34/-1 MB (used/committed/max)], [Compressed Class Space: > 4/4/1024 > > MB > > > > > (used/committed/max)] > > > > > 17:35:56,798 INFO > > > > > org.apache.flink.runtime.taskmanager.TaskManager - > > Garbage > > > > > collector stats: [PS Scavenge, GC TIME (ms): 797, GC 
COUNT: 5], [PS > > > > > MarkSweep, GC TIME (ms): 974, GC COUNT: 1] > > > > > 17:35:57,798 INFO > > > > > org.apache.flink.runtime.taskmanager.TaskManager - > > Memory > > > > > usage stats: [HEAP: 624/1917/1917 MB, NON HEAP: 58/59/-1 MB > > > > > (used/committed/max)] > > > > > 17:35:57,798 INFO > > > > > org.apache.flink.runtime.taskmanager.TaskManager - > > > Off-heap > > > > > pool stats: [Code Cache: 19/20/240 MB (used/committed/max)], > > > [Metaspace: > > > > > 34/34/-1 MB (used/committed/max)], [Compressed Class Space: > 4/4/1024 > > MB > > > > > (used/committed/max)] > > > > > 17:35:57,798 INFO > > > > > org.apache.flink.runtime.taskmanager.TaskManager - > > Garbage > > > > > collector stats: [PS Scavenge, GC TIME (ms): 797, GC COUNT: 5], [PS > > > > > MarkSweep, GC TIME (ms): 974, GC COUNT: 1] > > > > > 17:35:58,798 INFO > > > > > org.apache.flink.runtime.taskmanager.TaskManager - > > Memory > > > > > usage stats: [HEAP: 636/1917/1917 MB, NON HEAP: 58/59/-1 MB > > > > > (used/committed/max)] > > > > > 17:35:58,798 INFO > > > > > org.apache.flink.runtime.taskmanager.TaskManager - > > > Off-heap > > > > > pool stats: [Code Cache: 19/20/240 MB (used/committed/max)], > > > [Metaspace: > > > > > 34/34/-1 MB (used/committed/max)], [Compressed Class Space: > 4/4/1024 > > MB > > > > > (used/committed/max)] > > > > > 17:35:58,798 INFO > > > > > org.apache.flink.runtime.taskmanager.TaskManager - > > Garbage > > > > > collector stats: [PS Scavenge, GC TIME (ms): 797, GC COUNT: 5], [PS > > > > > MarkSweep, GC TIME (ms): 974, GC COUNT: 1] > > > > > 17:35:59,799 INFO > > > > > org.apache.flink.runtime.taskmanager.TaskManager - > > Memory > > > > > usage stats: [HEAP: 641/1917/1917 MB, NON HEAP: 58/59/-1 MB > > > > > (used/committed/max)] > > > > > 17:35:59,799 INFO > > > > > org.apache.flink.runtime.taskmanager.TaskManager - > > > Off-heap > > > > > pool stats: [Code Cache: 19/20/240 MB (used/committed/max)], > > > [Metaspace: > > > > > 34/34/-1 MB 
(used/committed/max)], [Compressed Class Space: > 4/4/1024 > > MB > > > > > (used/committed/max)] > > > > > 17:35:59,799 INFO > > > > > org.apache.flink.runtime.taskmanager.TaskManager - > > Garbage > > > > > collector stats: [PS Scavenge, GC TIME (ms): 797, GC COUNT: 5], [PS > > > > > MarkSweep, GC TIME (ms): 974, GC COUNT: 1] > > > > > 17:36:00,799 INFO > > > > > org.apache.flink.runtime.taskmanager.TaskManager - > > Memory > > > > > usage stats: [HEAP: 648/1917/1917 MB, NON HEAP: 58/59/-1 MB > > > > > (used/committed/max)] > > > > > 17:36:00,799 INFO > > > > > org.apache.flink.runtime.taskmanager.TaskManager - > > > Off-heap > > > > > pool stats: [Code Cache: 19/20/240 MB (used/committed/max)], > > > [Metaspace: > > > > > 34/34/-1 MB (used/committed/max)], [Compressed Class Space: > 4/4/1024 > > MB > > > > > (used/committed/max)] > > > > > 17:36:00,799 INFO > > > > > org.apache.flink.runtime.taskmanager.TaskManager - > > Garbage > > > > > collector stats: [PS Scavenge, GC TIME (ms): 797, GC COUNT: 5], [PS > > > > > MarkSweep, GC TIME (ms): 974, GC COUNT: 1] > > > > > 17:36:01,821 INFO > > > > > org.apache.flink.runtime.taskmanager.TaskManager - > > Memory > > > > > usage stats: [HEAP: 655/1917/1917 MB, NON HEAP: 58/60/-1 MB > > > > > (used/committed/max)] > > > > > 17:36:01,936 INFO > > > > > org.apache.flink.runtime.taskmanager.TaskManager - > > > Off-heap > > > > > pool stats: [Code Cache: 19/20/240 MB (used/committed/max)], > > > [Metaspace: > > > > > 34/35/-1 MB (used/committed/max)], [Compressed Class Space: > 4/4/1024 > > MB > > > > > (used/committed/max)] > > > > > 17:36:01,936 INFO > > > > > org.apache.flink.runtime.taskmanager.TaskManager - > > Garbage > > > > > collector stats: [PS Scavenge, GC TIME (ms): 797, GC COUNT: 5], [PS > > > > > MarkSweep, GC TIME (ms): 974, GC COUNT: 1] > > > > > 17:36:02,937 INFO > > > > > org.apache.flink.runtime.taskmanager.TaskManager - > > Memory > > > > > usage stats: [HEAP: 665/1917/1917 MB, NON HEAP: 58/60/-1 MB > > > 
> > (used/committed/max)] > > > > > 17:36:02,937 INFO > > > > > org.apache.flink.runtime.taskmanager.TaskManager - > > > Off-heap > > > > > pool stats: [Code Cache: 19/20/240 MB (used/committed/max)], > > > [Metaspace: > > > > > 34/35/-1 MB (used/committed/max)], [Compressed Class Space: > 4/4/1024 > > MB > > > > > (used/committed/max)] > > > > > 17:36:02,937 INFO > > > > > org.apache.flink.runtime.taskmanager.TaskManager - > > Garbage > > > > > collector stats: [PS Scavenge, GC TIME (ms): 797, GC COUNT: 5], [PS > > > > > MarkSweep, GC TIME (ms): 974, GC COUNT: 1] > > > > > 17:36:03,944 INFO > > > > > org.apache.flink.runtime.taskmanager.TaskManager - > > Memory > > > > > usage stats: [HEAP: 666/1917/1917 MB, NON HEAP: 58/60/-1 MB > > > > > (used/committed/max)] > > > > > 17:36:03,950 INFO > > > > > org.apache.flink.runtime.taskmanager.TaskManager - > > > Off-heap > > > > > pool stats: [Code Cache: 19/20/240 MB (used/committed/max)], > > > [Metaspace: > > > > > 34/35/-1 MB (used/committed/max)], [Compressed Class Space: > 4/4/1024 > > MB > > > > > (used/committed/max)] > > > > > 17:36:03,951 INFO > > > > > org.apache.flink.runtime.taskmanager.TaskManager - > > Garbage > > > > > collector stats: [PS Scavenge, GC TIME (ms): 797, GC COUNT: 5], [PS > > > > > MarkSweep, GC TIME (ms): 974, GC COUNT: 1] > > > > > > > > > > On Thu, Oct 29, 2015 at 1:55 PM, Till Rohrmann < > trohrm...@apache.org > > > > > > > wrote: > > > > > > > > > >> What does the log of the failed TaskManager 10.0.88.140 say? > > > > >> > > > > >> On Thu, Oct 29, 2015 at 6:44 PM, Greg Hogan <c...@greghogan.com> > > > wrote: > > > > >> > > > > >>> I removed the use of numactl but left in starting two > TaskManagers > > > and > > > > am > > > > >>> still seeing TaskManagers crash. 
> > > > >>> From the JobManager log:
> > > > >>>
> > > > >>> 17:36:06,412 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@10.0.88.140:45742] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
> > > > >>> 17:36:06,567 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (370/2322) (cac9927a8568c2ad79439262a91478af) switched from RUNNING to FAILED
> > > > >>> 17:36:06,572 INFO org.apache.flink.runtime.jobmanager.JobManager - Status of job 14d946015fd7b35eb801ea6fee5af9e4 (Flink Java Job at Thu Oct 29 17:34:48 UTC 2015) changed to FAILING.
> > > > >>> java.lang.Exception: The data preparation for task 'CHAIN GroupReduce (Compute scores) -> FlatMap (checksum())' , caused an error: Error obtaining the sorted input: Thread 'SortMerger Reading Thread' terminated due to an exception: Connection unexpectedly closed by remote task manager 'ip-10-0-88-140/10.0.88.140:58558'. This might indicate that the remote task manager was lost.
> > > > >>>     at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:465)
> > > > >>>     at org.apache.flink.runtime.operators.BatchTask.invoke(BatchTask.java:354)
> > > > >>>     at org.apache.flink.runtime.taskmanager.Task.run(Task.java:584)
> > > > >>>     at java.lang.Thread.run(Thread.java:745)
> > > > >>> Caused by: java.lang.RuntimeException: Error obtaining the sorted input: Thread 'SortMerger Reading Thread' terminated due to an exception: Connection unexpectedly closed by remote task manager 'ip-10-0-88-140/10.0.88.140:58558'. This might indicate that the remote task manager was lost.
> > > > >>>     at org.apache.flink.runtime.operators.sort.UnilateralSortMerger.getIterator(UnilateralSortMerger.java:619)
> > > > >>>     at org.apache.flink.runtime.operators.BatchTask.getInput(BatchTask.java:1089)
> > > > >>>     at org.apache.flink.runtime.operators.GroupReduceDriver.prepare(GroupReduceDriver.java:94)
> > > > >>>     at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:459)
> > > > >>>     ... 3 more
> > > > >>> Caused by: java.io.IOException: Thread 'SortMerger Reading Thread' terminated due to an exception: Connection unexpectedly closed by remote task manager 'ip-10-0-88-140/10.0.88.140:58558'. This might indicate that the remote task manager was lost.
> > > > >>>     at org.apache.flink.runtime.operators.sort.UnilateralSortMerger$ThreadBase.run(UnilateralSortMerger.java:800)
> > > > >>> Caused by: org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: Connection unexpectedly closed by remote task manager 'ip-10-0-88-140/10.0.88.140:58558'. This might indicate that the remote task manager was lost.
> > > > >>>     at org.apache.flink.runtime.io.network.netty.PartitionRequestClientHandler.channelInactive(PartitionRequestClientHandler.java:119)
> > > > >>>     at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:208)
> > > > >>>     at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:194)
> > > > >>>     at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
> > > > >>>     at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:208)
> > > > >>>     at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:194)
> > > > >>>     at io.netty.handler.codec.ByteToMessageDecoder.channelInactive(ByteToMessageDecoder.java:306)
> > > > >>>     at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:208)
> > > > >>>     at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:194)
> > > > >>>     at io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:828)
> > > > >>>     at io.netty.channel.AbstractChannel$AbstractUnsafe$7.run(AbstractChannel.java:621)
> > > > >>>     at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:358)
> > > > >>>     at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
> > > > >>>     at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:112)
> > > > >>>     at java.lang.Thread.run(Thread.java:745)
> > > > >>> 17:36:06,587 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (367/2322) (d63c681a18b8164bc24936df1ecb159b) switched from RUNNING to FAILED
> > > > >>>
> > > > >>> On Thu, Oct 29, 2015 at 1:00 PM, Stephan Ewen <se...@apache.org> wrote:
> > > > >>>
> > > > >>>> Hi Greg!
> > > > >>>>
> > > > >>>> Interesting... When you say the TaskManagers are dropping, are the TaskManager processes crashing, or are they losing connection to the JobManager?
> > > > >>>>
> > > > >>>> Greetings,
> > > > >>>> Stephan
> > > > >>>>
> > > > >>>> On Thu, Oct 29, 2015 at 9:56 AM, Greg Hogan <c...@greghogan.com> wrote:
> > > > >>>>
> > > > >>>>> I recently discovered that AWS uses NUMA for its largest nodes. An example c4.8xlarge:
> > > > >>>>>
> > > > >>>>> $ numactl --hardware
> > > > >>>>> available: 2 nodes (0-1)
> > > > >>>>> node 0 cpus: 0 1 2 3 4 5 6 7 8 18 19 20 21 22 23 24 25 26
> > > > >>>>> node 0 size: 29813 MB
> > > > >>>>> node 0 free: 24537 MB
> > > > >>>>> node 1 cpus: 9 10 11 12 13 14 15 16 17 27 28 29 30 31 32 33 34 35
> > > > >>>>> node 1 size: 30574 MB
> > > > >>>>> node 1 free: 22757 MB
> > > > >>>>> node distances:
> > > > >>>>> node   0   1
> > > > >>>>>   0:  10  20
> > > > >>>>>   1:  20  10
> > > > >>>>>
> > > > >>>>> I discovered yesterday that Flink performed ~20-30% faster on large datasets by running two NUMA-constrained TaskManagers per node. The JobManager node ran a single TaskManager. Resources were divided in half relative to running a single TaskManager.
> > > > >>>>>
> > > > >>>>> The changes from the tail of /bin/taskmanager.sh:
> > > > >>>>>
> > > > >>>>> -"${FLINK_BIN_DIR}"/flink-daemon.sh $STARTSTOP taskmanager "${args[@]}"
> > > > >>>>> +numactl --membind=0 --cpunodebind=0 "${FLINK_BIN_DIR}"/flink-daemon.sh $STARTSTOP taskmanager "${args[@]}"
> > > > >>>>> +numactl --membind=1 --cpunodebind=1 "${FLINK_BIN_DIR}"/flink-daemon.sh $STARTSTOP taskmanager "${args[@]}"
> > > > >>>>>
> > > > >>>>> After reverting this change the system is again stable. I had not experienced issues using numactl when running 16 nodes.
> > > > >>>>>
> > > > >>>>> Greg
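
For anyone sizing NUMA-bound TaskManagers, a rough check along these lines may help. It is only a sketch: it parses the `numactl --hardware` output quoted in this thread and compares each node's size against a simplified per-TaskManager budget of heap plus network buffers, using the settings reported at the top of the thread (taskmanager.heap.mb: 18000, 414720 buffers of 16384 bytes). The helper names and the `network_factor` parameter are made up for illustration; the factor of 2 models the "2 * (network memory)" worst case quoted from FLINK-2865. The budget ignores Metaspace, code cache and other JVM overhead, and it treats the MB figures from numactl and from the Flink settings as the same unit, so the real footprint will be higher.

```python
import re

# Relevant lines of the `numactl --hardware` output quoted in this thread.
NUMACTL_HARDWARE = """\
available: 2 nodes (0-1)
node 0 size: 29813 MB
node 0 free: 24537 MB
node 1 size: 30574 MB
node 1 free: 22757 MB
"""


def node_sizes_mb(text):
    """Return {node id: size in MB} parsed from `numactl --hardware` text."""
    pattern = re.compile(r"^node (\d+) size: (\d+) MB$", re.MULTILINE)
    return {int(node): int(mb) for node, mb in pattern.findall(text)}


def taskmanager_budget_mb(heap_mb, num_buffers, segment_bytes, network_factor=1):
    """Simplified budget (assumption): heap plus network buffers, no JVM overhead.

    network_factor is a hypothetical knob; 2 models the worst case where the
    network buffers are held twice (once by Flink, once in the Netty pool).
    """
    network_mb = num_buffers * segment_bytes / (1024 * 1024)
    return heap_mb + network_factor * network_mb


if __name__ == "__main__":
    sizes = node_sizes_mb(NUMACTL_HARDWARE)
    for factor in (1, 2):
        need = taskmanager_budget_mb(18000, 414720, 16384, factor)
        for node, size in sorted(sizes.items()):
            verdict = "fits within" if need <= size else "exceeds"
            print(f"network x{factor}: {need:.0f} MB {verdict} node {node} ({size} MB)")
```

With --membind the kernel cannot spill allocations to the other node, so a TaskManager whose budget exceeds its node is a likely OOM-killer candidate even when the machine as a whole still has free memory.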