Could it be a problem that there are two TaskManagers running per machine?
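
To compare heap pressure between the two TaskManagers on a machine, it may help to pull the heap figures out of each log. Below is a minimal sketch, assuming the "Memory usage stats" format and the "> " quoting and line wrapping seen in the log quoted below; the names `HEAP_RE` and `heap_samples` are made up for illustration and are not part of Flink.

```python
import re

# Matches one "Memory usage stats" entry after the wrapped lines are re-joined.
# Groups: time (hh:mm:ss), heap used, heap committed, heap max (all in MB).
HEAP_RE = re.compile(
    r"(\d{2}:\d{2}:\d{2}),\d{3}\s+INFO\s+\S+\s+-\s+Memory\s+usage\s+stats:\s+"
    r"\[HEAP:\s+(\d+)/(\d+)/(\d+)\s+MB"
)


def heap_samples(quoted_log):
    """Return (time, used_mb, committed_mb, max_mb) for each heap entry."""
    lines = []
    for line in quoted_log.splitlines():
        line = line.strip()
        # Drop the email quote marker, if present.
        if line.startswith(">"):
            line = line[1:].strip()
        lines.append(line)
    # Undo the mail client's line wrapping by joining everything with spaces.
    text = " ".join(lines)
    return [(t, int(u), int(c), int(m)) for t, u, c, m in HEAP_RE.findall(text)]


# Two entries copied from the quoted log.
sample = """\
> 17:35:42,527 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Memory
> usage stats: [HEAP: 847/1917/1917 MB, NON HEAP: 58/59/-1 MB
> (used/committed/max)]
> 17:35:43,599 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Memory
> usage stats: [HEAP: 450/1917/1917 MB, NON HEAP: 58/59/-1 MB
> (used/committed/max)]
"""

for time, used, committed, maximum in heap_samples(sample):
    print(time, used, "MB used of", maximum, "MB max")
```

Running the same function over the log of the second TaskManager on 10.0.88.140 would show whether both processes climb toward their 1917 MB maximum at the same time.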
> On 29 Oct 2015, at 19:04, Greg Hogan <c...@greghogan.com> wrote:
>
> I have memory logging enabled. Tail of TaskManager log on 10.0.88.140:
>
> 17:35:26,415 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Garbage
> collector stats: [PS Scavenge, GC TIME (ms): 341, GC COUNT: 3], [PS
> MarkSweep, GC TIME (ms): 974, GC COUNT: 1]
> 17:35:27,415 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Memory
> usage stats: [HEAP: 576/1917/1917 MB, NON HEAP: 56/58/-1 MB
> (used/committed/max)]
> 17:35:27,415 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Off-heap
> pool stats: [Code Cache: 18/19/240 MB (used/committed/max)], [Metaspace:
> 34/34/-1 MB (used/committed/max)], [Compressed Class Space: 4/4/1024 MB
> (used/committed/max)]
> 17:35:27,415 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Garbage
> collector stats: [PS Scavenge, GC TIME (ms): 341, GC COUNT: 3], [PS
> MarkSweep, GC TIME (ms): 974, GC COUNT: 1]
> 17:35:28,012 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Received
> task CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (938/2322)
> 17:35:28,015 INFO
> org.apache.flink.runtime.taskmanager.Task - Loading JAR
> files for task CHAIN GroupReduce (Compute scores) -> FlatMap (checksum())
> (938/2322)
> 17:35:28,016 INFO
> org.apache.flink.runtime.taskmanager.Task - Registering
> task at network: CHAIN GroupReduce (Compute scores) -> FlatMap (checksum())
> (938/2322) [DEPLOYING]
> 17:35:28,065 INFO
> org.apache.flink.runtime.taskmanager.Task - CHAIN
> GroupReduce (Compute scores) -> FlatMap (checksum()) (938/2322) switched to
> RUNNING
> 17:35:28,100 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Received
> task CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (2304/2322)
> 17:35:28,116 INFO
> org.apache.flink.runtime.taskmanager.Task - Loading JAR
> files for task CHAIN GroupReduce (Compute scores) -> FlatMap (checksum())
> (2304/2322)
> 17:35:28,116 INFO
> org.apache.flink.runtime.taskmanager.Task - Registering
> task at network: CHAIN GroupReduce (Compute scores) -> FlatMap (checksum())
> (2304/2322) [DEPLOYING]
> 17:35:28,132 INFO
> org.apache.flink.runtime.taskmanager.Task - CHAIN
> GroupReduce (Compute scores) -> FlatMap (checksum()) (2304/2322) switched
> to RUNNING
> 17:35:28,255 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Received
> task CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (939/2322)
> 17:35:28,263 INFO
> org.apache.flink.runtime.taskmanager.Task - Loading JAR
> files for task CHAIN GroupReduce (Compute scores) -> FlatMap (checksum())
> (939/2322)
> 17:35:28,263 INFO
> org.apache.flink.runtime.taskmanager.Task - Registering
> task at network: CHAIN GroupReduce (Compute scores) -> FlatMap (checksum())
> (939/2322) [DEPLOYING]
> 17:35:28,304 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Received
> task CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (2062/2322)
> 17:35:28,311 INFO
> org.apache.flink.runtime.taskmanager.Task - Loading JAR
> files for task CHAIN GroupReduce (Compute scores) -> FlatMap (checksum())
> (2062/2322)
> 17:35:28,311 INFO
> org.apache.flink.runtime.taskmanager.Task - Registering
> task at network: CHAIN GroupReduce (Compute scores) -> FlatMap (checksum())
> (2062/2322) [DEPLOYING]
> 17:35:28,323 INFO
> org.apache.flink.runtime.taskmanager.Task - CHAIN
> GroupReduce (Compute scores) -> FlatMap (checksum()) (939/2322) switched to
> RUNNING
> 17:35:28,386 INFO
> org.apache.flink.runtime.taskmanager.Task - CHAIN
> GroupReduce (Compute scores) -> FlatMap (checksum()) (2062/2322) switched
> to RUNNING
> 17:35:28,396 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Received
> task CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (1775/2322)
> 17:35:28,401 INFO
> org.apache.flink.runtime.taskmanager.Task - Loading JAR
> files for task CHAIN GroupReduce (Compute scores) -> FlatMap (checksum())
> (1775/2322)
> 17:35:28,402 INFO
> org.apache.flink.runtime.taskmanager.Task - Registering
> task at network: CHAIN GroupReduce (Compute scores) -> FlatMap (checksum())
> (1775/2322) [DEPLOYING]
> 17:35:28,416 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Memory
> usage stats: [HEAP: 747/1917/1917 MB, NON HEAP: 56/58/-1 MB
> (used/committed/max)]
> 17:35:28,416 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Off-heap
> pool stats: [Code Cache: 18/19/240 MB (used/committed/max)], [Metaspace:
> 34/34/-1 MB (used/committed/max)], [Compressed Class Space: 4/4/1024 MB
> (used/committed/max)]
> 17:35:28,416 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Garbage
> collector stats: [PS Scavenge, GC TIME (ms): 341, GC COUNT: 3], [PS
> MarkSweep, GC TIME (ms): 974, GC COUNT: 1]
> 17:35:28,419 INFO
> org.apache.flink.runtime.taskmanager.Task - CHAIN
> GroupReduce (Compute scores) -> FlatMap (checksum()) (1775/2322) switched
> to RUNNING
> 17:35:28,475 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Received
> task CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (2158/2322)
> 17:35:28,475 INFO
> org.apache.flink.runtime.taskmanager.Task - Loading JAR
> files for task CHAIN GroupReduce (Compute scores) -> FlatMap (checksum())
> (2158/2322)
> 17:35:28,476 INFO
> org.apache.flink.runtime.taskmanager.Task - Registering
> task at network: CHAIN GroupReduce (Compute scores) -> FlatMap (checksum())
> (2158/2322) [DEPLOYING]
> 17:35:28,509 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Received
> task CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (1463/2322)
> 17:35:28,860 INFO
> org.apache.flink.runtime.taskmanager.Task - Loading JAR
> files for task CHAIN GroupReduce (Compute scores) -> FlatMap (checksum())
> (1463/2322)
> 17:35:28,861 INFO
> org.apache.flink.runtime.taskmanager.Task - Registering
> task at network: CHAIN GroupReduce (Compute scores) -> FlatMap (checksum())
> (1463/2322) [DEPLOYING]
> 17:35:28,862 INFO
> org.apache.flink.runtime.taskmanager.Task - CHAIN
> GroupReduce (Compute scores) -> FlatMap (checksum()) (2158/2322) switched
> to RUNNING
> 17:35:28,878 INFO
> org.apache.flink.runtime.taskmanager.Task - CHAIN
> GroupReduce (Compute scores) -> FlatMap (checksum()) (1463/2322) switched
> to RUNNING
> 17:35:28,892 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Received
> task CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (1154/2322)
> 17:35:28,893 INFO
> org.apache.flink.runtime.taskmanager.Task - Loading JAR
> files for task CHAIN GroupReduce (Compute scores) -> FlatMap (checksum())
> (1154/2322)
> 17:35:28,893 INFO
> org.apache.flink.runtime.taskmanager.Task - Registering
> task at network: CHAIN GroupReduce (Compute scores) -> FlatMap (checksum())
> (1154/2322) [DEPLOYING]
> 17:35:28,914 INFO
> org.apache.flink.runtime.taskmanager.Task - CHAIN
> GroupReduce (Compute scores) -> FlatMap (checksum()) (1154/2322) switched
> to RUNNING
> 17:35:28,916 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Received
> task CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (1429/2322)
> 17:35:28,917 INFO
> org.apache.flink.runtime.taskmanager.Task - Loading JAR
> files for task CHAIN GroupReduce (Compute scores) -> FlatMap (checksum())
> (1429/2322)
> 17:35:28,917 INFO
> org.apache.flink.runtime.taskmanager.Task - Registering
> task at network: CHAIN GroupReduce (Compute scores) -> FlatMap (checksum())
> (1429/2322) [DEPLOYING]
> 17:35:28,942 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Received
> task CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (1078/2322)
> 17:35:28,942 INFO
> org.apache.flink.runtime.taskmanager.Task - Loading JAR
> files for task CHAIN GroupReduce (Compute scores) -> FlatMap (checksum())
> (1078/2322)
> 17:35:28,942 INFO
> org.apache.flink.runtime.taskmanager.Task - Registering
> task at network: CHAIN GroupReduce (Compute scores) -> FlatMap (checksum())
> (1078/2322) [DEPLOYING]
> 17:35:28,943 INFO
> org.apache.flink.runtime.taskmanager.Task - CHAIN
> GroupReduce (Compute scores) -> FlatMap (checksum()) (1429/2322) switched
> to RUNNING
> 17:35:28,955 INFO
> org.apache.flink.runtime.taskmanager.Task - CHAIN
> GroupReduce (Compute scores) -> FlatMap (checksum()) (1078/2322) switched
> to RUNNING
> 17:35:28,959 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Received
> task CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (524/2322)
> 17:35:28,995 INFO
> org.apache.flink.runtime.taskmanager.Task - Loading JAR
> files for task CHAIN GroupReduce (Compute scores) -> FlatMap (checksum())
> (524/2322)
> 17:35:28,995 INFO
> org.apache.flink.runtime.taskmanager.Task - Registering
> task at network: CHAIN GroupReduce (Compute scores) -> FlatMap (checksum())
> (524/2322) [DEPLOYING]
> 17:35:29,000 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Received
> task CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (2021/2322)
> 17:35:29,000 INFO
> org.apache.flink.runtime.taskmanager.Task - Loading JAR
> files for task CHAIN GroupReduce (Compute scores) -> FlatMap (checksum())
> (2021/2322)
> 17:35:29,000 INFO
> org.apache.flink.runtime.taskmanager.Task - Registering
> task at network: CHAIN GroupReduce (Compute scores) -> FlatMap (checksum())
> (2021/2322) [DEPLOYING]
> 17:35:29,012 INFO
> org.apache.flink.runtime.taskmanager.Task - CHAIN
> GroupReduce (Compute scores) -> FlatMap (checksum()) (524/2322) switched to
> RUNNING
> 17:35:29,039 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Received
> task CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (2022/2322)
> 17:35:29,039 INFO
> org.apache.flink.runtime.taskmanager.Task - CHAIN
> GroupReduce (Compute scores) -> FlatMap (checksum()) (2021/2322) switched
> to RUNNING
> 17:35:29,043 INFO
> org.apache.flink.runtime.taskmanager.Task - Loading JAR
> files for task CHAIN GroupReduce (Compute scores) -> FlatMap (checksum())
> (2022/2322)
> 17:35:29,043 INFO
> org.apache.flink.runtime.taskmanager.Task - Registering
> task at network: CHAIN GroupReduce (Compute scores) -> FlatMap (checksum())
> (2022/2322) [DEPLOYING]
> 17:35:29,076 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Received
> task CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (1464/2322)
> 17:35:29,081 INFO
> org.apache.flink.runtime.taskmanager.Task - Loading JAR
> files for task CHAIN GroupReduce (Compute scores) -> FlatMap (checksum())
> (1464/2322)
> 17:35:29,081 INFO
> org.apache.flink.runtime.taskmanager.Task - Registering
> task at network: CHAIN GroupReduce (Compute scores) -> FlatMap (checksum())
> (1464/2322) [DEPLOYING]
> 17:35:29,095 INFO
> org.apache.flink.runtime.taskmanager.Task - CHAIN
> GroupReduce (Compute scores) -> FlatMap (checksum()) (2022/2322) switched
> to RUNNING
> 17:35:29,108 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Received
> task CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (1095/2322)
> 17:35:29,110 INFO
> org.apache.flink.runtime.taskmanager.Task - CHAIN
> GroupReduce (Compute scores) -> FlatMap (checksum()) (1464/2322) switched
> to RUNNING
> 17:35:29,112 INFO
> org.apache.flink.runtime.taskmanager.Task - Loading JAR
> files for task CHAIN GroupReduce (Compute scores) -> FlatMap (checksum())
> (1095/2322)
> 17:35:29,112 INFO
> org.apache.flink.runtime.taskmanager.Task - Registering
> task at network: CHAIN GroupReduce (Compute scores) -> FlatMap (checksum())
> (1095/2322) [DEPLOYING]
> 17:35:29,140 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Received
> task CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (2306/2322)
> 17:35:29,142 INFO
> org.apache.flink.runtime.taskmanager.Task - Loading JAR
> files for task CHAIN GroupReduce (Compute scores) -> FlatMap (checksum())
> (2306/2322)
> 17:35:29,142 INFO
> org.apache.flink.runtime.taskmanager.Task - Registering
> task at network: CHAIN GroupReduce (Compute scores) -> FlatMap (checksum())
> (2306/2322) [DEPLOYING]
> 17:35:29,147 INFO
> org.apache.flink.runtime.taskmanager.Task - CHAIN
> GroupReduce (Compute scores) -> FlatMap (checksum()) (1095/2322) switched
> to RUNNING
> 17:35:29,152 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Received
> task CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (974/2322)
> 17:35:29,153 INFO
> org.apache.flink.runtime.taskmanager.Task - CHAIN
> GroupReduce (Compute scores) -> FlatMap (checksum()) (2306/2322) switched
> to RUNNING
> 17:35:29,155 INFO
> org.apache.flink.runtime.taskmanager.Task - Loading JAR
> files for task CHAIN GroupReduce (Compute scores) -> FlatMap (checksum())
> (974/2322)
> 17:35:29,155 INFO
> org.apache.flink.runtime.taskmanager.Task - Registering
> task at network: CHAIN GroupReduce (Compute scores) -> FlatMap (checksum())
> (974/2322) [DEPLOYING]
> 17:35:29,166 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Received
> task CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (2305/2322)
> 17:35:29,167 INFO
> org.apache.flink.runtime.taskmanager.Task - Loading JAR
> files for task CHAIN GroupReduce (Compute scores) -> FlatMap (checksum())
> (2305/2322)
> 17:35:29,167 INFO
> org.apache.flink.runtime.taskmanager.Task - Registering
> task at network: CHAIN GroupReduce (Compute scores) -> FlatMap (checksum())
> (2305/2322) [DEPLOYING]
> 17:35:29,176 INFO
> org.apache.flink.runtime.taskmanager.Task - CHAIN
> GroupReduce (Compute scores) -> FlatMap (checksum()) (974/2322) switched to
> RUNNING
> 17:35:29,205 INFO
> org.apache.flink.runtime.taskmanager.Task - CHAIN
> GroupReduce (Compute scores) -> FlatMap (checksum()) (2305/2322) switched
> to RUNNING
> 17:35:29,417 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Memory
> usage stats: [HEAP: 590/1917/1917 MB, NON HEAP: 57/59/-1 MB
> (used/committed/max)]
> 17:35:29,417 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Off-heap
> pool stats: [Code Cache: 18/19/240 MB (used/committed/max)], [Metaspace:
> 34/34/-1 MB (used/committed/max)], [Compressed Class Space: 4/4/1024 MB
> (used/committed/max)]
> 17:35:29,417 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Garbage
> collector stats: [PS Scavenge, GC TIME (ms): 689, GC COUNT: 4], [PS
> MarkSweep, GC TIME (ms): 974, GC COUNT: 1]
> 17:35:30,418 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Memory
> usage stats: [HEAP: 614/1917/1917 MB, NON HEAP: 57/59/-1 MB
> (used/committed/max)]
> 17:35:30,418 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Off-heap
> pool stats: [Code Cache: 18/19/240 MB (used/committed/max)], [Metaspace:
> 34/34/-1 MB (used/committed/max)], [Compressed Class Space: 4/4/1024 MB
> (used/committed/max)]
> 17:35:30,418 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Garbage
> collector stats: [PS Scavenge, GC TIME (ms): 689, GC COUNT: 4], [PS
> MarkSweep, GC TIME (ms): 974, GC COUNT: 1]
> 17:35:31,418 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Memory
> usage stats: [HEAP: 634/1917/1917 MB, NON HEAP: 57/59/-1 MB
> (used/committed/max)]
> 17:35:31,418 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Off-heap
> pool stats: [Code Cache: 19/19/240 MB (used/committed/max)], [Metaspace:
> 34/34/-1 MB (used/committed/max)], [Compressed Class Space: 4/4/1024 MB
> (used/committed/max)]
> 17:35:31,419 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Garbage
> collector stats: [PS Scavenge, GC TIME (ms): 689, GC COUNT: 4], [PS
> MarkSweep, GC TIME (ms): 974, GC COUNT: 1]
> 17:35:32,419 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Memory
> usage stats: [HEAP: 638/1917/1917 MB, NON HEAP: 57/59/-1 MB
> (used/committed/max)]
> 17:35:32,419 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Off-heap
> pool stats: [Code Cache: 19/19/240 MB (used/committed/max)], [Metaspace:
> 34/34/-1 MB (used/committed/max)], [Compressed Class Space: 4/4/1024 MB
> (used/committed/max)]
> 17:35:32,419 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Garbage
> collector stats: [PS Scavenge, GC TIME (ms): 689, GC COUNT: 4], [PS
> MarkSweep, GC TIME (ms): 974, GC COUNT: 1]
> 17:35:33,487 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Memory
> usage stats: [HEAP: 648/1917/1917 MB, NON HEAP: 58/59/-1 MB
> (used/committed/max)]
> 17:35:33,494 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Off-heap
> pool stats: [Code Cache: 19/19/240 MB (used/committed/max)], [Metaspace:
> 34/34/-1 MB (used/committed/max)], [Compressed Class Space: 4/4/1024 MB
> (used/committed/max)]
> 17:35:33,522 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Garbage
> collector stats: [PS Scavenge, GC TIME (ms): 689, GC COUNT: 4], [PS
> MarkSweep, GC TIME (ms): 974, GC COUNT: 1]
> 17:35:34,523 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Memory
> usage stats: [HEAP: 662/1917/1917 MB, NON HEAP: 58/59/-1 MB
> (used/committed/max)]
> 17:35:34,523 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Off-heap
> pool stats: [Code Cache: 19/19/240 MB (used/committed/max)], [Metaspace:
> 34/34/-1 MB (used/committed/max)], [Compressed Class Space: 4/4/1024 MB
> (used/committed/max)]
> 17:35:34,523 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Garbage
> collector stats: [PS Scavenge, GC TIME (ms): 689, GC COUNT: 4], [PS
> MarkSweep, GC TIME (ms): 974, GC COUNT: 1]
> 17:35:35,523 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Memory
> usage stats: [HEAP: 670/1917/1917 MB, NON HEAP: 58/59/-1 MB
> (used/committed/max)]
> 17:35:35,524 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Off-heap
> pool stats: [Code Cache: 19/19/240 MB (used/committed/max)], [Metaspace:
> 34/34/-1 MB (used/committed/max)], [Compressed Class Space: 4/4/1024 MB
> (used/committed/max)]
> 17:35:35,524 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Garbage
> collector stats: [PS Scavenge, GC TIME (ms): 689, GC COUNT: 4], [PS
> MarkSweep, GC TIME (ms): 974, GC COUNT: 1]
> 17:35:36,525 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Memory
> usage stats: [HEAP: 717/1917/1917 MB, NON HEAP: 58/59/-1 MB
> (used/committed/max)]
> 17:35:36,525 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Off-heap
> pool stats: [Code Cache: 19/19/240 MB (used/committed/max)], [Metaspace:
> 34/34/-1 MB (used/committed/max)], [Compressed Class Space: 4/4/1024 MB
> (used/committed/max)]
> 17:35:36,525 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Garbage
> collector stats: [PS Scavenge, GC TIME (ms): 689, GC COUNT: 4], [PS
> MarkSweep, GC TIME (ms): 974, GC COUNT: 1]
> 17:35:37,525 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Memory
> usage stats: [HEAP: 737/1917/1917 MB, NON HEAP: 58/59/-1 MB
> (used/committed/max)]
> 17:35:37,525 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Off-heap
> pool stats: [Code Cache: 19/19/240 MB (used/committed/max)], [Metaspace:
> 34/34/-1 MB (used/committed/max)], [Compressed Class Space: 4/4/1024 MB
> (used/committed/max)]
> 17:35:37,525 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Garbage
> collector stats: [PS Scavenge, GC TIME (ms): 689, GC COUNT: 4], [PS
> MarkSweep, GC TIME (ms): 974, GC COUNT: 1]
> 17:35:38,525 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Memory
> usage stats: [HEAP: 747/1917/1917 MB, NON HEAP: 58/59/-1 MB
> (used/committed/max)]
> 17:35:38,525 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Off-heap
> pool stats: [Code Cache: 19/20/240 MB (used/committed/max)], [Metaspace:
> 34/34/-1 MB (used/committed/max)], [Compressed Class Space: 4/4/1024 MB
> (used/committed/max)]
> 17:35:38,525 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Garbage
> collector stats: [PS Scavenge, GC TIME (ms): 689, GC COUNT: 4], [PS
> MarkSweep, GC TIME (ms): 974, GC COUNT: 1]
> 17:35:39,526 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Memory
> usage stats: [HEAP: 817/1917/1917 MB, NON HEAP: 58/59/-1 MB
> (used/committed/max)]
> 17:35:39,526 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Off-heap
> pool stats: [Code Cache: 19/20/240 MB (used/committed/max)], [Metaspace:
> 34/34/-1 MB (used/committed/max)], [Compressed Class Space: 4/4/1024 MB
> (used/committed/max)]
> 17:35:39,526 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Garbage
> collector stats: [PS Scavenge, GC TIME (ms): 689, GC COUNT: 4], [PS
> MarkSweep, GC TIME (ms): 974, GC COUNT: 1]
> 17:35:40,526 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Memory
> usage stats: [HEAP: 832/1917/1917 MB, NON HEAP: 58/59/-1 MB
> (used/committed/max)]
> 17:35:40,526 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Off-heap
> pool stats: [Code Cache: 19/20/240 MB (used/committed/max)], [Metaspace:
> 34/34/-1 MB (used/committed/max)], [Compressed Class Space: 4/4/1024 MB
> (used/committed/max)]
> 17:35:40,526 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Garbage
> collector stats: [PS Scavenge, GC TIME (ms): 689, GC COUNT: 4], [PS
> MarkSweep, GC TIME (ms): 974, GC COUNT: 1]
> 17:35:41,527 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Memory
> usage stats: [HEAP: 840/1917/1917 MB, NON HEAP: 58/59/-1 MB
> (used/committed/max)]
> 17:35:41,527 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Off-heap
> pool stats: [Code Cache: 19/20/240 MB (used/committed/max)], [Metaspace:
> 34/34/-1 MB (used/committed/max)], [Compressed Class Space: 4/4/1024 MB
> (used/committed/max)]
> 17:35:41,527 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Garbage
> collector stats: [PS Scavenge, GC TIME (ms): 689, GC COUNT: 4], [PS
> MarkSweep, GC TIME (ms): 974, GC COUNT: 1]
> 17:35:42,527 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Memory
> usage stats: [HEAP: 847/1917/1917 MB, NON HEAP: 58/59/-1 MB
> (used/committed/max)]
> 17:35:42,527 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Off-heap
> pool stats: [Code Cache: 19/20/240 MB (used/committed/max)], [Metaspace:
> 34/34/-1 MB (used/committed/max)], [Compressed Class Space: 4/4/1024 MB
> (used/committed/max)]
> 17:35:42,527 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Garbage
> collector stats: [PS Scavenge, GC TIME (ms): 689, GC COUNT: 4], [PS
> MarkSweep, GC TIME (ms): 974, GC COUNT: 1]
> 17:35:43,599 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Memory
> usage stats: [HEAP: 450/1917/1917 MB, NON HEAP: 58/59/-1 MB
> (used/committed/max)]
> 17:35:43,599 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Off-heap
> pool stats: [Code Cache: 19/20/240 MB (used/committed/max)], [Metaspace:
> 34/34/-1 MB (used/committed/max)], [Compressed Class Space: 4/4/1024 MB
> (used/committed/max)]
> 17:35:43,599 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Garbage
> collector stats: [PS Scavenge, GC TIME (ms): 797, GC COUNT: 5], [PS
> MarkSweep, GC TIME (ms): 974, GC COUNT: 1]
> 17:35:44,599 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Memory
> usage stats: [HEAP: 508/1917/1917 MB, NON HEAP: 58/59/-1 MB
> (used/committed/max)]
> 17:35:44,599 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Off-heap
> pool stats: [Code Cache: 19/20/240 MB (used/committed/max)], [Metaspace:
> 34/34/-1 MB (used/committed/max)], [Compressed Class Space: 4/4/1024 MB
> (used/committed/max)]
> 17:35:44,599 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Garbage
> collector stats: [PS Scavenge, GC TIME (ms): 797, GC COUNT: 5], [PS
> MarkSweep, GC TIME (ms): 974, GC COUNT: 1]
> 17:35:45,600 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Memory
> usage stats: [HEAP: 517/1917/1917 MB, NON HEAP: 58/59/-1 MB
> (used/committed/max)]
> 17:35:45,600 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Off-heap
> pool stats: [Code Cache: 19/20/240 MB (used/committed/max)], [Metaspace:
> 34/34/-1 MB (used/committed/max)], [Compressed Class Space: 4/4/1024 MB
> (used/committed/max)]
> 17:35:45,600 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Garbage
> collector stats: [PS Scavenge, GC TIME (ms): 797, GC COUNT: 5], [PS
> MarkSweep, GC TIME (ms): 974, GC COUNT: 1]
> 17:35:46,600 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Memory
> usage stats: [HEAP: 528/1917/1917 MB, NON HEAP: 58/59/-1 MB
> (used/committed/max)]
> 17:35:46,600 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Off-heap
> pool stats: [Code Cache: 19/20/240 MB (used/committed/max)], [Metaspace:
> 34/34/-1 MB (used/committed/max)], [Compressed Class Space: 4/4/1024 MB
> (used/committed/max)]
> 17:35:46,600 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Garbage
> collector stats: [PS Scavenge, GC TIME (ms): 797, GC COUNT: 5], [PS
> MarkSweep, GC TIME (ms): 974, GC COUNT: 1]
> 17:35:47,663 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Memory
> usage stats: [HEAP: 541/1917/1917 MB, NON HEAP: 58/59/-1 MB
> (used/committed/max)]
> 17:35:47,664 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Off-heap
> pool stats: [Code Cache: 19/20/240 MB (used/committed/max)], [Metaspace:
> 34/34/-1 MB (used/committed/max)], [Compressed Class Space: 4/4/1024 MB
> (used/committed/max)]
> 17:35:47,664 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Garbage
> collector stats: [PS Scavenge, GC TIME (ms): 797, GC COUNT: 5], [PS
> MarkSweep, GC TIME (ms): 974, GC COUNT: 1]
> 17:35:48,791 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Memory
> usage stats: [HEAP: 554/1917/1917 MB, NON HEAP: 58/59/-1 MB
> (used/committed/max)]
> 17:35:48,791 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Off-heap
> pool stats: [Code Cache: 19/20/240 MB (used/committed/max)], [Metaspace:
> 34/34/-1 MB (used/committed/max)], [Compressed Class Space: 4/4/1024 MB
> (used/committed/max)]
> 17:35:48,791 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Garbage
> collector stats: [PS Scavenge, GC TIME (ms): 797, GC COUNT: 5], [PS
> MarkSweep, GC TIME (ms): 974, GC COUNT: 1]
> 17:35:49,794 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Memory
> usage stats: [HEAP: 562/1917/1917 MB, NON HEAP: 58/59/-1 MB
> (used/committed/max)]
> 17:35:49,795 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Off-heap
> pool stats: [Code Cache: 19/20/240 MB (used/committed/max)], [Metaspace:
> 34/34/-1 MB (used/committed/max)], [Compressed Class Space: 4/4/1024 MB
> (used/committed/max)]
> 17:35:49,795 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Garbage
> collector stats: [PS Scavenge, GC TIME (ms): 797, GC COUNT: 5], [PS
> MarkSweep, GC TIME (ms): 974, GC COUNT: 1]
> 17:35:50,795 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Memory
> usage stats: [HEAP: 569/1917/1917 MB, NON HEAP: 58/59/-1 MB
> (used/committed/max)]
> 17:35:50,795 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Off-heap
> pool stats: [Code Cache: 19/20/240 MB (used/committed/max)], [Metaspace:
> 34/34/-1 MB (used/committed/max)], [Compressed Class Space: 4/4/1024 MB
> (used/committed/max)]
> 17:35:50,795 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Garbage
> collector stats: [PS Scavenge, GC TIME (ms): 797, GC COUNT: 5], [PS
> MarkSweep, GC TIME (ms): 974, GC COUNT: 1]
> 17:35:51,795 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Memory
> usage stats: [HEAP: 582/1917/1917 MB, NON HEAP: 58/59/-1 MB
> (used/committed/max)]
> 17:35:51,795 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Off-heap
> pool stats: [Code Cache: 19/20/240 MB (used/committed/max)], [Metaspace:
> 34/34/-1 MB (used/committed/max)], [Compressed Class Space: 4/4/1024 MB
> (used/committed/max)]
> 17:35:51,795 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Garbage
> collector stats: [PS Scavenge, GC TIME (ms): 797, GC COUNT: 5], [PS
> MarkSweep, GC TIME (ms): 974, GC COUNT: 1]
> 17:35:52,796 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Memory
> usage stats: [HEAP: 593/1917/1917 MB, NON HEAP: 58/59/-1 MB
> (used/committed/max)]
> 17:35:52,796 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Off-heap
> pool stats: [Code Cache: 19/20/240 MB (used/committed/max)], [Metaspace:
> 34/34/-1 MB (used/committed/max)], [Compressed Class Space: 4/4/1024 MB
> (used/committed/max)]
> 17:35:52,796 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Garbage
> collector stats: [PS Scavenge, GC TIME (ms): 797, GC COUNT: 5], [PS
> MarkSweep, GC TIME (ms): 974, GC COUNT: 1]
> 17:35:53,796 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Memory
> usage stats: [HEAP: 600/1917/1917 MB, NON HEAP: 58/59/-1 MB
> (used/committed/max)]
> 17:35:53,796 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Off-heap
> pool stats: [Code Cache: 19/20/240 MB (used/committed/max)], [Metaspace:
> 34/34/-1 MB (used/committed/max)], [Compressed Class Space: 4/4/1024 MB
> (used/committed/max)]
> 17:35:53,796 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Garbage
> collector stats: [PS Scavenge, GC TIME (ms): 797, GC COUNT: 5], [PS
> MarkSweep, GC TIME (ms): 974, GC COUNT: 1]
> 17:35:54,797 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Memory
> usage stats: [HEAP: 604/1917/1917 MB, NON HEAP: 58/59/-1 MB
> (used/committed/max)]
> 17:35:54,797 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Off-heap
> pool stats: [Code Cache: 19/20/240 MB (used/committed/max)], [Metaspace:
> 34/34/-1 MB (used/committed/max)], [Compressed Class Space: 4/4/1024 MB
> (used/committed/max)]
> 17:35:54,797 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Garbage
> collector stats: [PS Scavenge, GC TIME (ms): 797, GC COUNT: 5], [PS
> MarkSweep, GC TIME (ms): 974, GC COUNT: 1]
> 17:35:55,797 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Memory
> usage stats: [HEAP: 610/1917/1917 MB, NON HEAP: 58/59/-1 MB
> (used/committed/max)]
> 17:35:55,797 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Off-heap
> pool stats: [Code Cache: 19/20/240 MB (used/committed/max)], [Metaspace:
> 34/34/-1 MB (used/committed/max)], [Compressed Class Space: 4/4/1024 MB
> (used/committed/max)]
> 17:35:55,797 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Garbage
> collector stats: [PS Scavenge, GC TIME (ms): 797, GC COUNT: 5], [PS
> MarkSweep, GC TIME (ms): 974, GC COUNT: 1]
> 17:35:56,797 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Memory
> usage stats: [HEAP: 615/1917/1917 MB, NON HEAP: 58/59/-1 MB
> (used/committed/max)]
> 17:35:56,798 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Off-heap
> pool stats: [Code Cache: 19/20/240 MB (used/committed/max)], [Metaspace:
> 34/34/-1 MB (used/committed/max)], [Compressed Class Space: 4/4/1024 MB
> (used/committed/max)]
> 17:35:56,798 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Garbage
> collector stats: [PS Scavenge, GC TIME (ms): 797, GC COUNT: 5], [PS
> MarkSweep, GC TIME (ms): 974, GC COUNT: 1]
> 17:35:57,798 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Memory
> usage stats: [HEAP: 624/1917/1917 MB, NON HEAP: 58/59/-1 MB
> (used/committed/max)]
> 17:35:57,798 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Off-heap
> pool stats: [Code Cache: 19/20/240 MB (used/committed/max)], [Metaspace:
> 34/34/-1 MB (used/committed/max)], [Compressed Class Space: 4/4/1024 MB
> (used/committed/max)]
> 17:35:57,798 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Garbage
> collector stats: [PS Scavenge, GC TIME (ms): 797, GC COUNT: 5], [PS
> MarkSweep, GC TIME (ms): 974, GC COUNT: 1]
> 17:35:58,798 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Memory
> usage stats: [HEAP: 636/1917/1917 MB, NON HEAP: 58/59/-1 MB
> (used/committed/max)]
> 17:35:58,798 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Off-heap
> pool stats: [Code Cache: 19/20/240 MB (used/committed/max)], [Metaspace:
> 34/34/-1 MB (used/committed/max)], [Compressed Class Space: 4/4/1024 MB
> (used/committed/max)]
> 17:35:58,798 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Garbage
> collector stats: [PS Scavenge, GC TIME (ms): 797, GC COUNT: 5], [PS
> MarkSweep, GC TIME (ms): 974, GC COUNT: 1]
> 17:35:59,799 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Memory
> usage stats: [HEAP: 641/1917/1917 MB, NON HEAP: 58/59/-1 MB
> (used/committed/max)]
> 17:35:59,799 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Off-heap
> pool stats: [Code Cache: 19/20/240 MB (used/committed/max)], [Metaspace:
> 34/34/-1 MB (used/committed/max)], [Compressed Class Space: 4/4/1024 MB
> (used/committed/max)]
> 17:35:59,799 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Garbage
> collector stats: [PS Scavenge, GC TIME (ms): 797, GC COUNT: 5], [PS
> MarkSweep, GC TIME (ms): 974, GC COUNT: 1]
> 17:36:00,799 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Memory
> usage stats: [HEAP: 648/1917/1917 MB, NON HEAP: 58/59/-1 MB
> (used/committed/max)]
> 17:36:00,799 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Off-heap
> pool stats: [Code Cache: 19/20/240 MB (used/committed/max)], [Metaspace:
> 34/34/-1 MB (used/committed/max)], [Compressed Class Space: 4/4/1024 MB
> (used/committed/max)]
> 17:36:00,799 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Garbage
> collector stats: [PS Scavenge, GC TIME (ms): 797, GC COUNT: 5], [PS
> MarkSweep, GC TIME (ms): 974, GC COUNT: 1]
> 17:36:01,821 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Memory
> usage stats: [HEAP: 655/1917/1917 MB, NON HEAP: 58/60/-1 MB
> (used/committed/max)]
> 17:36:01,936 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Off-heap
> pool stats: [Code Cache: 19/20/240 MB (used/committed/max)], [Metaspace:
> 34/35/-1 MB (used/committed/max)], [Compressed Class Space: 4/4/1024 MB
> (used/committed/max)]
> 17:36:01,936 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Garbage
> collector stats: [PS Scavenge, GC TIME (ms): 797, GC COUNT: 5], [PS
> MarkSweep, GC TIME (ms): 974, GC COUNT: 1]
> 17:36:02,937 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Memory
> usage stats: [HEAP: 665/1917/1917 MB, NON HEAP: 58/60/-1 MB
> (used/committed/max)]
> 17:36:02,937 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Off-heap
> pool stats: [Code Cache: 19/20/240 MB (used/committed/max)], [Metaspace:
> 34/35/-1 MB (used/committed/max)], [Compressed Class Space: 4/4/1024 MB
> (used/committed/max)]
> 17:36:02,937 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Garbage
> collector stats: [PS Scavenge, GC TIME (ms): 797, GC COUNT: 5], [PS
> MarkSweep, GC TIME (ms): 974, GC COUNT: 1]
> 17:36:03,944 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Memory
> usage stats: [HEAP: 666/1917/1917 MB, NON HEAP: 58/60/-1 MB
> (used/committed/max)]
> 17:36:03,950 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Off-heap
> pool stats: [Code Cache: 19/20/240 MB (used/committed/max)], [Metaspace:
> 34/35/-1 MB (used/committed/max)], [Compressed Class Space: 4/4/1024 MB
> (used/committed/max)]
> 17:36:03,951 INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Garbage
> collector stats: [PS Scavenge, GC TIME (ms): 797, GC COUNT: 5], [PS
> MarkSweep, GC TIME (ms): 974, GC COUNT: 1]
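
The heap samples above climb from 615 MB to 666 MB against a 1917 MB maximum, so the heap does not look exhausted in this window. To check that across a whole log rather than by eye, here is a minimal shell sketch. The function names heap_samples and heap_peak_percent are hypothetical, and the sketch assumes each sample keeps its "[HEAP: used/committed/max MB" fragment on a single line, as in the excerpt.

```shell
# Sketch only: summarize the "Memory usage stats" samples of a TaskManager log.

# Read log text on stdin and print "used max" (in MB) for each heap sample.
# The leading "[" keeps the "NON HEAP" figures from matching.
heap_samples() {
    sed -n 's/.*\[HEAP: \([0-9][0-9]*\)\/\([0-9][0-9]*\)\/\([0-9][0-9]*\) MB.*/\1 \3/p'
}

# Print the peak heap usage as a whole-number percentage of the maximum heap.
heap_peak_percent() {
    heap_samples | awk '
        $1 > peak { peak = $1; max = $2 }
        END { if (max > 0) printf "%d\n", peak * 100 / max }'
}
```

Usage would be something like `heap_peak_percent < taskmanager.log`, where taskmanager.log stands for whichever TaskManager log file is being inspected. A low percentage right before the process disappears would point away from heap pressure, though it says nothing about off-heap or OS-level limits.
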
>
> On Thu, Oct 29, 2015 at 1:55 PM, Till Rohrmann <trohrm...@apache.org> wrote:
>
>> What does the log of the failed TaskManager 10.0.88.140 say?
>>
>> On Thu, Oct 29, 2015 at 6:44 PM, Greg Hogan <c...@greghogan.com> wrote:
>>
>>> I removed the use of numactl but left in starting two TaskManagers and am
>>> still seeing TaskManagers crash.
>>> From the JobManager log:
>>>
>>> 17:36:06,412 WARN
>>> akka.remote.ReliableDeliverySupervisor - Association
>>> with remote system [akka.tcp://flink@10.0.88.140:45742] has failed,
>>> address is now gated for [5000] ms. Reason is: [Disassociated].
>>> 17:36:06,567 INFO
>>> org.apache.flink.runtime.executiongraph.ExecutionGraph - CHAIN
>>> GroupReduce (Compute scores) -> FlatMap (checksum()) (370/2322)
>>> (cac9927a8568c2ad79439262a91478af) switched from RUNNING to FAILED
>>> 17:36:06,572 INFO
>>> org.apache.flink.runtime.jobmanager.JobManager - Status of
>>> job 14d946015fd7b35eb801ea6fee5af9e4 (Flink Java Job at Thu Oct 29 17:34:48
>>> UTC 2015) changed to FAILING.
>>> java.lang.Exception: The data preparation for task 'CHAIN GroupReduce
>>> (Compute scores) -> FlatMap (checksum())' , caused an error: Error
>>> obtaining the sorted input: Thread 'SortMerger Reading Thread' terminated
>>> due to an exception: Connection unexpectedly closed by remote task
>>> manager 'ip-10-0-88-140/10.0.88.140:58558'. This might indicate that the
>>> remote task manager was lost.
>>>     at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:465)
>>>     at org.apache.flink.runtime.operators.BatchTask.invoke(BatchTask.java:354)
>>>     at org.apache.flink.runtime.taskmanager.Task.run(Task.java:584)
>>>     at java.lang.Thread.run(Thread.java:745)
>>> Caused by: java.lang.RuntimeException: Error obtaining the sorted input:
>>> Thread 'SortMerger Reading Thread' terminated due to an exception:
>>> Connection unexpectedly closed by remote task manager
>>> 'ip-10-0-88-140/10.0.88.140:58558'. This might indicate that the remote
>>> task manager was lost.
>>>     at org.apache.flink.runtime.operators.sort.UnilateralSortMerger.getIterator(UnilateralSortMerger.java:619)
>>>     at org.apache.flink.runtime.operators.BatchTask.getInput(BatchTask.java:1089)
>>>     at org.apache.flink.runtime.operators.GroupReduceDriver.prepare(GroupReduceDriver.java:94)
>>>     at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:459)
>>>     ... 3 more
>>> Caused by: java.io.IOException: Thread 'SortMerger Reading Thread'
>>> terminated due to an exception: Connection unexpectedly closed by remote
>>> task manager 'ip-10-0-88-140/10.0.88.140:58558'. This might indicate that
>>> the remote task manager was lost.
>>>     at org.apache.flink.runtime.operators.sort.UnilateralSortMerger$ThreadBase.run(UnilateralSortMerger.java:800)
>>> Caused by: org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException:
>>> Connection unexpectedly closed by remote task manager
>>> 'ip-10-0-88-140/10.0.88.140:58558'. This might indicate that the remote
>>> task manager was lost.
>>>     at org.apache.flink.runtime.io.network.netty.PartitionRequestClientHandler.channelInactive(PartitionRequestClientHandler.java:119)
>>>     at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:208)
>>>     at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:194)
>>>     at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
>>>     at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:208)
>>>     at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:194)
>>>     at io.netty.handler.codec.ByteToMessageDecoder.channelInactive(ByteToMessageDecoder.java:306)
>>>     at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:208)
>>>     at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:194)
>>>     at io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:828)
>>>     at io.netty.channel.AbstractChannel$AbstractUnsafe$7.run(AbstractChannel.java:621)
>>>     at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:358)
>>>     at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
>>>     at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:112)
>>>     at java.lang.Thread.run(Thread.java:745)
>>> 17:36:06,587 INFO
>>> org.apache.flink.runtime.executiongraph.ExecutionGraph - CHAIN
>>> GroupReduce (Compute scores) -> FlatMap (checksum()) (367/2322)
>>> (d63c681a18b8164bc24936df1ecb159b) switched from RUNNING to FAILED
>>>
>>>
>>> On Thu, Oct 29, 2015 at 1:00 PM, Stephan Ewen <se...@apache.org> wrote:
>>>
>>>> Hi Greg!
>>>>
>>>> Interesting... When you say the TaskManagers are dropping, are the
>>>> TaskManager processes crashing, or are they losing their connection to
>>>> the JobManager?
>>>>
>>>> Greetings,
>>>> Stephan
>>>>
>>>>
>>>> On Thu, Oct 29, 2015 at 9:56 AM, Greg Hogan <c...@greghogan.com> wrote:
>>>>
>>>>> I recently discovered that AWS uses NUMA for its largest nodes. An
>>>>> example c4.8xlarge:
>>>>>
>>>>> $ numactl --hardware
>>>>> available: 2 nodes (0-1)
>>>>> node 0 cpus: 0 1 2 3 4 5 6 7 8 18 19 20 21 22 23 24 25 26
>>>>> node 0 size: 29813 MB
>>>>> node 0 free: 24537 MB
>>>>> node 1 cpus: 9 10 11 12 13 14 15 16 17 27 28 29 30 31 32 33 34 35
>>>>> node 1 size: 30574 MB
>>>>> node 1 free: 22757 MB
>>>>> node distances:
>>>>> node   0   1
>>>>>   0:  10  20
>>>>>   1:  20  10
>>>>>
>>>>> I discovered yesterday that Flink performed ~20-30% faster on large
>>>>> datasets by running two NUMA-constrained TaskManagers per node. The
>>>>> JobManager node ran a single TaskManager. Resources were divided in
>>>>> half relative to running a single TaskManager.
>>>>>
>>>>> The changes from the tail of /bin/taskmanager.sh:
>>>>>
>>>>> -"${FLINK_BIN_DIR}"/flink-daemon.sh $STARTSTOP taskmanager "${args[@]}"
>>>>> +numactl --membind=0 --cpunodebind=0 "${FLINK_BIN_DIR}"/flink-daemon.sh $STARTSTOP taskmanager "${args[@]}"
>>>>> +numactl --membind=1 --cpunodebind=1 "${FLINK_BIN_DIR}"/flink-daemon.sh $STARTSTOP taskmanager "${args[@]}"
>>>>>
>>>>> After reverting this change the system is again stable. I had not
>>>>> experienced issues using numactl when running 16 nodes.
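
The two hard-coded numactl lines in the diff above could be generalized to however many NUMA nodes a machine reports. The following is a rough sketch and has not been tested against Flink's scripts. The function names numa_node_count and numa_launch_cmds are hypothetical, the sketch only prints the commands (a dry run), and it assumes the "available: N nodes" line that `numactl --hardware` prints in the output quoted above.

```shell
# Sketch only: build one NUMA-bound launch command per node.

# Read `numactl --hardware` output on stdin and print the node count,
# taken from the "available: N nodes (...)" line.
numa_node_count() {
    awk '/^available:/ { print $2; exit }'
}

# Print one command per NUMA node; nothing is executed.
# $1 = node count, remaining arguments = the command to wrap.
# The function body runs in a subshell so its variables stay private.
numa_launch_cmds() (
    nodes=$1
    shift
    node=0
    while [ "$node" -lt "$nodes" ]; do
        echo "numactl --membind=${node} --cpunodebind=${node} $*"
        node=$((node + 1))
    done
)
```

For instance, `numa_launch_cmds "$(numactl --hardware | numa_node_count)" flink-daemon.sh start taskmanager` would print one line per node, which can be reviewed before running anything. Memory and slot settings would still need dividing by the node count by hand, as described above.
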
>>>>>
>>>>> Greg
>>>>>
>>>>
>>>
>>