Re: Diagnosing TaskManager disappearance

2015-12-14 Thread Stephan Ewen
Hi! The Netty memory usually stays well below 2*network memory (that is the theoretical maximum). Netty needs memory the size of two buffers on the sender and receiver side, per TCP connection. Since Flink usually multiplexes many channels (that need network buffers) through the same TCP connection, the
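
As a rough illustration of the per-connection figure above, the arithmetic can be sketched as follows. This reads the statement as two buffers on each side of a connection; the 32 KiB buffer size and the count of 63 connections are assumptions chosen for the example, not values given in the thread, and `netty_bytes_per_side` is a hypothetical helper name.

```python
def netty_bytes_per_side(tcp_connections: int, buffer_bytes: int) -> int:
    """Estimate Netty memory on one side (sender or receiver).

    Assumes two buffers per TCP connection on each side, following the
    statement in the message above. This is an estimate, not a measurement.
    """
    buffers_per_connection = 2
    return tcp_connections * buffers_per_connection * buffer_bytes


# Hypothetical inputs: 63 peer connections, 32 KiB per buffer.
assumed_connections = 63
assumed_buffer_bytes = 32 * 1024

estimate = netty_bytes_per_side(assumed_connections, assumed_buffer_bytes)
print(f"{estimate} bytes, about {estimate / (1024 * 1024):.2f} MiB per side")
```

Because many channels share one TCP connection, the estimate scales with the number of connections rather than with the number of channels.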

Re: Diagnosing TaskManager disappearance

2015-12-12 Thread Greg Hogan
The TaskManagers were nixed by the OOM killer.

[63896.699500] Out of memory: Kill process 12892 (java) score 910 or sacrifice child
[63896.702018] Killed process 12892 (java) total-vm:47398740kB, anon-rss:28487812kB, file-rss:8kB

The cluster is comprised of AWS c4.8xlarge instances which have
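
For anyone checking many nodes for the same symptom, a minimal sketch of pulling the victim PID and resident memory out of kernel log text might look like this. The regular expression only targets the "Killed process" line format quoted above (other kernel versions may word it differently), and `find_oom_kills` is a hypothetical helper name.

```python
import re

# Matches the "Killed process" line in the format quoted in this message.
KILLED_RE = re.compile(
    r"Killed process (?P<pid>\d+) \((?P<name>[^)]+)\) "
    r"total-vm:(?P<total_vm>\d+)kB, anon-rss:(?P<anon_rss>\d+)kB"
)


def find_oom_kills(kernel_log: str):
    """Return one dict per OOM-killer victim found in the kernel log text."""
    kills = []
    for match in KILLED_RE.finditer(kernel_log):
        kills.append({
            "pid": int(match.group("pid")),
            "name": match.group("name"),
            "total_vm_kb": int(match.group("total_vm")),
            "anon_rss_kb": int(match.group("anon_rss")),
        })
    return kills


# The two kernel log lines from the message above.
sample = (
    "[63896.699500] Out of memory: Kill process 12892 (java) score 910 or sacrifice child\n"
    "[63896.702018] Killed process 12892 (java) total-vm:47398740kB, "
    "anon-rss:28487812kB, file-rss:8kB\n"
)

kills = find_oom_kills(sample)
for kill in kills:
    rss_gib = kill["anon_rss_kb"] / (1024 * 1024)
    print(f"pid {kill['pid']} ({kill['name']}): {rss_gib:.1f} GiB anonymous RSS")
```

The text passed in would typically be the output of `dmesg` saved to a file on each node.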

Re: Diagnosing TaskManager disappearance

2015-10-30 Thread Till Rohrmann
The logging of the TaskManager stops 3 seconds before the JobManager detects that the connection to the TaskManager has failed. If the clocks are remotely in sync and the TaskManager is still running, then we should also see logging statements for the time after the connection has failed. Therefore,

Re: Diagnosing TaskManager disappearance

2015-10-29 Thread Robert Metzger
So is the TaskManager JVM still running after the JM detected that the TM has gone? If not, can you check the kernel log (dmesg) to see whether the Linux OOM killer stopped the process? (If it's a kill, the JVM might not be able to log anything anymore.) On Thu, Oct 29, 2015 at 9:27 PM, Stephan Ewen w

Re: Diagnosing TaskManager disappearance

2015-10-29 Thread Stephan Ewen
Thanks for sharing the logs, Greg! Okay, so the TaskManager does not crash, but the Remote Failure Detector of Akka marks the connection between JobManager and TaskManager as broken. The TaskManager is not doing much GC, so it is not a long JVM freeze that causes heartbeats to time out... I am wo

Re: Diagnosing TaskManager disappearance

2015-10-29 Thread Aljoscha Krettek
Could it be a problem that there are two TaskManagers running per machine? > On 29 Oct 2015, at 19:04, Greg Hogan wrote: > > I have memory logging enabled. Tail of TaskManager log on 10.0.88.140: > > 17:35:26,415 INFO > org.apache.flink.runtime.taskmanager.TaskManager - Garbage > c

Re: Diagnosing TaskManager disappearance

2015-10-29 Thread Greg Hogan
I have memory logging enabled. Tail of TaskManager log on 10.0.88.140:

17:35:26,415 INFO org.apache.flink.runtime.taskmanager.TaskManager - Garbage collector stats: [PS Scavenge, GC TIME (ms): 341, GC COUNT: 3], [PS MarkSweep, GC TIME (ms): 974, GC COUNT: 1]
17:35:27,415 INFO org.apac

Re: Diagnosing TaskManager disappearance

2015-10-29 Thread Till Rohrmann
What does the log of the failed TaskManager 10.0.88.140 say? On Thu, Oct 29, 2015 at 6:44 PM, Greg Hogan wrote: > I removed the use of numactl but left in starting two TaskManagers and am > still seeing TaskManagers crash. > From the JobManager log: > > 17:36:06,412 WARN > akka.remote.ReliableDe

Re: Diagnosing TaskManager disappearance

2015-10-29 Thread Greg Hogan
I removed the use of numactl but left in starting two TaskManagers and am still seeing TaskManagers crash.

From the JobManager log:

17:36:06,412 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@10.0.88.140:45742] has failed, add

Re: Diagnosing TaskManager disappearance

2015-10-29 Thread Stephan Ewen
Hi Greg! Interesting... When you say the TaskManagers are dropping, are the TaskManager processes crashing, or are they losing connection to the JobManager? Greetings, Stephan On Thu, Oct 29, 2015 at 9:56 AM, Greg Hogan wrote: > I recently discovered that AWS uses NUMA for its largest nodes.

Re: Diagnosing TaskManager disappearance

2015-10-29 Thread Greg Hogan
I recently discovered that AWS uses NUMA for its largest nodes. An example c4.8xlarge:

$ numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 18 19 20 21 22 23 24 25 26
node 0 size: 29813 MB
node 0 free: 24537 MB
node 1 cpus: 9 10 11 12 13 14 15 16 17 27 28 29 30 31 32 33 34
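
To compare per-node memory against the heap given to each TaskManager, the `numactl --hardware` output can be summarized with a small sketch like the one below. It only reads the "node N size" and "node N free" lines shown above; the sample text is limited to the node 0 values visible in this message, and `parse_numa_memory` is a hypothetical helper name.

```python
import re

NODE_SIZE_RE = re.compile(r"node (\d+) size: (\d+) MB")
NODE_FREE_RE = re.compile(r"node (\d+) free: (\d+) MB")


def parse_numa_memory(numactl_output: str):
    """Map NUMA node id to its total and free memory in MB.

    A node whose "free" line is missing gets free_mb = None.
    """
    sizes = {int(node): int(mb) for node, mb in NODE_SIZE_RE.findall(numactl_output)}
    free = {int(node): int(mb) for node, mb in NODE_FREE_RE.findall(numactl_output)}
    return {
        node: {"size_mb": size_mb, "free_mb": free.get(node)}
        for node, size_mb in sizes.items()
    }


# Only the node 0 lines that appear in the message above.
sample = (
    "available: 2 nodes (0-1)\n"
    "node 0 cpus: 0 1 2 3 4 5 6 7 8 18 19 20 21 22 23 24 25 26\n"
    "node 0 size: 29813 MB\n"
    "node 0 free: 24537 MB\n"
)

numa_memory = parse_numa_memory(sample)
for node, info in sorted(numa_memory.items()):
    print(f"node {node}: {info['size_mb']} MB total, {info['free_mb']} MB free")
```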

Re: Diagnosing TaskManager disappearance

2015-10-29 Thread Maximilian Michels
Hi Greg, Thanks for reporting. You wrote you didn't see any output in the .out files of the task managers. What about the .log files of these instances? Where and when did you produce the thread dump you included? Thanks, Max On Thu, Oct 29, 2015 at 1:46 PM, Greg Hogan wrote: > I am testing a

Diagnosing TaskManager disappearance

2015-10-29 Thread Greg Hogan
I am testing again on a 64-node cluster (the JobManager is running fine now that I have reduced some operators' parallelism and fixed the string conversion performance). I am seeing TaskManagers drop like flies every other job or so. I am not seeing any output in the .out log files corresponding to the cra