Late response, but a common reason for TaskManagers disappearing is that the Linux out-of-memory (OOM) killer terminated the process. If that is what happened here, the usual recommendation is to decrease the memory allotted to the TaskManager.
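One way to check whether the OOM killer was involved is to look for its report in the kernel log on the machine that hosted the lost TaskManager. The snippet below is only a minimal sketch: the function name `find_oom_kills` is made up for illustration, and it assumes the kernel writes a line containing "Killed process <pid> (<name>)", which is the usual wording but can vary between kernel versions.

```python
import re

# Assumed kernel log wording, e.g.
#   "Killed process 4321 (java) total-vm:..., anon-rss:..."
#   "Out of memory: Killed process 4321 (java) total-vm:..."
OOM_KILL = re.compile(r"Killed process (\d+) \(([^)]+)\)")


def find_oom_kills(kernel_log, process_name="java"):
    """Return the PIDs of processes named `process_name` that the OOM killer terminated.

    `kernel_log` is the kernel log as one string (hypothetical helper, not a Flink API).
    """
    pids = []
    for line in kernel_log.splitlines():
        match = OOM_KILL.search(line)
        # Keep only kill reports for the process name we care about.
        if match and match.group(2) == process_name:
            pids.append(int(match.group(1)))
    return pids
```

Feed it the output of `dmesg` captured on the worker node. If the returned list contains the PID of the TaskManager JVM, the process was killed by the kernel rather than lost to a GC-induced timeout.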

> On Sep 5, 2017, at 9:09 AM, ShB <shon.balakris...@gmail.com> wrote:
>
> Hi,
>
> I'm running a Flink batch job that reads almost 1 TB of data from S3 and
> then performs operations on it. A list of filenames are distributed among
> the TM's and each subset of files is read from S3 from each TM. This job
> errors out at the read step due to the following error:
> java.lang.Exception: TaskManager was lost/killed
>
> Having read similar questions on the mailing list, it seems like this is a
> memory issue, with full GC at the TM causing the TM to be lost.
>
> After enabling memory debugging this seems to be the stats just before
> erroring out:
> Memory usage stats: [HEAP: 8327/18704/18704 MB, NON HEAP: 79/81/-1 MB
> (used/committed/max)]
> Direct memory stats: Count: 5236, Total Capacity: 17148907, Used Memory:
> 17148908
> Off-heap pool stats: [Code Cache: 25/27/240 MB (used/committed/max)],
> [Metaspace: 47/48/-1 MB (used/committed/max)], [Compressed Class Space:
> 5/5/1024 MB (used/committed/max)]
> Garbage collector stats: [G1 Young Generation, GC TIME (ms): 16712, GC
> COUNT: 290], [G1 Old Generation, GC TIME (ms): 689, GC COUNT: 2]
>
> I tried all of these suggested fixes: decreased taskmanager.memory.fraction
> to give more memory to user managed operations, increased number of
> JVM's(parallelism), used the G1 GC for better GC performance, but my job
> still errors out.
>
> I increased akka.watch.heartbeat.pause, akka.watch.threshold,
> akka.watch.heartbeat.interval to prevent the timeout due to GC. But this
> doesn't help either. I figured with the really high values for death watch,
> the program would run really slowly and complete at some point but it fails
> anyway.
>
> I'm now trying to decrease object creation in my program, but so far it
> hasn't helped.
>
> How can I go about debugging and fixing this problem?
>
> Thank you.
>
> --
> Sent from:
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/