Late response, but a common reason for disappearing TaskManagers is termination 
by the Linux out-of-memory (OOM) killer. The usual recommendation in that case 
is to decrease the memory allotted to the TaskManagers.
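
One way to check whether the OOM killer was involved is to look through the
kernel log on the host where the TaskManager vanished. Below is a minimal
sketch, assuming the kernel log text is available (for example from `dmesg`)
and that the kernel uses its usual "invoked oom-killer" / "Out of memory" /
"Killed process" wording. The function name and the pattern are my own for
illustration, not anything Flink provides.

```python
import re

# Phrases the Linux kernel typically logs when the OOM killer acts.
# Assumption: exact wording varies by kernel version, so the match is loose.
OOM_PATTERN = re.compile(r"oom-killer|out of memory|killed process",
                         re.IGNORECASE)


def find_oom_lines(log_lines):
    """Return the kernel log lines that look like OOM-killer activity."""
    hits = []
    for line in log_lines:
        if OOM_PATTERN.search(line):
            hits.append(line.rstrip("\n"))
    return hits


# Possible usage on the affected host (may need root to read the kernel log):
#   import subprocess
#   out = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
#   for hit in find_oom_lines(out.splitlines()):
#       print(hit)
```

If any of the matching lines name the TaskManager's java process, the OOM
killer is the likely culprit and lowering the allotted memory is worth trying.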


> On Sep 5, 2017, at 9:09 AM, ShB <shon.balakris...@gmail.com> wrote:
> 
> Hi, 
> 
> I'm running a Flink batch job that reads almost 1 TB of data from S3 and
> then performs operations on it. A list of filenames is distributed among
> the TMs and each TM reads its subset of files from S3. This job
> errors out at the read step due to the following error:
> java.lang.Exception: TaskManager was lost/killed
> 
> Having read similar questions on the mailing list, it seems like this is a
> memory issue, with full GC at the TM causing the TM to be lost. 
> 
> After enabling memory debugging this seems to be the stats just before
> erroring out:
> Memory usage stats: [HEAP: 8327/18704/18704 MB, NON HEAP: 79/81/-1 MB
> (used/committed/max)]
> Direct memory stats: Count: 5236, Total Capacity: 17148907, Used Memory:
> 17148908
> Off-heap pool stats: [Code Cache: 25/27/240 MB (used/committed/max)],
> [Metaspace: 47/48/-1 MB (used/committed/max)], [Compressed Class Space:
> 5/5/1024 MB (used/committed/max)]
> Garbage collector stats: [G1 Young Generation, GC TIME (ms): 16712, GC
> COUNT: 290], [G1 Old Generation, GC TIME (ms): 689, GC COUNT: 2]
> 
> I tried all of these suggested fixes: decreased taskmanager.memory.fraction
> to give more memory to user managed operations, increased the number of
> JVMs (parallelism), used the G1 GC for better GC performance, but my job
> still errors out.  
> 
> I increased akka.watch.heartbeat.pause, akka.watch.threshold,
> akka.watch.heartbeat.interval to prevent the timeout due to GC. But this
> doesn't help either. I figured with the really high values for death watch,
> the program would run really slowly and complete at some point but it fails
> anyway. 
> 
> I'm now trying to decrease object creation in my program, but so far it
> hasn't helped.
> 
> How can I go about debugging and fixing this problem?
> 
> Thank you. 