Re: Taskmanagers are quarantined

2017-12-07 Thread T Obi
Hello, Thank you for all the advice, and sorry for my late response. First, I made a small mistake. I set `env.java.opts.taskmanager` to enable GC logging, and this accidentally disabled the automatic setting of `UseG1GC`. This means the log I was looking at came from Parallel GC. When I enabled both GC log and `UseG1GC`
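
A minimal sketch of a `flink-conf.yaml` entry that keeps both GC logging and G1 once `env.java.opts.taskmanager` is set. The logging flags assume a Java 8 HotSpot JVM and the log path is hypothetical; neither is stated in the thread.

```yaml
# flink-conf.yaml (sketch; flags assume Java 8 HotSpot, log path is hypothetical)
# Setting this key suppresses the automatically added -XX:+UseG1GC,
# so G1 has to be requested explicitly next to the GC logging flags.
env.java.opts.taskmanager: "-XX:+UseG1GC -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/tmp/taskmanager-gc.log"
```

After restarting the taskmanagers, the first lines of the GC log should show which collector is actually in use.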

Re: Taskmanagers are quarantined

2017-11-29 Thread Stephan Ewen
We also saw issues in the failure detection/quarantining with some Hadoop versions because of a subtle runtime Netty version conflict. Flink 1.4 shades Flink's / Akka's Netty; in Flink 1.3 you may need to explicitly exclude the Netty dependency pulled in through Hadoop. Also, Hadoop version mismatc

Re: Taskmanagers are quarantined

2017-11-29 Thread Till Rohrmann
Hi,

you could also try increasing the heartbeat timeout via `akka.watch.heartbeat.pause`. Maybe this helps to overcome the GC pauses.

Cheers, Till

On Wed, Nov 29, 2017 at 12:41 PM, T Obi wrote:
> Warnings of Datanode appeared not in all cases of timeout. They seem to be raised just by timeou
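
As a rough illustration of the suggestion above, the setting goes into `flink-conf.yaml`. The value shown is an assumption chosen only because the thread reports full GC pauses of about a minute or more; it is not a recommendation from the thread.

```yaml
# flink-conf.yaml (sketch; the value is an illustrative assumption)
# Acceptable pause between heartbeats before the remote side is
# considered unreachable. It should exceed the longest expected GC pause.
akka.watch.heartbeat.pause: 120 s
```

A larger pause also delays detection of taskmanagers that have really failed, so it trades faster failure detection for tolerance of long GC pauses.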

Re: Taskmanagers are quarantined

2017-11-29 Thread T Obi
Datanode warnings did not appear in all cases of timeout. They seem to be raised only by timeouts while snapshotting. We enabled GC logs on the taskmanagers and found that something triggers System.gc() every hour. So a full GC runs every hour, and it takes about a minute or more in our case... When

Re: Taskmanagers are quarantined

2017-11-27 Thread T Obi
Hello Chesnay, Thank you for answering my rough question. Not all taskmanagers are quarantined at the same time, but each taskmanager has been quarantined at least once. We are using CDH 5.8, which is based on Hadoop 2.6. We had not paid attention to the datanodes; we will check them. However, we are also using t

Re: Taskmanagers are quarantined

2017-11-27 Thread Chesnay Schepler
Are only some taskmanagers quarantined, or all of them? Do the quarantined taskmanagers have anything in common? (Are the failing ones always on certain machines? Do the stack traces reference the same HDFS datanodes?) Which Hadoop version are you using? From the stack-trace it appears that mul