Hello,
Thank you for all the advice. Sorry for my late response.
First, I made a small mistake.
I set `env.java.opts.taskmanager` to enable GC logging, and by
accident this cancelled the automatic setting of the `UseG1GC` option.
This means the logs I had been looking at were from the Parallel GC.
When I enabled both GC logging and `UseG1GC`
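For reference, the setting in flink-conf.yaml now looks roughly like
this (the log path is a placeholder, and the GC logging flags are the
JDK 8 ones):

    env.java.opts.taskmanager: -XX:+UseG1GC -Xloggc:/path/to/taskmanager-gc.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps

Specifying `-XX:+UseG1GC` explicitly keeps G1 enabled even though
setting `env.java.opts.taskmanager` overrides the default GC choice.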
We also saw issues in the failure detection/quarantining with some
Hadoop versions because of a subtle runtime Netty version conflict.
Flink 1.4 shades Flink's / Akka's Netty; in Flink 1.3 you may need to
explicitly exclude the Netty dependency pulled in through Hadoop.
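In a Maven project the exclusion would look roughly like the sketch
below; the Hadoop artifact and version are placeholders, and the exact
coordinates of the Netty it pulls in can differ between Hadoop versions:

    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>2.6.5</version>
        <exclusions>
            <exclusion>
                <!-- Netty 3.x shipped via Hadoop; can clash with Akka's Netty at runtime -->
                <groupId>io.netty</groupId>
                <artifactId>netty</artifactId>
            </exclusion>
        </exclusions>
    </dependency>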
Also, Hadoop version mismatch
Hi,
you could also try increasing the heartbeat timeout via
`akka.watch.heartbeat.pause`. Maybe this helps to overcome the GC pauses.
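For example, in flink-conf.yaml (the values are only an illustration;
pick a pause comfortably above the GC pauses you observe):

    akka.watch.heartbeat.interval: 10 s
    akka.watch.heartbeat.pause: 120 s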
Cheers,
Till
On Wed, Nov 29, 2017 at 12:41 PM, T Obi wrote:
> The Datanode warnings did not appear in every timeout case. They seem
> to be raised only by timeouts during snapshotting.
The Datanode warnings did not appear in every timeout case. They seem
to be raised only by timeouts during snapshotting.
We enabled GC logging on the taskmanagers and found that something
calls System.gc() every hour.
So a full GC runs every hour, and in our cases it takes about a minute
or more...
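If these turn out to be explicit System.gc() calls (hourly calls often
come from RMI's distributed GC, whose default interval is one hour),
the usual JVM-side workarounds would be along these lines; a sketch we
have not yet verified in our setup (GC logging flags omitted):

    env.java.opts.taskmanager: -XX:+UseG1GC -XX:+ExplicitGCInvokesConcurrent
    # or, to ignore explicit GC calls entirely (note that direct-memory
    # cleanup in some libraries relies on System.gc()):
    env.java.opts.taskmanager: -XX:+UseG1GC -XX:+DisableExplicitGC

With G1 enabled, -XX:+ExplicitGCInvokesConcurrent turns an explicit
System.gc() into a concurrent cycle instead of a stop-the-world full GC.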
When
Hello Chesnay,
Thank you for answering my rough question.
Not all of the taskmanagers are quarantined at a time, but each
taskmanager has been quarantined at least once.
We are using CDH 5.8, which is based on Hadoop 2.6.
We hadn't paid attention to the datanodes. We will check them.
However, we are also using t
Are only some taskmanagers quarantined, or all of them?
Do the quarantined taskmanagers have anything in common?
(Are the failing ones always on certain machines? Do the stack traces
reference the same HDFS datanodes?)
Which Hadoop version are you using?
From the stack trace it appears that mul