Hi,
Increasing the time to detect a dead TaskManager usually increases the
number of elements that need to be reprocessed after a failure.
Once a dead TaskManager is identified, the entire application is rolled
back to its latest successfully checkpointed (consistent) state. It is
therefore desirable to keep the detection time low so that the time to
catch up stays low. Fault tolerance guarantees should not be affected.
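To make the trade-off concrete, here is a rough back-of-the-envelope sketch in Python. It is a simplified model, not anything Flink computes: it assumes a constant input rate, assumes the failure happens just before the next checkpoint would complete, and the function and parameter names are made up for illustration. The detection time is whatever your heartbeat settings amount to in practice.

```python
def records_to_catch_up(input_rate_per_s, checkpoint_interval_s,
                        detection_time_s, restart_time_s=0):
    """Rough worst-case backlog after a TaskManager failure (simplified model)."""
    # Worst case: everything processed since the last completed checkpoint
    # is replayed after the rollback.
    replayed = input_rate_per_s * checkpoint_interval_s
    # While the failure is undetected and the job restarts, new input keeps
    # arriving at the sources and has to be worked off afterwards.
    accumulated = input_rate_per_s * (detection_time_s + restart_time_s)
    return replayed + accumulated


# Illustrative numbers only: 10,000 records/s, checkpoint every 60 s.
for detection_time_s in (60, 300):
    backlog = records_to_catch_up(10_000, 60, detection_time_s)
    print(f"detection time {detection_time_s} s -> about {backlog} records to catch up")
```

In this model the backlog grows linearly with the detection time, while the state restored from the checkpoint is the same either way.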
I hope this helps.
Regards,
Timo
On 15.05.18 at 01:42, Bajaj, Abhinav wrote:
Hi,
We are running into issues where GC pauses result in
TaskManagers being incorrectly marked as dead.
The Flink documentation
<https://ci.apache.org/projects/flink/flink-docs-release-1.4/ops/config.html#distributed-coordination-via-akka>
describes some Akka configuration knobs that can be tuned.
Focusing on /“akka.watch.heartbeat.pause”,/ it mentions /“Higher value
increases the time to detect a dead TaskManager”/
Can someone please help me understand the downside of increasing the
time to detect a dead TaskManager?
Will this affect the fault tolerance guarantees, state management, or
checkpointing?
Thanks,
Abhinav