Thanks Piotr and Stefan,
The problem was the overhead in the heap memory usage of the JobManager
when increasing the num-retained checkpoints. It was solved once I reverted
that value to one.
BR
That's the actual error according to the JobManager log in the OOM:
2018-01-08 22:27:09,293 WARN
org.j
Hi,
This Task Manager log suggests that the problem lies on the Job Manager side
(no visible gap in the logs, and the reported GC time is cumulative: 31 seconds
accumulated over 963 GC collections is a low value). Could you show the Job
Manager log itself? Probably it’s the one that’s causing the T
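
For reference, the "low value" remark can be checked with quick arithmetic. This is only a rough sketch using the two figures quoted above (31 seconds of accumulated GC time, 963 collections); the variable names are made up for illustration and nothing here is read from the actual logs.

```python
# Figures quoted in the message above (accumulated GC time and GC count).
total_gc_seconds = 31
gc_collections = 963

# Average time spent per collection, in milliseconds.
avg_pause_ms = total_gc_seconds / gc_collections * 1000
print(f"average GC time per collection: {avg_pause_ms:.1f} ms")
```

An average of roughly 32 ms per collection is consistent with GC not being the bottleneck on the Task Manager side.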
Hi,
> I wonder what reason you might have that you ever want such a huge number
> of retained checkpoints?
The Flink jobs running on the EMR cluster require a checkpoint at midnight. (In
our use case we need to sync a loaded delta to our third-party
partner with the streamed data.) The delta load t
Hi,
there is no known limitation in the strict sense, but you might run out of dfs
space or job manager memory if you keep around a huge number of checkpoints. I
wonder what reason you might have that you ever want such a huge number of
retained checkpoints? Usually keeping one checkpoint should d
Hi,
Increasing Akka’s timeouts is rarely a solution for any problem: it either does
not help or just masks the issue, making it less visible. But yes, it is
possible to bump the limits:
https://ci.apache.org/projects/flink/flink-docs-release-1.3/setup/config.html#distributed-coordination-via-akk