Hi,

I'm running a Flink job on version 1.9 with the Blink planner.

My checkpoints time out intermittently, and as the state grows they time out
more and more often, eventually killing the job.

The state is large: Minimum=10.2MB, Maximum=49GB (this one accumulated
because of prior failed checkpoints), Average=8.44GB.

Although the state is huge, I have enough space on the EC2 instance on which
I'm running the job. I'm using RocksDB as the state backend.

*The logs do not have any useful information to explain why checkpoints
are expiring/failing. Can someone please point me to tools that can be used
to investigate and understand why checkpoints are failing?*

Any other related suggestions are also welcome.


Thanks,
Reva.
