Re: hook a callback on checkpointing failure.

2021-10-14 Thread Martijn Visser
Hi,

One way to do it would be to use the Flink metrics [1]: scrape them with something like Prometheus and create alerts from them.

Thanks,
Martijn

[1] https://ci.apache.org/projects/flink/flink-docs-master/docs/ops/metrics/#checkpointing

On Thu, 14 Oct 2021 at 14:45, Mathieu D
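As a rough illustration of the metrics-based approach suggested above, here is a minimal Python sketch that reads a Prometheus text-format scrape and fires a callback when the failed-checkpoint counter goes up. It assumes the job exposes Flink's numberOfFailedCheckpoints metric through the Prometheus reporter under the name flink_jobmanager_job_numberOfFailedCheckpoints; check the name against your own setup. The names failed_checkpoints and FailureWatcher are hypothetical helpers, and fetching the scrape text is left out. In practice a Prometheus alert rule would likely replace this.

```python
from typing import Callable, Optional

# Assumption: name of Flink's numberOfFailedCheckpoints metric as exposed
# by the Prometheus reporter. Verify against your own deployment.
METRIC = "flink_jobmanager_job_numberOfFailedCheckpoints"


def failed_checkpoints(exposition_text: str, metric: str = METRIC) -> float:
    """Sum all samples of `metric` in Prometheus text exposition format."""
    total = 0.0
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # A sample line looks like: name{labels} value [timestamp]
        name = line.split("{", 1)[0].split(" ", 1)[0]
        if name != metric:
            continue
        rest = line.rsplit("}", 1)[1] if "}" in line else line[len(name):]
        total += float(rest.split()[0])
    return total


class FailureWatcher:
    """Hypothetical helper: calls `on_failure` when the counter increases."""

    def __init__(self, on_failure: Callable[[int], None]) -> None:
        self.on_failure = on_failure
        self.last: Optional[float] = None  # no baseline until first scrape

    def observe(self, exposition_text: str) -> None:
        current = failed_checkpoints(exposition_text)
        if self.last is not None and current > self.last:
            # Report how many new failures appeared since the last scrape.
            self.on_failure(int(current - self.last))
        self.last = current
```

A caller would create one FailureWatcher with its notification function and pass each new scrape to observe(); the first scrape only sets the baseline, so failures that happened before the watcher started are not reported.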

hook a callback on checkpointing failure.

2021-10-14 Thread Mathieu D
Hey there,

We have some instabilities around checkpointing that we don't quite understand. In general, as soon as a checkpoint fails, our cluster does not recover to a proper state. To better understand the mechanism, we'd like to be notified as soon as this happens, so we can jump on ou