Hi,
One way to do it would be to expose the Flink checkpointing metrics [1],
have something like Prometheus scrape them, and create alerts on top of them.
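
As a rough sketch of the idea, and not a tested setup: the snippet below parses a scrape in the Prometheus text format and reports jobs whose failed-checkpoint count grew since the previous scrape. The metric name flink_jobmanager_job_numberOfFailedCheckpoints and the job_name label are assumptions about how the Prometheus reporter exposes the numberOfFailedCheckpoints metric, so check them against your own /metrics output. The helper names are made up for illustration.

```python
import re

# Assumed name under which the Prometheus reporter exposes the job-level
# numberOfFailedCheckpoints metric; verify against your own scrape output.
FAILED_METRIC = "flink_jobmanager_job_numberOfFailedCheckpoints"

# One sample line looks like: metric_name{label="value",...} 3.0
_SAMPLE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
_LABEL = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def failed_checkpoints(exposition_text):
    """Return {job_name: failed checkpoint count} from one scrape."""
    counts = {}
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        match = _SAMPLE.match(line)
        if not match or match.group("name") != FAILED_METRIC:
            continue
        labels = dict(_LABEL.findall(match.group("labels") or ""))
        # "job_name" is an assumed label; fall back to a placeholder.
        job = labels.get("job_name", "unknown")
        counts[job] = float(match.group("value"))
    return counts


def newly_failed(previous, current):
    """Return {job_name: increase} for jobs with new checkpoint failures."""
    return {
        job: count - previous.get(job, 0.0)
        for job, count in current.items()
        if count > previous.get(job, 0.0)
    }
```

In a real deployment you would more likely express the same comparison as a Prometheus alerting rule and let Alertmanager send the notification; the Python version only illustrates the logic.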
Thanks,
Martijn
[1]
https://ci.apache.org/projects/flink/flink-docs-master/docs/ops/metrics/#checkpointing
On Thu, 14 Oct 2021 at 14:45, Mathieu D wrote:
Hey there,
We have some instabilities around checkpointing that we don't quite
understand.
In general, as soon as a checkpoint fails, our cluster does not recover
to a proper state.
But to better understand the mechanism, we'd like to be notified as soon as
this happens, so we can jump on ou