Hey there,

We are seeing some instabilities around checkpointing that we don't quite
understand.
In general, once a checkpoint fails, our cluster does not recover to a
healthy state.
But to better understand the mechanism, we'd like to be notified as soon as
this happens, so we can jump on our console and try to understand the
problem.

So, in my mind, we'd simply send a Slack notification to some of our ops
people as soon as a checkpoint fails.

Is there a way to register a callback in the checkpointing system and get
called as soon as a checkpoint fails?

[FWIW our config: Flink 1.12 on YARN/EMR, checkpointing to S3,
RocksDB state backend]

Thanks.
Mathieu
