Hey there,

We have some instabilities around checkpointing that we don't quite understand. In general, as soon as a checkpoint fails, our cluster does not recover to a proper state. To better understand the mechanism, we'd like to be notified as soon as this happens, so we can jump on our console and investigate the problem.
So, in my mind, we'd simply send a Slack notification to some ops people as soon as a checkpoint fails. Is there a way to register a callback in the checkpointing system and get called as soon as one fails?

[FWIW, our config: Flink 1.12 on YARN/EMR, checkpointing to S3, RocksDB state backend]

Thanks,
Mathieu