hook a callback on checkpointing failure.

Mathieu D Thu, 14 Oct 2021 05:45:03 -0700

Hey there,

We are seeing some instabilities around checkpointing that we don't quite
understand.
In general, once a checkpoint fails, our cluster does not recover to a
healthy state.
But to better understand the mechanism, we'd like to be notified as soon as
this happens, so we can jump on our console and try to understand the
problem.


So, in my mind, we'd simply send a Slack notification to some ops people as
soon as a checkpoint fails.

Is there a way to register a callback in the checkpointing system, and get
called as soon as one fails?
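
To make the ask concrete, here is a rough sketch of the fallback I have in
mind if no such hook exists: a small external script that polls the
JobManager REST endpoint /jobs/<jobid>/checkpoints and posts to a Slack
incoming webhook whenever the failed-checkpoint counter goes up. The REST
address, the webhook URL, the polling interval and all function names are
placeholders I made up for illustration, and I'm assuming the response
exposes counts.failed and latest.failed as described in the REST docs.

```python
import json
import time
import urllib.request

# Assumptions (placeholders, not our real setup):
FLINK_REST = "http://localhost:8081"                    # JobManager REST address
SLACK_WEBHOOK = "https://hooks.slack.example/webhook"   # hypothetical webhook URL


def failure_message(prev_failed, stats):
    """Compare the failed-checkpoint counter with the previous value.

    Returns (current_failed_count, text); text is None when no new
    failure has been recorded since the previous poll.
    """
    failed = stats.get("counts", {}).get("failed", 0)
    if failed <= prev_failed:
        return failed, None
    latest = (stats.get("latest") or {}).get("failed") or {}
    text = "Flink checkpoint {} failed: {}".format(
        latest.get("id", "?"), latest.get("failure_message", "no message"))
    return failed, text


def fetch_stats(job_id):
    """Fetch the checkpoint statistics of one job from the REST API."""
    url = "{}/jobs/{}/checkpoints".format(FLINK_REST, job_id)
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def notify_slack(text):
    """Post a plain-text message to the (placeholder) Slack webhook."""
    body = json.dumps({"text": text}).encode("utf-8")
    req = urllib.request.Request(
        SLACK_WEBHOOK, data=body,
        headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req).close()


def watch(job_id, interval_s=30):
    """Poll forever and notify whenever the failed counter increases."""
    failed = fetch_stats(job_id).get("counts", {}).get("failed", 0)
    while True:
        time.sleep(interval_s)
        failed, text = failure_message(failed, fetch_stats(job_id))
        if text:
            notify_slack(text)
```

Polling means we would only hear about a failure up to one interval late,
which is why a real callback inside the checkpointing system would be
preferable.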

[FWIW our config: Flink 1.12 on YARN/EMR, checkpointing to S3,
RocksDB state backend]

Thanks.
Mathieu
