Hello,

My Flink application has entered a bad state, and I was wondering if I could get some advice on how to resolve the issue.
The sequence of events that led to the bad state:

1. A failure occurs within the cluster (e.g., a TM timeout).
2. The application successfully recovers from the last completed checkpoint.
3. The application consumes events from Kafka as quickly as it can, which introduces high backpressure.
4. A checkpoint is triggered.
5. Another failure occurs (e.g., a TM timeout, checkpoint timeout, or Kafka transaction timeout), and the application loops back to step 2.

This creates a vicious cycle in which no progress is made.

I believe the underlying issue is that the application is experiencing high backpressure. This can cause the TM to stop responding to heartbeats, or cause long checkpoint durations because processing of the checkpoint is delayed.

I'm unsure about the best next steps to take. How do I ensure that heartbeats and checkpoints complete successfully when the application is experiencing inevitable high backpressure?

Best,
Hubert