Hi group,

I'd like some advice on how to debug a problem we have encountered. We have a Flink session cluster deployed on AKS using the Flink Kubernetes Operator. Our jobs have checkpointing enabled and most of them work fine. However, one job is having trouble checkpointing.
The job is relatively simple: Kafka Source -> Filter -> ProcessFunction -> Filter (with a side output) -> Kafka Sink. It is a PyFlink job, if that is relevant. The job starts and runs, but after several successful checkpoints one fails. The next one succeeds, then another fails, and after some oscillating every checkpoint fails. The job also behaves erratically: after a while it stops processing messages altogether, and the Kafka source topic shows a large lag (about 150k messages).

The exceptions we get are uninformative:

2025-09-19 11:49:17,084 WARN  org.apache.flink.runtime.checkpoint.CheckpointFailureManager [] - Failed to trigger or complete checkpoint 18 for job 6ef604e79abcdd65eb65d6cb3fa20d6a. (0 consecutive failed attempts so far)
org.apache.flink.runtime.checkpoint.CheckpointException: Checkpoint expired before completing.
org.apache.flink.util.FlinkRuntimeException: Exceeded checkpoint tolerable failure threshold. The latest checkpoint failed due to Checkpoint expired before completing., view the Checkpoint History tab or the Job Manager log to find out why continuous checkpoints failed.
2025-09-19 11:49:17,088 INFO  org.apache.flink.runtime.checkpoint.CheckpointFailureManager [] - Failed to trigger checkpoint for job 6ef604e79abcdd65eb65d6cb3fa20d6a since Checkpoint triggering task Source: Kafka source -> Input mapper, Filter incoming, Process entity request, PROCESS, PROCESS -> (Kafka Sink: Writer -> Kafka Sink: Committer, Get side error output -> Error topic: Writer -> Error topic: Committer, Get side IMS config output, Process IMS config creation, PROCESS -> IMS config topic: Writer -> IMS config topic: Committer) (1/1) of job 6ef604e79abcdd65eb65d6cb3fa20d6a is not being executed at the moment. Aborting checkpoint. Failure reason: Not all required tasks are currently running.

Our checkpoints are minuscule (less than 5 kB), and when they do succeed they complete in at most 3 seconds; we have essentially no state. Checkpoint storage is Azure Blob, since we are running on Azure AKS (any recommendations there?). On Azure Blob we can see that the old checkpoints have been deleted (as desired), but no new ones are being created.

Our checkpoint-related settings are:

jobmanager.memory.heap.size                              6522981568b
jobmanager.memory.jvm-metaspace.size                     1073741824b
jobmanager.memory.jvm-overhead.max                       858993472b
jobmanager.memory.jvm-overhead.min                       858993472b
jobmanager.memory.off-heap.size                          134217728b
jobmanager.memory.process.size                           8 gb
execution.checkpointing.dir                              wasbs://[email protected]/flink-cluster-ims-flink/flink-checkpoints
execution.checkpointing.externalized-checkpoint-retention DELETE_ON_CANCELLATION
execution.checkpointing.incremental                      true
execution.checkpointing.interval                         60000
execution.checkpointing.min-pause                        5000
execution.checkpointing.mode                             EXACTLY_ONCE
execution.checkpointing.num-retained                     3
execution.checkpointing.savepoint-dir                    wasbs://[email protected]/flink-cluster-ims-flink/flink-savepoints
execution.checkpointing.storage                          jobmanager
execution.checkpointing.timeout                          300000
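
For reference, this is roughly how the equivalent interval, mode, min-pause and timeout could be applied programmatically in PyFlink. This is an illustrative sketch only, not our actual job code, assuming a recent PyFlink 1.x API:

from pyflink.datastream import StreamExecutionEnvironment, CheckpointingMode

env = StreamExecutionEnvironment.get_execution_environment()

# 60000 ms interval, exactly-once (execution.checkpointing.interval / .mode)
env.enable_checkpointing(60000, CheckpointingMode.EXACTLY_ONCE)

checkpoint_config = env.get_checkpoint_config()
checkpoint_config.set_min_pause_between_checkpoints(5000)   # execution.checkpointing.min-pause
checkpoint_config.set_checkpoint_timeout(300000)            # execution.checkpointing.timeout (5 min)

# The wasbs:// checkpoint/savepoint directories, retention and storage type
# are taken from the cluster configuration shown above, not set in the job.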

Nix