Hello,

We have just experienced a weird issue in one of our Flink clusters that might be difficult to reproduce, but I figured I would document it in case any of you know what could have gone wrong. This cluster had been running Flink 1.16.1 for a long time and was recently upgraded to 1.17.1. It ran fine for a few days, but then all checkpoints suddenly started failing (RocksDB state backend + Azure ABFSS checkpoint storage + Kubernetes HA).
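For reference, the checkpointing setup is roughly equivalent to the Java sketch below. The ABFSS path is a placeholder, the interval and timeout values are only approximations inferred from the timestamps in the logs further down, and incremental RocksDB checkpoints are implied by the shared .sst handles, so treat this as an illustration of the setup rather than our exact configuration:

import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointSetupSketch {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // RocksDB state backend with incremental checkpoints (implied by the shared/ .sst files in the logs).
        env.setStateBackend(new EmbeddedRocksDBStateBackend(true));

        CheckpointConfig checkpointConfig = env.getCheckpointConfig();

        // Placeholder ABFSS path; assumes the flink-azure-fs-hadoop plugin is available on the cluster.
        checkpointConfig.setCheckpointStorage(
                "abfss://container@account.dfs.core.windows.net/checkpoints");

        // Approximate values: checkpoints are triggered every ~60 s in the logs,
        // and checkpoint 187399 expired ~30 s after being triggered.
        env.enableCheckpointing(60_000);
        checkpointConfig.setCheckpointTimeout(30_000);

        // Only one tolerable consecutive checkpoint failure, as mentioned further down.
        checkpointConfig.setTolerableCheckpointFailureNumber(1);

        // Trivial pipeline just so the sketch runs; the real job graph is unrelated.
        env.fromSequence(0, Long.MAX_VALUE).print();
        env.execute("checkpoint-setup-sketch");
    }
}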
I don't see anything interesting in the logs right before the problem started, just afterwards:

Jul 9, 2023 @ 18:13:41.271 org.apache.flink.runtime.checkpoint.CheckpointCoordinator Completed checkpoint 187398 for job 3d85035b76921c0a905f6c4fade06eca (19891956 bytes, checkpointDuration=443 ms, finalizationTime=103 ms).

Jul 9, 2023 @ 18:14:40.740 org.apache.flink.runtime.checkpoint.CheckpointCoordinator Triggering checkpoint 187399 (type=CheckpointType{name='Checkpoint', sharingFilesStrategy=FORWARD_BACKWARD}) @ 1688919280725 for job 3d85035b76921c0a905f6c4fade06eca.

Jul 9, 2023 @ 18:15:05.472 org.apache.kafka.clients.NetworkClient [AdminClient clientId=flink-enumerator-admin-client] Node 0 disconnected.

Jul 9, 2023 @ 18:15:10.740 org.apache.flink.runtime.checkpoint.CheckpointCoordinator Checkpoint 187399 of job 3d85035b76921c0a905f6c4fade06eca expired before completing.

Jul 9, 2023 @ 18:15:10.741 org.apache.flink.runtime.checkpoint.CheckpointFailureManager Failed to trigger or complete checkpoint 187399 for job 3d85035b76921c0a905f6c4fade06eca. (0 consecutive failed attempts so far)

Jul 9, 2023 @ 18:15:10.905 org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable asynchronous part of checkpoint 187399 could not be completed.

Jul 9, 2023 @ 18:15:40.740 org.apache.flink.runtime.checkpoint.CheckpointCoordinator Triggering checkpoint 187400 (type=CheckpointType{name='Checkpoint', sharingFilesStrategy=FORWARD_BACKWARD}) @ 1688919340725 for job 3d85035b76921c0a905f6c4fade06eca.

Jul 9, 2023 @ 18:15:40.957 org.apache.flink.runtime.state.SharedStateRegistryImpl Duplicated registration under key ec5b73d0-a04c-4574-b380-7981c7173d80-KeyGroupRange{startKeyGroup=60, endKeyGroup=119}-016511.sst of a new state: ByteStreamStateHandle{handleName='abfss://.../checkpoints/3d85035b76921c0a905f6c4fade06eca/shared/2eddc140-51c8-4575-899f-e70ca71f95be', dataBytes=1334}. This might happen during the task failover if state backend creates different states with the same key before and after the failure. Discarding the OLD state and keeping the NEW one which is included into a completed checkpoint

This last line appeared multiple times, and after that all checkpoints failed. At some point, this exception also started appearing:

org.apache.flink.runtime.jobmaster.JobMaster Error while processing AcknowledgeCheckpoint message
org.apache.flink.runtime.checkpoint.CheckpointException: Could not finalize the pending checkpoint 188392. Failure reason: Failure to finalize checkpoint.
...
Caused by: org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "file"

In the Kubernetes HA ConfigMap I can see that the value under ".data.counter" was still increasing; I'm not sure if that's expected. Since we configured only one tolerable consecutive checkpoint failure, the job kept restarting every 2 minutes. Trying to create savepoints also failed with the same UnsupportedFileSystemException. Manually deleting the Flink cluster's pods was enough to bring it back to a working state.

Maybe the handling for this error is not working correctly?

Regards,
Alexis.