Hi again,

I have found out that this issue occurred in 3 different clusters, and 2 of them could not recover after restarting the pods; it seems their state was completely corrupted afterwards and was thus lost. I had never seen this before 1.17.1, so it might be a newly introduced problem.
Regards,
Alexis.

On Mon, Jul 10, 2023 at 11:07 AM Alexis Sarda-Espinosa <sarda.espin...@gmail.com> wrote:

> Hello,
>
> we have just experienced a weird issue in one of our Flink clusters which might be difficult to reproduce, but I figured I would document it in case some of you know what could have gone wrong. This cluster had been running with Flink 1.16.1 for a long time and was recently updated to 1.17.1. It ran fine for a few days, but suddenly all checkpoints started failing (RocksDB + Azure ABFSS + Kubernetes HA). I don't see anything interesting in the logs right before the problem started, just afterwards:
>
> Jul 9, 2023 @ 18:13:41.271 org.apache.flink.runtime.checkpoint.CheckpointCoordinator Completed checkpoint 187398 for job 3d85035b76921c0a905f6c4fade06eca (19891956 bytes, checkpointDuration=443 ms, finalizationTime=103 ms).
> Jul 9, 2023 @ 18:14:40.740 org.apache.flink.runtime.checkpoint.CheckpointCoordinator Triggering checkpoint 187399 (type=CheckpointType{name='Checkpoint', sharingFilesStrategy=FORWARD_BACKWARD}) @ 1688919280725 for job 3d85035b76921c0a905f6c4fade06eca.
> Jul 9, 2023 @ 18:15:05.472 org.apache.kafka.clients.NetworkClient [AdminClient clientId=flink-enumerator-admin-client] Node 0 disconnected.
> Jul 9, 2023 @ 18:15:10.740 org.apache.flink.runtime.checkpoint.CheckpointCoordinator Checkpoint 187399 of job 3d85035b76921c0a905f6c4fade06eca expired before completing.
> Jul 9, 2023 @ 18:15:10.741 org.apache.flink.runtime.checkpoint.CheckpointFailureManager Failed to trigger or complete checkpoint 187399 for job 3d85035b76921c0a905f6c4fade06eca. (0 consecutive failed attempts so far)
> Jul 9, 2023 @ 18:15:10.905 org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable asynchronous part of checkpoint 187399 could not be completed.
> Jul 9, 2023 @ 18:15:40.740 org.apache.flink.runtime.checkpoint.CheckpointCoordinator Triggering checkpoint 187400 (type=CheckpointType{name='Checkpoint', sharingFilesStrategy=FORWARD_BACKWARD}) @ 1688919340725 for job 3d85035b76921c0a905f6c4fade06eca.
> Jul 9, 2023 @ 18:15:40.957 org.apache.flink.runtime.state.SharedStateRegistryImpl Duplicated registration under key ec5b73d0-a04c-4574-b380-7981c7173d80-KeyGroupRange{startKeyGroup=60, endKeyGroup=119}-016511.sst of a new state: ByteStreamStateHandle{handleName='abfss://.../checkpoints/3d85035b76921c0a905f6c4fade06eca/shared/2eddc140-51c8-4575-899f-e70ca71f95be', dataBytes=1334}. This might happen during the task failover if state backend creates different states with the same key before and after the failure. Discarding the OLD state and keeping the NEW one which is included into a completed checkpoint
>
> This last line appeared multiple times, and after that all checkpoints failed. At some point, this exception also started appearing:
>
> org.apache.flink.runtime.jobmaster.JobMaster Error while processing AcknowledgeCheckpoint message
> org.apache.flink.runtime.checkpoint.CheckpointException: Could not finalize the pending checkpoint 188392. Failure reason: Failure to finalize checkpoint.
> ...
> Caused by: org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "file"
>
> In the Kubernetes HA CM I can see that the value under ".data.counter" was still increasing, not sure if that's expected, but since we configured only 1 allowable consecutive checkpoint failure, the job kept restarting every 2 minutes.
> Trying to create savepoints also failed with UnsupportedFileSystemException.
>
> Manually deleting the Flink cluster's pods was enough to bring it back to a working state. Maybe the handling for this error is not working correctly?
>
> Regards,
> Alexis.
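
PS: in case it helps anyone trying to reproduce this, here is a minimal sketch of checkpoint settings consistent with the quoted logs. Only the single tolerable consecutive checkpoint failure is stated explicitly above; the 60 s interval and 30 s timeout are inferred from the log timestamps, and the class name is just a placeholder, so it is not our exact job code.

import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointSettingsSketch {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpoints are triggered roughly once per minute in the logs above (inferred).
        env.enableCheckpointing(60_000);

        CheckpointConfig checkpointConfig = env.getCheckpointConfig();
        // Checkpoint 187399 was triggered at 18:14:40.740 and expired at 18:15:10.740,
        // which matches a 30 s checkpoint timeout (inferred, not confirmed).
        checkpointConfig.setCheckpointTimeout(30_000);
        // "only 1 allowable consecutive checkpoint failure", as stated in the quoted message.
        checkpointConfig.setTolerableCheckpointFailureNumber(1);
    }
}

With that failure tolerance, a second consecutive expired checkpoint is enough to fail and restart the job, which would match the ~2-minute restart cycle described above.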