It turns out someone else already reported this and found a workaround: https://issues.apache.org/jira/browse/FLINK-32241
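
For anyone who lands here later: as far as I understand, the UnsupportedFileSystemException quoted below comes from Hadoop's per-scheme FileSystem lookup, which fails when no implementation can be resolved for a URI scheme, e.g. because the classloader/plugin that provided it is no longer visible. That is my reading, not something confirmed in the ticket. A minimal Java sketch that raises the same exception type, assuming hadoop-common 3.x on the classpath and a made-up "bogus" scheme:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class SchemeLookupProbe {
    public static void main(String[] args) throws Exception {
        // Hadoop resolves the FileSystem implementation per URI scheme: first from
        // "fs.<scheme>.impl" in the Configuration, then from implementations
        // registered via ServiceLoader. If neither yields a class, the lookup fails.
        Configuration conf = new Configuration();
        // No implementation is registered for the made-up "bogus" scheme, so this
        // throws org.apache.hadoop.fs.UnsupportedFileSystemException:
        //   No FileSystem for scheme "bogus"
        FileSystem.get(URI.create("bogus:///tmp/probe"), conf);
    }
}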
On Mon, 10 Jul 2023 at 16:45, Alexis Sarda-Espinosa <sarda.espin...@gmail.com> wrote:

> Hi again,
>
> I have found out that this issue occurred in 3 different clusters, and 2
> of them could not recover after their pods were restarted; it seems the
> state was completely corrupted afterwards and was thus lost. I had never
> seen this before 1.17.1, so it might be a newly introduced problem.
>
> Regards,
> Alexis.
>
>
> On Mon, 10 Jul 2023 at 11:07, Alexis Sarda-Espinosa
> <sarda.espin...@gmail.com> wrote:
>
>> Hello,
>>
>> We have just experienced a weird issue in one of our Flink clusters that
>> might be difficult to reproduce, but I figured I would document it in
>> case some of you know what could have gone wrong. This cluster had been
>> running with Flink 1.16.1 for a long time and was recently updated to
>> 1.17.1. It ran fine for a few days, but suddenly all checkpoints started
>> failing (RocksDB + Azure ABFSS + Kubernetes HA). I don't see anything
>> interesting in the logs right before the problem started, just afterwards:
>>
>> Jul 9, 2023 @ 18:13:41.271 org.apache.flink.runtime.checkpoint.CheckpointCoordinator Completed checkpoint 187398 for job 3d85035b76921c0a905f6c4fade06eca (19891956 bytes, checkpointDuration=443 ms, finalizationTime=103 ms).
>> Jul 9, 2023 @ 18:14:40.740 org.apache.flink.runtime.checkpoint.CheckpointCoordinator Triggering checkpoint 187399 (type=CheckpointType{name='Checkpoint', sharingFilesStrategy=FORWARD_BACKWARD}) @ 1688919280725 for job 3d85035b76921c0a905f6c4fade06eca.
>> Jul 9, 2023 @ 18:15:05.472 org.apache.kafka.clients.NetworkClient [AdminClient clientId=flink-enumerator-admin-client] Node 0 disconnected.
>> Jul 9, 2023 @ 18:15:10.740 org.apache.flink.runtime.checkpoint.CheckpointCoordinator Checkpoint 187399 of job 3d85035b76921c0a905f6c4fade06eca expired before completing.
>> Jul 9, 2023 @ 18:15:10.741 org.apache.flink.runtime.checkpoint.CheckpointFailureManager Failed to trigger or complete checkpoint 187399 for job 3d85035b76921c0a905f6c4fade06eca. (0 consecutive failed attempts so far)
>> Jul 9, 2023 @ 18:15:10.905 org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable asynchronous part of checkpoint 187399 could not be completed.
>> Jul 9, 2023 @ 18:15:40.740 org.apache.flink.runtime.checkpoint.CheckpointCoordinator Triggering checkpoint 187400 (type=CheckpointType{name='Checkpoint', sharingFilesStrategy=FORWARD_BACKWARD}) @ 1688919340725 for job 3d85035b76921c0a905f6c4fade06eca.
>> Jul 9, 2023 @ 18:15:40.957 org.apache.flink.runtime.state.SharedStateRegistryImpl Duplicated registration under key ec5b73d0-a04c-4574-b380-7981c7173d80-KeyGroupRange{startKeyGroup=60, endKeyGroup=119}-016511.sst of a new state: ByteStreamStateHandle{handleName='abfss://.../checkpoints/3d85035b76921c0a905f6c4fade06eca/shared/2eddc140-51c8-4575-899f-e70ca71f95be', dataBytes=1334}. This might happen during the task failover if state backend creates different states with the same key before and after the failure. Discarding the OLD state and keeping the NEW one which is included into a completed checkpoint
>>
>> This last line appeared multiple times, and after that all checkpoints
>> failed.
>> At some point, this exception also started appearing:
>>
>> org.apache.flink.runtime.jobmaster.JobMaster Error while processing AcknowledgeCheckpoint message
>> org.apache.flink.runtime.checkpoint.CheckpointException: Could not finalize the pending checkpoint 188392. Failure reason: Failure to finalize checkpoint.
>> ...
>> Caused by: org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "file"
>>
>> In the Kubernetes HA ConfigMap I can see that the value under
>> ".data.counter" was still increasing. I'm not sure if that's expected,
>> but since we configured only 1 allowable consecutive checkpoint failure,
>> the job kept restarting every 2 minutes. Trying to create savepoints
>> also failed with UnsupportedFileSystemException.
>>
>> Manually deleting the Flink cluster's pods was enough to bring it back
>> to a working state. Maybe the handling for this error is not working
>> correctly?
>>
>> Regards,
>> Alexis.
>>
>
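
PS: For completeness, the "1 allowable consecutive checkpoint failure" mentioned above maps to Flink's tolerable checkpoint failure number. A minimal sketch of that setting via the DataStream API (the interval is illustrative, inferred from the ~60 s cadence in the logs, not necessarily our real config):

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointToleranceSketch {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Trigger a checkpoint every 60 s (illustrative value).
        env.enableCheckpointing(60_000L);
        // Tolerate at most 1 consecutive checkpoint failure; the next consecutive
        // failure fails the job and triggers a restart, consistent with the
        // job restarting roughly every 2 minutes as described above.
        env.getCheckpointConfig().setTolerableCheckpointFailureNumber(1);
    }
}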