Hi again,

I have found out that this issue occurred in 3 different clusters, and 2 of
them could not recover after their pods were restarted; the state apparently
became completely corrupted and was thus lost. I had never seen this before
1.17.1, so it might be a newly introduced problem.

Regards,
Alexis.


On Mon, Jul 10, 2023 at 11:07 AM Alexis Sarda-Espinosa <
sarda.espin...@gmail.com> wrote:

> Hello,
>
> We have just experienced a strange issue in one of our Flink clusters that
> might be difficult to reproduce, but I figured I would document it in case
> some of you know what could have gone wrong. This cluster had been running
> with Flink 1.16.1 for a long time and was recently updated to 1.17.1. It
> ran fine for a few days, but then all checkpoints suddenly started failing
> (RocksDB + Azure ABFSS + Kubernetes HA). I don't see anything interesting
> in the logs right before the problem started, only afterwards:
>
> Jul 9, 2023 @ 18:13:41.271 org.apache.flink.runtime.checkpoint.CheckpointCoordinator Completed checkpoint 187398 for job 3d85035b76921c0a905f6c4fade06eca (19891956 bytes, checkpointDuration=443 ms, finalizationTime=103 ms).
> Jul 9, 2023 @ 18:14:40.740 org.apache.flink.runtime.checkpoint.CheckpointCoordinator Triggering checkpoint 187399 (type=CheckpointType{name='Checkpoint', sharingFilesStrategy=FORWARD_BACKWARD}) @ 1688919280725 for job 3d85035b76921c0a905f6c4fade06eca.
> Jul 9, 2023 @ 18:15:05.472 org.apache.kafka.clients.NetworkClient [AdminClient clientId=flink-enumerator-admin-client] Node 0 disconnected.
> Jul 9, 2023 @ 18:15:10.740 org.apache.flink.runtime.checkpoint.CheckpointCoordinator Checkpoint 187399 of job 3d85035b76921c0a905f6c4fade06eca expired before completing.
> Jul 9, 2023 @ 18:15:10.741 org.apache.flink.runtime.checkpoint.CheckpointFailureManager Failed to trigger or complete checkpoint 187399 for job 3d85035b76921c0a905f6c4fade06eca. (0 consecutive failed attempts so far)
> Jul 9, 2023 @ 18:15:10.905 org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable asynchronous part of checkpoint 187399 could not be completed.
> Jul 9, 2023 @ 18:15:40.740 org.apache.flink.runtime.checkpoint.CheckpointCoordinator Triggering checkpoint 187400 (type=CheckpointType{name='Checkpoint', sharingFilesStrategy=FORWARD_BACKWARD}) @ 1688919340725 for job 3d85035b76921c0a905f6c4fade06eca.
> Jul 9, 2023 @ 18:15:40.957 org.apache.flink.runtime.state.SharedStateRegistryImpl Duplicated registration under key ec5b73d0-a04c-4574-b380-7981c7173d80-KeyGroupRange{startKeyGroup=60, endKeyGroup=119}-016511.sst of a new state: ByteStreamStateHandle{handleName='abfss://.../checkpoints/3d85035b76921c0a905f6c4fade06eca/shared/2eddc140-51c8-4575-899f-e70ca71f95be', dataBytes=1334}. This might happen during the task failover if state backend creates different states with the same key before and after the failure. Discarding the OLD state and keeping the NEW one which is included into a completed checkpoint
>
> This last line appeared multiple times, and after that all checkpoints
> failed. At some point, this exception also started appearing:
>
> org.apache.flink.runtime.jobmaster.JobMaster Error while processing AcknowledgeCheckpoint message
>     org.apache.flink.runtime.checkpoint.CheckpointException: Could not finalize the pending checkpoint 188392. Failure reason: Failure to finalize checkpoint.
>     ...
>     Caused by: org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "file"
>
> In the Kubernetes HA ConfigMap I can see that the value under
> ".data.counter" was still increasing; I'm not sure whether that's expected.
> Since we configured only 1 tolerable consecutive checkpoint failure, the
> job kept restarting every 2 minutes. Trying to create savepoints also
> failed with UnsupportedFileSystemException.
>
> Manually deleting the Flink cluster's pods was enough to bring it back to
> a working state. Maybe the handling of this error is not working correctly?
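>
> In case it's useful, the manual recovery amounted to something like the
> following (namespace and label selector are only examples, they depend on
> how the cluster is deployed):
>
>     kubectl -n <namespace> delete pod -l app=<flink-cluster-name>
>
> Kubernetes then recreated the JobManager and TaskManager pods and the
> cluster went back to a working state.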
>
> Regards,
> Alexis.
>
