I found out someone else reported this and found a workaround:
https://issues.apache.org/jira/browse/FLINK-32241

On Mon, Jul 10, 2023 at 16:45, Alexis Sarda-Espinosa <
sarda.espin...@gmail.com> wrote:

> Hi again,
>
> I have found out that this issue occurred in 3 different clusters, and 2
> of them could not recover after restarting pods; it seems the state was
> completely corrupted afterwards and was thus lost. I had never seen this
> before 1.17.1, so it might be a newly introduced problem.
>
> Regards,
> Alexis.
>
>
> On Mon, Jul 10, 2023 at 11:07, Alexis Sarda-Espinosa <
> sarda.espin...@gmail.com> wrote:
>
>> Hello,
>>
>> We have just experienced a weird issue in one of our Flink clusters which
>> might be difficult to reproduce, but I figured I would document it in case
>> some of you know what could have gone wrong. The cluster had been running
>> with Flink 1.16.1 for a long time and was recently updated to 1.17.1. It
>> ran fine for a few days, but then suddenly all checkpoints started failing
>> (RocksDB + Azure ABFSS + Kubernetes HA). I don't see anything interesting
>> in the logs right before the problem started, only afterwards:
>>
>> Jul 9, 2023 @ 18:13:41.271 org.apache.flink.runtime.checkpoint.CheckpointCoordinator Completed checkpoint 187398 for job 3d85035b76921c0a905f6c4fade06eca (19891956 bytes, checkpointDuration=443 ms, finalizationTime=103 ms).
>> Jul 9, 2023 @ 18:14:40.740 org.apache.flink.runtime.checkpoint.CheckpointCoordinator Triggering checkpoint 187399 (type=CheckpointType{name='Checkpoint', sharingFilesStrategy=FORWARD_BACKWARD}) @ 1688919280725 for job 3d85035b76921c0a905f6c4fade06eca.
>> Jul 9, 2023 @ 18:15:05.472 org.apache.kafka.clients.NetworkClient [AdminClient clientId=flink-enumerator-admin-client] Node 0 disconnected.
>> Jul 9, 2023 @ 18:15:10.740 org.apache.flink.runtime.checkpoint.CheckpointCoordinator Checkpoint 187399 of job 3d85035b76921c0a905f6c4fade06eca expired before completing.
>> Jul 9, 2023 @ 18:15:10.741 org.apache.flink.runtime.checkpoint.CheckpointFailureManager Failed to trigger or complete checkpoint 187399 for job 3d85035b76921c0a905f6c4fade06eca. (0 consecutive failed attempts so far)
>> Jul 9, 2023 @ 18:15:10.905 org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable asynchronous part of checkpoint 187399 could not be completed.
>> Jul 9, 2023 @ 18:15:40.740 org.apache.flink.runtime.checkpoint.CheckpointCoordinator Triggering checkpoint 187400 (type=CheckpointType{name='Checkpoint', sharingFilesStrategy=FORWARD_BACKWARD}) @ 1688919340725 for job 3d85035b76921c0a905f6c4fade06eca.
>> Jul 9, 2023 @ 18:15:40.957 org.apache.flink.runtime.state.SharedStateRegistryImpl Duplicated registration under key ec5b73d0-a04c-4574-b380-7981c7173d80-KeyGroupRange{startKeyGroup=60, endKeyGroup=119}-016511.sst of a new state: ByteStreamStateHandle{handleName='abfss://.../checkpoints/3d85035b76921c0a905f6c4fade06eca/shared/2eddc140-51c8-4575-899f-e70ca71f95be', dataBytes=1334}. This might happen during the task failover if state backend creates different states with the same key before and after the failure. Discarding the OLD state and keeping the NEW one which is included into a completed checkpoint
>>
>> This last line appeared multiple times, and after that all checkpoints
>> failed. At some point, this exception also started appearing:
>>
>> org.apache.flink.runtime.jobmaster.JobMaster Error while processing AcknowledgeCheckpoint message
>>     org.apache.flink.runtime.checkpoint.CheckpointException: Could not finalize the pending checkpoint 188392. Failure reason: Failure to finalize checkpoint.
>>     ...
>>     Caused by: org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "file"
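>>
>> As an aside, and purely an assumption on my part rather than anything
>> confirmed in the ticket: in plain Hadoop, "No FileSystem for scheme"
>> usually means the implementation for that scheme is not registered in the
>> active Hadoop Configuration. A minimal sketch of the generic mitigation,
>> using Flink's flink.hadoop.* forwarding in flink-conf.yaml:
>>
>>     # Hypothetical sketch: explicitly register Hadoop's local file
>>     # system implementation for the "file" scheme.
>>     flink.hadoop.fs.file.impl: org.apache.hadoop.fs.LocalFileSystem
>>
>> I have not verified that this reaches the checkpoint finalization path in
>> 1.17.1, so treat it as a sketch only.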
>>
>> In the Kubernetes HA ConfigMap I can see that the value under
>> ".data.counter" was still increasing; I'm not sure if that's expected.
>> Since we configured only 1 tolerable consecutive checkpoint failure, the
>> job kept restarting every 2 minutes. Trying to create savepoints also
>> failed with UnsupportedFileSystemException.
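>>
>> For anyone who wants to check the same thing, this is roughly how we read
>> the counter, assuming the job-specific HA ConfigMap follows the usual
>> <cluster-id>-<job-id>-config-map naming (placeholders, adjust to your
>> deployment):
>>
>>     # Read the checkpoint ID counter tracked by Kubernetes HA
>>     kubectl get configmap <cluster-id>-<job-id>-config-map \
>>         -o jsonpath='{.data.counter}'
>>
>> The failure tolerance I mentioned is the standard option in
>> flink-conf.yaml:
>>
>>     # Tolerate at most 1 consecutive checkpoint failure before the
>>     # job is failed (and restarted by the configured restart strategy)
>>     execution.checkpointing.tolerable-failed-checkpoints: 1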
>>
>> Manually deleting the Flink cluster's pods was enough to bring it back to
>> a working state. Maybe the handling of this error is not working correctly?
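>>
>> For completeness, "deleting the pods" was nothing more elaborate than the
>> following (the label selector is from our setup and is a placeholder
>> here):
>>
>>     # Delete the JobManager/TaskManager pods; the Deployments recreate
>>     # them and the job restores from the latest completed checkpoint.
>>     kubectl delete pod -n <namespace> -l app=<flink-cluster-name>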
>>
>> Regards,
>> Alexis.
>>
>
