Hello,

We have just experienced a weird issue in one of our Flink clusters that might be
difficult to reproduce, but I figured I would document it in case some of you know
what could have gone wrong. The cluster had been running Flink 1.16.1 for a long
time and was recently updated to 1.17.1. It ran fine for a few days, but then all
checkpoints suddenly started failing (RocksDB + Azure ABFSS + Kubernetes HA). I
don't see anything interesting in the logs right before the problem started, only
afterwards:

Jul 9, 2023 @ 18:13:41.271 org.apache.flink.runtime.checkpoint.CheckpointCoordinator Completed checkpoint 187398 for job 3d85035b76921c0a905f6c4fade06eca (19891956 bytes, checkpointDuration=443 ms, finalizationTime=103 ms).
Jul 9, 2023 @ 18:14:40.740 org.apache.flink.runtime.checkpoint.CheckpointCoordinator Triggering checkpoint 187399 (type=CheckpointType{name='Checkpoint', sharingFilesStrategy=FORWARD_BACKWARD}) @ 1688919280725 for job 3d85035b76921c0a905f6c4fade06eca.
Jul 9, 2023 @ 18:15:05.472 org.apache.kafka.clients.NetworkClient [AdminClient clientId=flink-enumerator-admin-client] Node 0 disconnected.
Jul 9, 2023 @ 18:15:10.740 org.apache.flink.runtime.checkpoint.CheckpointCoordinator Checkpoint 187399 of job 3d85035b76921c0a905f6c4fade06eca expired before completing.
Jul 9, 2023 @ 18:15:10.741 org.apache.flink.runtime.checkpoint.CheckpointFailureManager Failed to trigger or complete checkpoint 187399 for job 3d85035b76921c0a905f6c4fade06eca. (0 consecutive failed attempts so far)
Jul 9, 2023 @ 18:15:10.905 org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable asynchronous part of checkpoint 187399 could not be completed.
Jul 9, 2023 @ 18:15:40.740 org.apache.flink.runtime.checkpoint.CheckpointCoordinator Triggering checkpoint 187400 (type=CheckpointType{name='Checkpoint', sharingFilesStrategy=FORWARD_BACKWARD}) @ 1688919340725 for job 3d85035b76921c0a905f6c4fade06eca.
Jul 9, 2023 @ 18:15:40.957 org.apache.flink.runtime.state.SharedStateRegistryImpl Duplicated registration under key ec5b73d0-a04c-4574-b380-7981c7173d80-KeyGroupRange{startKeyGroup=60, endKeyGroup=119}-016511.sst of a new state: ByteStreamStateHandle{handleName='abfss://.../checkpoints/3d85035b76921c0a905f6c4fade06eca/shared/2eddc140-51c8-4575-899f-e70ca71f95be', dataBytes=1334}. This might happen during the task failover if state backend creates different states with the same key before and after the failure. Discarding the OLD state and keeping the NEW one which is included into a completed checkpoint

This last line appeared multiple times, and after that all checkpoints
failed. At some point, this exception also started appearing:

org.apache.flink.runtime.jobmaster.JobMaster Error while processing AcknowledgeCheckpoint message
    org.apache.flink.runtime.checkpoint.CheckpointException: Could not finalize the pending checkpoint 188392. Failure reason: Failure to finalize checkpoint.
    ...
    Caused by: org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "file"

In the Kubernetes HA ConfigMap I can see that the value under ".data.counter" was
still increasing; I'm not sure if that's expected. Since we configured only 1
tolerable consecutive checkpoint failure, the job kept restarting every 2 minutes.
Trying to create savepoints also failed with the same
UnsupportedFileSystemException.
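
For reference, this is roughly how checkpointing is configured for this job. It's
only a minimal Java sketch: the storage path is a placeholder, the interval and
timeout are approximated from the log timestamps above, and the tolerable-failure
count of 1 is the setting I mentioned:

    import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class CheckpointConfigSketch {
        public static void main(String[] args) {
            StreamExecutionEnvironment env =
                    StreamExecutionEnvironment.getExecutionEnvironment();

            // Incremental RocksDB checkpoints stored on ABFSS (path is a placeholder).
            env.setStateBackend(new EmbeddedRocksDBStateBackend(true));
            env.getCheckpointConfig().setCheckpointStorage(
                    "abfss://<container>@<account>.dfs.core.windows.net/flink/checkpoints");

            // Interval and timeout approximated from the log timestamps above
            // (checkpoints triggered every ~60 s, expiring ~30 s later).
            env.enableCheckpointing(60_000);
            env.getCheckpointConfig().setCheckpointTimeout(30_000);

            // The setting mentioned above: only 1 tolerable consecutive
            // checkpoint failure before the job fails over and restarts.
            env.getCheckpointConfig().setTolerableCheckpointFailureNumber(1);
        }
    }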

Manually deleting the Flink cluster's pods was enough to bring it back to a
working state. Maybe the handling of this error is not working correctly?

Regards,
Alexis.
