Hi, Lars. Could you check whether you have configured a lifecycle policy [1] on the Google Cloud Storage bucket? Lifecycle rules are not recommended on a bucket that stores Flink checkpoints.
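
For context: with incremental checkpointing, newer checkpoints can keep referencing files under shared/ that were written long ago, so an age-based Delete rule may remove a file that is still needed on restore. A rough way to check, assuming you first dump the bucket's lifecycle configuration as JSON (for example with `gsutil lifecycle get gs://some-bucket-name`), is sketched below. The helper name `risky_delete_rules` and the sample policy are made up for illustration and are not taken from your setup.

```python
import json


def risky_delete_rules(lifecycle_json):
    """Return the lifecycle rules whose action type is Delete."""
    config = json.loads(lifecycle_json)
    # `gsutil lifecycle get` prints {"rule": [...]}; the JSON API bucket
    # resource nests the same object under a "lifecycle" key. Accept both.
    config = config.get("lifecycle", config)
    return [
        rule
        for rule in config.get("rule", [])
        if rule.get("action", {}).get("type") == "Delete"
    ]


# Hypothetical policy: delete objects older than 7 days, and move objects
# older than 30 days to a colder storage class.
example = json.dumps({
    "rule": [
        {"action": {"type": "Delete"}, "condition": {"age": 7}},
        {
            "action": {"type": "SetStorageClass", "storageClass": "NEARLINE"},
            "condition": {"age": 30},
        },
    ]
})

for rule in risky_delete_rules(example):
    print("Delete rule with condition:", rule["condition"])
```

If this reports any Delete rule that can match objects under the checkpoint path, that rule is a plausible cause of the missing shared/ file, though it would not rule out other causes.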

[1] https://cloud.google.com/storage/docs/lifecycle

On Fri, Dec 9, 2022 at 2:02 AM Lars Skjærven <lar...@gmail.com> wrote:

> Hello,
> We had an incident today with a job that could not restore after crash
> (for unknown reason). Specifically, it fails due to a missing checkpoint
> file. We've experienced this a total of three times with Flink 1.15.2, but
> never with 1.14.x. Last time was during a node upgrade, but that was not
> the case this time.
>
> I've not been able to reproduce this issue. I've checked that I can kill
> the taskmanager and jobmanager (using kubectl delete pod), and the job
> restores as expected.
>
> The job is running with kubernetes high availability, rocksdb and
> incremental checkpointing.
>
> Any tips are highly appreciated.
>
> Thanks,
> Lars
>
> Caused by: org.apache.flink.util.FlinkException: Could not restore keyed state backend for KeyedProcessOperator_bf374b554824ef28e76619f4fa153430_(2/2) from any of the 1 provided restore options.
>     at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:160)
>     at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.keyedStatedBackend(StreamTaskStateInitializerImpl.java:346)
>     at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:164)
>     ... 11 more
> Caused by: org.apache.flink.runtime.state.BackendBuildingException: Caught unexpected exception.
>     at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackendBuilder.build(RocksDBKeyedStateBackendBuilder.java:395)
>     at org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend.createKeyedStateBackend(EmbeddedRocksDBStateBackend.java:483)
>     at org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend.createKeyedStateBackend(EmbeddedRocksDBStateBackend.java:97)
>     at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.lambda$keyedStatedBackend$1(StreamTaskStateInitializerImpl.java:329)
>     at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.attemptCreateAndRestore(BackendRestorerProcedure.java:168)
>     at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:135)
>     ... 13 more
> Caused by: java.io.FileNotFoundException: Item not found: 'gs://some-bucket-name/flink-jobs/namespaces/default/jobs/d60a6c94-ddbc-42a1-947e-90f62749835a/checkpoints/d60a6c94ddbc42a1947e90f62749835a/shared/3cb2bb55-b4b0-44e5-948a-5d38ec088253'. Note, it is possible that the live version is still available but the requested generation is deleted.
>     at com.google.cloud.hadoop.gcsio.GoogleCloudStorageExceptions.createFileNotFoundException(GoogleCloudStorageExceptions.java:46)

--
Best,
Hangxiang.