Is there anything in the Flink logs indicating issues with writing the
checkpoint data?
When the savepoint could not be created, was anything logged by Flink?
How did you shut down the cluster?
On 6/3/2021 5:56 AM, Alexander Filipchik wrote:
Hi,
Trying to figure out what happened with our Flink job. We use Flink
1.11.1 and run a job with unaligned checkpoints and the RocksDB state
backend. The whole state is around 300 GB, judging by the size of the
savepoints.
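For context, the checkpointing setup looks roughly like the sketch below
(bucket path, interval and class name are placeholders, not our exact code):

import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class JobSetupSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Incremental RocksDB checkpoints written to GCS (placeholder bucket).
        env.setStateBackend(new RocksDBStateBackend("gs://<bucket>/app/checkpoints", true));

        // Unaligned checkpoints (available since 1.11) and externalized
        // checkpoints retained on cancellation, so we can redeploy from them.
        env.enableCheckpointing(60_000);
        env.getCheckpointConfig().enableUnalignedCheckpoints();
        env.getCheckpointConfig().enableExternalizedCheckpoints(
                CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);

        // ... sources, transformations and sinks elided ...
        env.execute("app");
    }
}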
The job ran fine. At some point we tried to deploy new code, but we
couldn't take a savepoint because the attempts kept timing out. It looks
like the timeouts were caused by disk throttling (we use Google regional
disks).
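For what it's worth, the timeout in play should be either the
coordinator-side checkpoint timeout or the CLI client timeout; continuing
the sketch above, the former is set like this (the value is a placeholder,
not our actual setting):

        // Coordinator-side timeout that also bounds savepoint creation.
        env.getCheckpointConfig().setCheckpointTimeout(10 * 60_000);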
The new code was deployed using an externalized checkpoint, but it
didn't start, as the job kept failing with:
Caused by: java.io.FileNotFoundException: Item not found:
'gs://../app/checkpoints/2834fa1c81dcf7c9578a8be9a371b0d1/shared/3477b236-fb4b-4a0d-be73-cb6fac62c007'.
Note, it is possible that the live version is still available but the
requested generation is deleted.
    at com.google.cloud.hadoop.gcsio.GoogleCloudStorageExceptions.createFileNotFoundException(GoogleCloudStorageExceptions.java:45)
    at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.open(GoogleCloudStorageImpl.java:653)
    at com.google.cloud.hadoop.gcsio.GoogleCloudStorageFileSystem.open(GoogleCloudStorageFileSystem.java:277)
    at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFSInputStream.<init>(GoogleHadoopFSInputStream.java:78)
    at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.open(GoogleHadoopFileSystemBase.java:620)
    at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:950)
    at com.css.flink.fs.gcs.moved.HadoopFileSystem.open(HadoopFileSystem.java:120)
    at com.css.flink.fs.gcs.moved.HadoopFileSystem.open(HadoopFileSystem.java:37)
    at org.apache.flink.core.fs.PluginFileSystemFactory$ClassLoaderFixingFileSystem.open(PluginFileSystemFactory.java:127)
    at org.apache.flink.core.fs.SafetyNetWrapperFileSystem.open(SafetyNetWrapperFileSystem.java:85)
    at org.apache.flink.runtime.state.filesystem.FileStateHandle.openInputStream(FileStateHandle.java:69)
    at org.apache.flink.contrib.streaming.state.RocksDBStateDownloader.downloadDataForStateHandle(RocksDBStateDownloader.java:126)
    at org.apache.flink.contrib.streaming.state.RocksDBStateDownloader.lambda$createDownloadRunnables$0(RocksDBStateDownloader.java:109)
    at org.apache.flink.util.function.ThrowingRunnable.lambda$unchecked$0(ThrowingRunnable.java:50)
    at java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1640)
    at org.apache.flink.runtime.concurrent.DirectExecutorService.execute(DirectExecutorService.java:211)
    at java.util.concurrent.CompletableFuture.asyncRunStage(CompletableFuture.java:1654)
    at java.util.concurrent.CompletableFuture.runAsync(CompletableFuture.java:1871)
    at org.apache.flink.contrib.streaming.state.RocksDBStateDownloader.downloadDataForAllStateHandles(RocksDBStateDownloader.java:83)
    at org.apache.flink.contrib.streaming.state.RocksDBStateDownloader.transferAllStateDataToDirectory(RocksDBStateDownloader.java:66)
    at org.apache.flink.contrib.streaming.state.restore.RocksDBIncrementalRestoreOperation.transferRemoteStateToLocalDirectory(RocksDBIncrementalRestoreOperation.java:230)
    at org.apache.flink.contrib.streaming.state.restore.RocksDBIncrementalRestoreOperation.restoreFromRemoteState(RocksDBIncrementalRestoreOperation.java:195)
    at org.apache.flink.contrib.streaming.state.restore.RocksDBIncrementalRestoreOperation.restoreWithoutRescaling(RocksDBIncrementalRestoreOperation.java:169)
    at org.apache.flink.contrib.streaming.state.restore.RocksDBIncrementalRestoreOperation.restore(RocksDBIncrementalRestoreOperation.java:155)
    at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackendBuilder.build(RocksDBKeyedStateBackendBuilder.java:270)
    ... 15 more
We tried to roll back the code and we tried different checkpoints, but
all the attempts failed with the same error. The job ID in the error does
not match the checkpoint path we restored from; it looks like the restore
logic follows references to shared state from previous jobs, as all the
checkpoints taken after 2834fa1c81dcf7c9578a8be9a371b0d1 fail to restore
with the same error.
We looked at different checkpoints and found that some of them are
missing the _metadata file and can't be used for restoration.
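A sketch of one way to check this, using the Hadoop FileSystem API (it
assumes the GCS connector is on the classpath; bucket and job ID are
placeholders):

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FindRestorableCheckpoints {
    public static void main(String[] args) throws Exception {
        // Placeholder checkpoint directory for one job.
        Path base = new Path("gs://<bucket>/app/checkpoints/<job-id>");
        FileSystem fs = FileSystem.get(URI.create(base.toString()), new Configuration());

        // Each retained checkpoint lives in a chk-<n> directory; only the ones
        // that still contain a _metadata file can be used for a restore.
        for (FileStatus status : fs.listStatus(base)) {
            if (status.isDirectory() && status.getPath().getName().startsWith("chk-")) {
                boolean hasMetadata = fs.exists(new Path(status.getPath(), "_metadata"));
                System.out.println(status.getPath() + " -> "
                        + (hasMetadata ? "restorable" : "missing _metadata"));
            }
        }
    }
}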
We also use ZooKeeper for HA, and we cleaned up the state there between
deployments to make sure the reference to the non-existent file was not
coming from there.
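For reference, that kind of cleanup boils down to deleting the
checkpoint-related znodes under the cluster's HA root; a Curator sketch
(connect string and znode paths are placeholders and depend on the
high-availability.zookeeper.path.* settings):

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class CleanFlinkZkCheckpointState {
    public static void main(String[] args) throws Exception {
        // Placeholder connect string; adjust to the actual ZooKeeper quorum.
        try (CuratorFramework client = CuratorFrameworkFactory.newClient(
                "zookeeper:2181", new ExponentialBackoffRetry(1000, 3))) {
            client.start();
            // Assumed default layout: <ha-root>/<cluster-id>/checkpoints and
            // .../checkpoint-counter; verify against the cluster's
            // high-availability.zookeeper.path.* configuration before deleting.
            client.delete().deletingChildrenIfNeeded().forPath("/flink/default/checkpoints");
            client.delete().deletingChildrenIfNeeded().forPath("/flink/default/checkpoint-counter");
        }
    }
}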
We decided to drop the state, since we have the means to repopulate it,
but it would be great to get to the bottom of this. Any help would be
appreciated.
Alex