[ https://issues.apache.org/jira/browse/FLINK-37437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Zakelly Lan reassigned FLINK-37437: ----------------------------------- Assignee: Zakelly Lan > ForSt checkpoint failure due to file not visible when writing > ------------------------------------------------------------- > > Key: FLINK-37437 > URL: https://issues.apache.org/jira/browse/FLINK-37437 > Project: Flink > Issue Type: Bug > Components: Runtime / State Backends > Affects Versions: 2.0.0 > Reporter: Zakelly Lan > Assignee: Zakelly Lan > Priority: Blocker > Labels: pull-request-available > Fix For: 2.0.0 > > > We leverage `disableFileDeletions` and `enableFileDeletions` of ForSt to keep > files alive during checkpoint. When checkpoint finished, the > `enableFileDeletions` will be invoked, while ForSt will list all the files > and check status. These files, however, include the writing ones, which is > not visible in some DFS (e.g. oss or s3) yet. Although the checkpoints are > actually not missing, the thrown exception make the finalization of > checkpoint failed, resulting in consecutive cp failures. > The impact would be setups with object storages (e.g. oss or s3). > {code:java} > 2025-03-08 00:03:45,064 INFO > org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable [] - > Deduplicate[6] -> Calc[7] -> StreamRecordTimestampInserter[8] -> > nexmark_q18[8]: Writer (1/8)#0 - asynchronous part of checkpoint 9 could not > be completed. > java.util.concurrent.ExecutionException: java.io.FileNotFoundException: > oss://xxxxxxxxxxxxxxxx/abcee1e32ecbdd340adf4139658a1737/shared/op_AsyncKeyedProcessOperator_f6dc7f4d2283f4605b127b9364e21148__1_8__attempt_0/db/83da0fc3-53ee-4a1e-8056-2e5c51593146: > No such file or directory! > at java.base/java.util.concurrent.FutureTask.report(FutureTask.java:122) > ~[?:?] > at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:191) > ~[?:?] > at > org.apache.flink.util.concurrent.FutureUtils.runIfNotDoneAndGet(FutureUtils.java:511) > ~[flink-dist-2.0-SNAPSHOT.jar:2.0-SNAPSHOT] > at > org.apache.flink.streaming.api.operators.OperatorSnapshotFinalizer.<init>(OperatorSnapshotFinalizer.java:54) > ~[flink-dist-2.0-SNAPSHOT.jar:2.0-SNAPSHOT] > at > org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.finalizeNonFinishedSnapshots(AsyncCheckpointRunnable.java:191) > ~[flink-dist-2.0-SNAPSHOT.jar:2.0-SNAPSHOT] > at > org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.run(AsyncCheckpointRunnable.java:124) > ~[flink-dist-2.0-SNAPSHOT.jar:2.0-SNAPSHOT] > at org.apache.flink.util.MdcUtils.lambda$wrapRunnable$1(MdcUtils.java:70) > ~[flink-dist-2.0-SNAPSHOT.jar:2.0-SNAPSHOT] > at > java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) > [?:?] > at > java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) > [?:?] > at java.base/java.lang.Thread.run(Thread.java:829) [?:?] > Caused by: java.io.FileNotFoundException: > oss://xxxxxxxxxxxxxxxx/abcee1e32ecbdd340adf4139658a1737/shared/op_AsyncKeyedProcessOperator_f6dc7f4d2283f4605b127b9364e21148__1_8__attempt_0/db/83da0fc3-53ee-4a1e-8056-2e5c51593146: > No such file or directory! > at > org.apache.hadoop.fs.aliyun.oss.AliyunOSSFileSystem.getFileStatus(AliyunOSSFileSystem.java:283) > ~[flink-oss-fs-hadoop-2.0-SNAPSHOT.jar:2.0-SNAPSHOT] > at > org.apache.flink.fs.osshadoop.common.HadoopFileSystem.getFileStatus(HadoopFileSystem.java:88) > ~[flink-oss-fs-hadoop-2.0-SNAPSHOT.jar:2.0-SNAPSHOT] > at > org.apache.flink.core.fs.SafetyNetWrapperFileSystem.getFileStatus(SafetyNetWrapperFileSystem.java:78) > ~[flink-dist-2.0-SNAPSHOT.jar:2.0-SNAPSHOT] > at > org.apache.flink.state.forst.fs.ForStFlinkFileSystem.getFileStatus(ForStFlinkFileSystem.java:287) > ~[flink-dist-2.0-SNAPSHOT.jar:2.0-SNAPSHOT] > at > org.apache.flink.state.forst.fs.ForStFlinkFileSystem.listStatus(ForStFlinkFileSystem.java:325) > ~[flink-dist-2.0-SNAPSHOT.jar:2.0-SNAPSHOT] > at > org.apache.flink.state.forst.fs.StringifiedForStFileSystem.listStatus(StringifiedForStFileSystem.java:52) > ~[flink-dist-2.0-SNAPSHOT.jar:2.0-SNAPSHOT] > at org.forstdb.RocksDB.enableFileDeletions(Native Method) > ~[flink-dist-2.0-SNAPSHOT.jar:2.0-SNAPSHOT] > at org.forstdb.RocksDB.enableFileDeletions(RocksDB.java:4312) > ~[flink-dist-2.0-SNAPSHOT.jar:2.0-SNAPSHOT] > at > org.apache.flink.state.forst.snapshot.ForStNativeFullSnapshotStrategy.lambda$syncPrepareResources$3(ForStNativeFullSnapshotStrategy.java:187) > ~[flink-dist-2.0-SNAPSHOT.jar:2.0-SNAPSHOT] > at > org.apache.flink.state.forst.snapshot.ForStSnapshotStrategyBase$ForStNativeSnapshotResources.release(ForStSnapshotStrategyBase.java:299) > ~[flink-dist-2.0-SNAPSHOT.jar:2.0-SNAPSHOT] > at > org.apache.flink.state.forst.snapshot.ForStIncrementalSnapshotStrategy$ForStIncrementalSnapshotOperation.get(ForStIncrementalSnapshotStrategy.java:287) > ~[flink-dist-2.0-SNAPSHOT.jar:2.0-SNAPSHOT] > at > org.apache.flink.runtime.state.SnapshotStrategyRunner$1.callInternal(SnapshotStrategyRunner.java:91) > ~[flink-dist-2.0-SNAPSHOT.jar:2.0-SNAPSHOT] > at > org.apache.flink.runtime.state.SnapshotStrategyRunner$1.callInternal(SnapshotStrategyRunner.java:88) > ~[flink-dist-2.0-SNAPSHOT.jar:2.0-SNAPSHOT] > at > org.apache.flink.runtime.state.AsyncSnapshotCallable.call(AsyncSnapshotCallable.java:78) > ~[flink-dist-2.0-SNAPSHOT.jar:2.0-SNAPSHOT] > at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) > ~[?:?] > at > org.apache.flink.util.concurrent.FutureUtils.runIfNotDoneAndGet(FutureUtils.java:508) > ~[flink-dist-2.0-SNAPSHOT.jar:2.0-SNAPSHOT] > ... 7 more {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)