[ 
https://issues.apache.org/jira/browse/FLINK-37437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zakelly Lan resolved FLINK-37437.
---------------------------------
    Resolution: Fixed

> ForSt checkpoint failure due to file not visible when writing
> -------------------------------------------------------------
>
>                 Key: FLINK-37437
>                 URL: https://issues.apache.org/jira/browse/FLINK-37437
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / State Backends
>    Affects Versions: 2.0.0
>            Reporter: Zakelly Lan
>            Assignee: Zakelly Lan
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: 2.0.0
>
>
> We use ForSt's `disableFileDeletions` and `enableFileDeletions` to keep 
> files alive during a checkpoint. When the checkpoint finishes, 
> `enableFileDeletions` is invoked, and ForSt lists all files and checks 
> their status. This list, however, includes files that are still being 
> written, which are not yet visible in some DFS (e.g. OSS or S3). Although 
> the checkpoints are not actually missing, the thrown exception makes the 
> finalization of the checkpoint fail, resulting in consecutive checkpoint 
> failures.
> This affects setups that use object storage (e.g. OSS or S3).
> {code:java}
> 2025-03-08 00:03:45,064 INFO  org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable [] - Deduplicate[6] -> Calc[7] -> StreamRecordTimestampInserter[8] -> nexmark_q18[8]: Writer (1/8)#0 - asynchronous part of checkpoint 9 could not be completed.
> java.util.concurrent.ExecutionException: java.io.FileNotFoundException: oss://xxxxxxxxxxxxxxxx/abcee1e32ecbdd340adf4139658a1737/shared/op_AsyncKeyedProcessOperator_f6dc7f4d2283f4605b127b9364e21148__1_8__attempt_0/db/83da0fc3-53ee-4a1e-8056-2e5c51593146: No such file or directory!
>     at java.base/java.util.concurrent.FutureTask.report(FutureTask.java:122) ~[?:?]
>     at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:191) ~[?:?]
>     at org.apache.flink.util.concurrent.FutureUtils.runIfNotDoneAndGet(FutureUtils.java:511) ~[flink-dist-2.0-SNAPSHOT.jar:2.0-SNAPSHOT]
>     at org.apache.flink.streaming.api.operators.OperatorSnapshotFinalizer.<init>(OperatorSnapshotFinalizer.java:54) ~[flink-dist-2.0-SNAPSHOT.jar:2.0-SNAPSHOT]
>     at org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.finalizeNonFinishedSnapshots(AsyncCheckpointRunnable.java:191) ~[flink-dist-2.0-SNAPSHOT.jar:2.0-SNAPSHOT]
>     at org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.run(AsyncCheckpointRunnable.java:124) ~[flink-dist-2.0-SNAPSHOT.jar:2.0-SNAPSHOT]
>     at org.apache.flink.util.MdcUtils.lambda$wrapRunnable$1(MdcUtils.java:70) ~[flink-dist-2.0-SNAPSHOT.jar:2.0-SNAPSHOT]
>     at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
>     at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
>     at java.base/java.lang.Thread.run(Thread.java:829) [?:?]
> Caused by: java.io.FileNotFoundException: oss://xxxxxxxxxxxxxxxx/abcee1e32ecbdd340adf4139658a1737/shared/op_AsyncKeyedProcessOperator_f6dc7f4d2283f4605b127b9364e21148__1_8__attempt_0/db/83da0fc3-53ee-4a1e-8056-2e5c51593146: No such file or directory!
>     at org.apache.hadoop.fs.aliyun.oss.AliyunOSSFileSystem.getFileStatus(AliyunOSSFileSystem.java:283) ~[flink-oss-fs-hadoop-2.0-SNAPSHOT.jar:2.0-SNAPSHOT]
>     at org.apache.flink.fs.osshadoop.common.HadoopFileSystem.getFileStatus(HadoopFileSystem.java:88) ~[flink-oss-fs-hadoop-2.0-SNAPSHOT.jar:2.0-SNAPSHOT]
>     at org.apache.flink.core.fs.SafetyNetWrapperFileSystem.getFileStatus(SafetyNetWrapperFileSystem.java:78) ~[flink-dist-2.0-SNAPSHOT.jar:2.0-SNAPSHOT]
>     at org.apache.flink.state.forst.fs.ForStFlinkFileSystem.getFileStatus(ForStFlinkFileSystem.java:287) ~[flink-dist-2.0-SNAPSHOT.jar:2.0-SNAPSHOT]
>     at org.apache.flink.state.forst.fs.ForStFlinkFileSystem.listStatus(ForStFlinkFileSystem.java:325) ~[flink-dist-2.0-SNAPSHOT.jar:2.0-SNAPSHOT]
>     at org.apache.flink.state.forst.fs.StringifiedForStFileSystem.listStatus(StringifiedForStFileSystem.java:52) ~[flink-dist-2.0-SNAPSHOT.jar:2.0-SNAPSHOT]
>     at org.forstdb.RocksDB.enableFileDeletions(Native Method) ~[flink-dist-2.0-SNAPSHOT.jar:2.0-SNAPSHOT]
>     at org.forstdb.RocksDB.enableFileDeletions(RocksDB.java:4312) ~[flink-dist-2.0-SNAPSHOT.jar:2.0-SNAPSHOT]
>     at org.apache.flink.state.forst.snapshot.ForStNativeFullSnapshotStrategy.lambda$syncPrepareResources$3(ForStNativeFullSnapshotStrategy.java:187) ~[flink-dist-2.0-SNAPSHOT.jar:2.0-SNAPSHOT]
>     at org.apache.flink.state.forst.snapshot.ForStSnapshotStrategyBase$ForStNativeSnapshotResources.release(ForStSnapshotStrategyBase.java:299) ~[flink-dist-2.0-SNAPSHOT.jar:2.0-SNAPSHOT]
>     at org.apache.flink.state.forst.snapshot.ForStIncrementalSnapshotStrategy$ForStIncrementalSnapshotOperation.get(ForStIncrementalSnapshotStrategy.java:287) ~[flink-dist-2.0-SNAPSHOT.jar:2.0-SNAPSHOT]
>     at org.apache.flink.runtime.state.SnapshotStrategyRunner$1.callInternal(SnapshotStrategyRunner.java:91) ~[flink-dist-2.0-SNAPSHOT.jar:2.0-SNAPSHOT]
>     at org.apache.flink.runtime.state.SnapshotStrategyRunner$1.callInternal(SnapshotStrategyRunner.java:88) ~[flink-dist-2.0-SNAPSHOT.jar:2.0-SNAPSHOT]
>     at org.apache.flink.runtime.state.AsyncSnapshotCallable.call(AsyncSnapshotCallable.java:78) ~[flink-dist-2.0-SNAPSHOT.jar:2.0-SNAPSHOT]
>     at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
>     at org.apache.flink.util.concurrent.FutureUtils.runIfNotDoneAndGet(FutureUtils.java:508) ~[flink-dist-2.0-SNAPSHOT.jar:2.0-SNAPSHOT]
>     ... 7 more
> {code}
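
The failure mode in the description and trace (a `listStatus` call that asks for the status of every known file, including one that is still being written) can be reproduced in miniature. The sketch below is only an illustration under stated assumptions: `FakeObjectStore`, `listStrict`, and `listTolerant` are hypothetical names, not Flink or ForSt classes, and the tolerant variant is one possible way to avoid the error, not necessarily the change that resolved this issue.

```java
import java.io.FileNotFoundException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: none of these types are Flink or ForSt classes. */
final class ListingVisibilitySketch {

    /** Hypothetical stand-in for an object store such as OSS or S3. */
    static final class FakeObjectStore {
        // file name -> true once the writer has closed the file and it is visible
        private final Map<String, Boolean> files = new LinkedHashMap<>();

        void startWriting(String name) { files.put(name, false); }

        void finishWriting(String name) { files.put(name, true); }

        /** Names the database knows about, including files still being written. */
        List<String> knownFiles() { return new ArrayList<>(files.keySet()); }

        /** An unfinished (or unknown) file has no status yet, as in the trace. */
        String getFileStatus(String name) throws FileNotFoundException {
            if (!Boolean.TRUE.equals(files.get(name))) {
                throw new FileNotFoundException(name + ": No such file or directory!");
            }
            return name;
        }
    }

    /** Strict listing: a single invisible file fails the whole call. */
    static List<String> listStrict(FakeObjectStore store) throws FileNotFoundException {
        List<String> result = new ArrayList<>();
        for (String name : store.knownFiles()) {
            result.add(store.getFileStatus(name));
        }
        return result;
    }

    /** Tolerant listing: files that are not visible yet are skipped. */
    static List<String> listTolerant(FakeObjectStore store) {
        List<String> result = new ArrayList<>();
        for (String name : store.knownFiles()) {
            try {
                result.add(store.getFileStatus(name));
            } catch (FileNotFoundException e) {
                // The file is still being written; leave it out of the listing.
            }
        }
        return result;
    }

    public static void main(String[] args) {
        FakeObjectStore store = new FakeObjectStore();
        store.startWriting("000001.sst");
        store.finishWriting("000001.sst");
        store.startWriting("000002.sst"); // still open when the listing happens

        System.out.println("tolerant listing: " + listTolerant(store));
        try {
            listStrict(store);
        } catch (FileNotFoundException e) {
            System.out.println("strict listing failed: " + e.getMessage());
        }
    }
}
```

In this toy model the strict listing fails as soon as one file is still open, even though every completed file is intact, which is the same shape as the checkpoint failure reported here.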



--
This message was sent by Atlassian Jira
(v8.20.10#820010)