[ https://issues.apache.org/jira/browse/FLINK-30863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17695042#comment-17695042 ]

Yanfei Lei commented on FLINK-30863:
------------------------------------

Revisiting the FileNotFound problem in local recovery, I found that my previous
analysis was incorrect:

For a given checkpoint, notifyAbort() cannot arrive on the TM after
notifyComplete().

If the TM materializes before receiving confirm(), the previously uploaded queue
in `FsStateChangelogWriter` is cleared, so the local files of the completed
checkpoint are not registered. The JM-owned files, by contrast, are registered
before confirm() and do not depend on the uploaded queue. As a result, the local
files are deleted while the DFS files are still there.
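To make the ordering concrete, here is a toy model of the scenario. It is a hypothetical simplification, not Flink's `FsStateChangelogWriter`: the names `ToyChangelogWriter`, `upload`, `materialize`, `confirm` and `register_before_confirm` are invented for illustration, and the model assumes that local files which are still unregistered are discarded when the uploaded queue is cleared.

```python
class ToyChangelogWriter:
    """Hypothetical model of the upload queue; not Flink's actual writer."""

    def __init__(self, register_before_confirm):
        self.register_before_confirm = register_before_confirm
        self.uploaded = {}            # checkpoint id -> local files awaiting confirm()
        self.registered_local = set() # local files known to the local recovery registry
        self.deleted_local = set()    # local files discarded as garbage

    def upload(self, checkpoint_id, local_file):
        self.uploaded.setdefault(checkpoint_id, []).append(local_file)
        if self.register_before_confirm:
            # Proposed behaviour: register the local file before confirm().
            self.registered_local.add(local_file)

    def materialize(self):
        # Assumption: clearing the queue drops any local file not yet registered.
        for files in self.uploaded.values():
            for f in files:
                if f not in self.registered_local:
                    self.deleted_local.add(f)
        self.uploaded.clear()

    def confirm(self, checkpoint_id):
        # Late registration: only files still in the queue can be registered here.
        for f in self.uploaded.pop(checkpoint_id, []):
            self.registered_local.add(f)


def local_file_survives(register_before_confirm):
    w = ToyChangelogWriter(register_before_confirm)
    w.upload(1, "cp1-local-changelog")
    w.materialize()  # materialization happens before confirm() reaches the TM
    w.confirm(1)     # in the late-registration case the queue is already empty
    return "cp1-local-changelog" not in w.deleted_local
```

With registration at confirm() time the local file is lost (`local_file_survives(False)` is `False`); registering before confirm() keeps it (`local_file_survives(True)` is `True`).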

I added `testLocalFileAfterMaterialize` to simulate this scenario. I think local
files should be registered before confirm() to avoid this problem.
[~roman] [~Feifan Wang], could you please take a look again?

 

> Do not delete the local changelog file of aborted checkpoint
> ------------------------------------------------------------
>
>                 Key: FLINK-30863
>                 URL: https://issues.apache.org/jira/browse/FLINK-30863
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing, Runtime / State Backends
>    Affects Versions: 1.17.0
>            Reporter: Yanfei Lei
>            Assignee: Yanfei Lei
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: tm-log_fail_cl_local_recovery.txt
>
>
> Do not delete the local changelog file of an aborted checkpoint, because this
> checkpoint may contain files of the previous checkpoint that would be used by
> local recovery. The local files of the aborted checkpoint will be deleted when
> the next checkpoint completes, or when the entire allocation folder is deleted
> as the TM process exits.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)