[jira] [Commented] (FLINK-5214) Clean up checkpoint files when failing checkpoint operation on TM

Xiaogang Shi (JIRA) Wed, 30 Nov 2016 17:43:47 -0800

    [ 
https://issues.apache.org/jira/browse/FLINK-5214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15710512#comment-15710512
 ]


Xiaogang Shi commented on FLINK-5214:
-------------------------------------

I opened FLINK-5086 to report a similar problem, but I do not have a good idea 
of how to resolve it. 

Because the JM does not know about the existence of these checkpoint files, it 
seems only the TM can delete them. But a failed TM may not be recovered by the 
JM if the number of retries exceeds the given limit, so these files will not be 
deleted in such cases.

One possible solution, I think, is to let each TM return a handler to the JM 
when the TM is registered. The JM can then use the handler to clean up the 
files even when the TM fails. 
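
A rough sketch of the handler idea, only to make the proposal concrete. All 
names here (CheckpointCleanupHandle, DirectoryCleanupHandle, CleanupRegistry) 
are hypothetical and are not existing Flink classes. It also assumes that each 
TM writes its checkpoint files under one directory and that the JM can reach 
that directory, e.g. on a shared file system.

```java
import java.io.IOException;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Comparator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Stream;

/** Hypothetical handle a TM hands to the JM at registration; not a Flink API. */
interface CheckpointCleanupHandle extends Serializable {
    void cleanUp() throws IOException;
}

/** Deletes everything under the directory a TM writes its checkpoint files to. */
final class DirectoryCleanupHandle implements CheckpointCleanupHandle {
    private static final long serialVersionUID = 1L;

    // Kept as a String so the handle can be serialized and shipped to the JM.
    private final String checkpointDir;

    DirectoryCleanupHandle(String checkpointDir) {
        this.checkpointDir = checkpointDir;
    }

    @Override
    public void cleanUp() throws IOException {
        Path root = Paths.get(checkpointDir);
        if (!Files.exists(root)) {
            return; // nothing was written, or it was already cleaned up
        }
        try (Stream<Path> paths = Files.walk(root)) {
            // Reverse order visits children before their parent directories.
            paths.sorted(Comparator.reverseOrder()).forEach(p -> {
                try {
                    Files.delete(p);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        } catch (UncheckedIOException e) {
            throw e.getCause();
        }
    }
}

/** Stand-in for the JM-side bookkeeping of registered TMs (assumption). */
final class CleanupRegistry {
    private final Map<String, CheckpointCleanupHandle> handles = new ConcurrentHashMap<>();

    /** Called when a TM registers and returns its handle. */
    void registerTaskManager(String tmId, CheckpointCleanupHandle handle) {
        handles.put(tmId, handle);
    }

    /** Called when a TM is given up on; returns true if a handle was found and run. */
    boolean onTaskManagerFailed(String tmId) throws IOException {
        CheckpointCleanupHandle handle = handles.remove(tmId);
        if (handle == null) {
            return false;
        }
        handle.cleanUp();
        return true;
    }
}
```

The point of the design is that the JM does not need to know the individual 
file names; it only keeps one handle per TM and invokes it once the TM is 
declared failed.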

Another solution is to recover the TM when the number of retries exceeds the 
limit. Once the TM is recovered, the only thing it does is clean up the 
checkpoint files.

Do you have any better ideas?

> Clean up checkpoint files when failing checkpoint operation on TM
> -----------------------------------------------------------------
>
>                 Key: FLINK-5214
>                 URL: https://issues.apache.org/jira/browse/FLINK-5214
>             Project: Flink
>          Issue Type: Bug
>          Components: TaskManager
>    Affects Versions: 1.2.0, 1.1.3
>            Reporter: Till Rohrmann
>            Assignee: Till Rohrmann
>             Fix For: 1.2.0, 1.1.4
>
>
> When the {{StreamTask#performCheckpoint}} operation fails on a 
> {{TaskManager}}, potentially created checkpoint files are not cleaned up. 
> This should be changed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
