[ https://issues.apache.org/jira/browse/FLINK-5063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15667321#comment-15667321 ]
ASF GitHub Bot commented on FLINK-5063:
---------------------------------------

GitHub user tillrohrmann opened a pull request:

    https://github.com/apache/flink/pull/2812

    [FLINK-5063] Discard state handles of declined or expired state handles

    Whenever the checkpoint coordinator receives an acknowledge checkpoint message
    which belongs to the job maintained by the checkpoint coordinator, it should
    either record the state handles for later processing or discard them to free
    the resources. The latter case can happen if a checkpoint has expired and late
    acknowledge checkpoint messages arrive. Furthermore, it can happen if a Task
    sent a decline checkpoint message while other Tasks were still drawing a
    checkpoint.

    This PR changes the behaviour such that state handles belonging to the job of
    the checkpoint coordinator are discarded if they could not be added to the
    PendingCheckpoint.

    Review @uce, @StephanEwen

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tillrohrmann/flink fixStateHandleCleanup

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/2812.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2812

----
commit c4c000d1b39de5617b6796eed524ce2a449100d3
Author: Till Rohrmann <trohrm...@apache.org>
Date:   2016-11-14T17:33:55Z

    [FLINK-5063] Discard state handles of declined or expired state handles

    Whenever the checkpoint coordinator receives an acknowledge checkpoint message
    which belongs to the job maintained by the checkpoint coordinator, it should
    either record the state handles for later processing or discard them to free
    the resources. The latter case can happen if a checkpoint has expired and late
    acknowledge checkpoint messages arrive. Furthermore, it can happen if a Task
    sent a decline checkpoint message while other Tasks were still drawing a
    checkpoint.

    This PR changes the behaviour such that state handles belonging to the job of
    the checkpoint coordinator are discarded if they could not be added to the
    PendingCheckpoint.

----

> State handles are not properly cleaned up for declined or expired checkpoints
> -----------------------------------------------------------------------------
>
>                 Key: FLINK-5063
>                 URL: https://issues.apache.org/jira/browse/FLINK-5063
>             Project: Flink
>          Issue Type: Bug
>          Components: State Backends, Checkpointing
>    Affects Versions: 1.2.0, 1.1.3
>            Reporter: Till Rohrmann
>            Assignee: Till Rohrmann
>            Priority: Critical
>             Fix For: 1.2.0, 1.1.4
>
>
> If a {{Checkpoint}} is declined or expires, the {{CheckpointCoordinator}}
> disposes the {{PendingCheckpoint}}. Disposing the {{PendingCheckpoint}}
> entails discarding all {{SubtaskStates}} registered so far by the
> acknowledged {{Tasks}}. However, all late-arriving acknowledge messages are
> simply ignored without properly discarding the transmitted state handles.
> This can clutter the checkpoint directory, since the checkpoint files of
> late or unknown acknowledge checkpoint messages are never deleted.
> I propose to properly discard the state handles at the
> {{CheckpointCoordinator}} when it receives a late acknowledge message or an
> acknowledge message for an unknown {{ExecutionAttemptID}} belonging to the
> job of the {{CheckpointCoordinator}}. However, checkpoint messages belonging
> to a different job won't be handled and are simply ignored.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
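
The behaviour proposed in the issue can be illustrated with a minimal, self-contained Java sketch. It is not the code from the pull request: every class and method name below (`SketchCoordinator`, `SketchPendingCheckpoint`, `SketchAck`, `SketchStateHandle`, `receiveAcknowledge`) is a hypothetical stand-in, and IDs are plain strings and longs for brevity. The sketch only shows the decision described above: record the state if the pending checkpoint accepts it, discard it if the message belongs to this job but cannot be recorded, and leave it untouched if the message belongs to another job. Duplicate acknowledgements are not treated specially here.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical stand-in for a state handle whose backing files can be deleted. */
final class SketchStateHandle {
    private boolean discarded;
    void discardState() { discarded = true; }
    boolean isDiscarded() { return discarded; }
}

/** Hypothetical acknowledge message; IDs are simplified to String/long. */
final class SketchAck {
    final String jobId;
    final long checkpointId;
    final String attemptId;
    final SketchStateHandle state;

    SketchAck(String jobId, long checkpointId, String attemptId, SketchStateHandle state) {
        this.jobId = jobId;
        this.checkpointId = checkpointId;
        this.attemptId = attemptId;
        this.state = state;
    }
}

/** Hypothetical pending checkpoint tracking which attempts still have to acknowledge. */
final class SketchPendingCheckpoint {
    private final Set<String> notYetAcknowledged;
    private final Map<String, SketchStateHandle> acknowledged = new HashMap<>();

    SketchPendingCheckpoint(Set<String> expectedAttempts) {
        this.notYetAcknowledged = new HashSet<>(expectedAttempts);
    }

    /** Returns true only if the state was recorded for an expected attempt. */
    boolean acknowledge(String attemptId, SketchStateHandle state) {
        if (!notYetAcknowledged.remove(attemptId)) {
            return false; // unknown attempt, or one that has already acknowledged
        }
        acknowledged.put(attemptId, state);
        return true;
    }

    /** Discards every state handle registered so far. */
    void discard() {
        for (SketchStateHandle handle : acknowledged.values()) {
            handle.discardState();
        }
        acknowledged.clear();
    }
}

/** Hypothetical coordinator for a single job. */
final class SketchCoordinator {
    private final String jobId;
    private final Map<Long, SketchPendingCheckpoint> pending = new HashMap<>();

    SketchCoordinator(String jobId) { this.jobId = jobId; }

    void startCheckpoint(long checkpointId, Set<String> expectedAttempts) {
        pending.put(checkpointId, new SketchPendingCheckpoint(expectedAttempts));
    }

    /** Models a declined or expired checkpoint: drop it and discard its registered state. */
    void abortCheckpoint(long checkpointId) {
        SketchPendingCheckpoint checkpoint = pending.remove(checkpointId);
        if (checkpoint != null) {
            checkpoint.discard();
        }
    }

    /** Returns true if the acknowledged state was recorded. */
    boolean receiveAcknowledge(SketchAck ack) {
        if (!jobId.equals(ack.jobId)) {
            return false; // different job: ignore the message and leave its state alone
        }
        SketchPendingCheckpoint checkpoint = pending.get(ack.checkpointId);
        if (checkpoint != null && checkpoint.acknowledge(ack.attemptId, ack.state)) {
            return true;
        }
        // Same job, but late or unknown: discard instead of silently dropping the message.
        if (ack.state != null) {
            ack.state.discardState();
        }
        return false;
    }
}
```

In this sketch, an acknowledgement that arrives after `abortCheckpoint` finds no pending checkpoint and has its state discarded immediately, which is the cleanup the issue reports as missing.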