[ https://issues.apache.org/jira/browse/FLINK-7844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16207203#comment-16207203 ]
ASF GitHub Bot commented on FLINK-7844:
---------------------------------------

GitHub user tillrohrmann opened a pull request:

    https://github.com/apache/flink/pull/4844

[FLINK-7844] [ckPt] Fail unacknowledged pending checkpoints for fine grained recovery

## What is the purpose of the change

This commit fails all pending checkpoints which have not been acknowledged by the failed task in case of fine-grained recovery. This is done in order to avoid long checkpoint timeouts which might block the CheckpointCoordinator from triggering new checkpoints.

## Brief change log

- Introduce `CheckpointCoordinator#failUnacknowledgedPendingCheckpointsFor` to fail all unacknowledged pending checkpoints for a given `ExecutionAttemptID`
- Fail unacknowledged pending checkpoints in `ExecutionGraph#notifyExecutionChange`

(See the sketch after this description for an illustration of the new method.)

## Verifying this change

- `IndividualRestartsConcurrencyTest#testLocalFailureFailsPendingCheckpoints` tests that unacknowledged pending checkpoints are discarded and removed from the `CheckpointCoordinator` in case of a local failure

## Does this pull request potentially affect one of the following parts:

- Dependencies (does it add or upgrade a dependency): (no)
- The public API, i.e., is any changed class annotated with `@Public(Evolving)`: (no)
- The serializers: (no)
- The runtime per-record code paths (performance sensitive): (no)
- Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (yes)

## Documentation

- Does this pull request introduce a new feature? (no)
- If yes, how is the feature documented? (not applicable)

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tillrohrmann/flink failCheckpointsFineGrainedRecovery

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/4844.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #4844

----

commit 454066b2f606f83e159e553506d58ce3a49a256d
Author: Till <till.rohrm...@gmail.com>
Date:   2017-10-17T08:57:37Z

    [FLINK-7844] [ckPt] Fail unacknowledged pending checkpoints for fine grained recovery

    This commit fails all pending checkpoints which have not been acknowledged by the failed task in case of fine-grained recovery. This is done in order to avoid long checkpoint timeouts which might block the CheckpointCoordinator from triggering new checkpoints.

----
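To make the change log concrete, here is a minimal, self-contained sketch of the idea behind `failUnacknowledgedPendingCheckpointsFor`. All types and helpers below (`ExecutionAttemptID`, `PendingCheckpoint`, `isAcknowledgedBy`, `abort`, and the map-based bookkeeping) are simplified stand-ins for illustration, not Flink's actual internals:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Map;
import java.util.Set;

/** Opaque stand-in for Flink's task attempt identifier. */
final class ExecutionAttemptID { }

/** Simplified stand-in for a checkpoint that is still collecting acknowledgements. */
final class PendingCheckpoint {

    private final long checkpointId;
    private final Set<ExecutionAttemptID> notYetAcknowledged;

    PendingCheckpoint(long checkpointId, Set<ExecutionAttemptID> tasksToAcknowledge) {
        this.checkpointId = checkpointId;
        this.notYetAcknowledged = new HashSet<>(tasksToAcknowledge);
    }

    /** True if the given execution has already sent its acknowledgement. */
    boolean isAcknowledgedBy(ExecutionAttemptID executionAttemptId) {
        return !notYetAcknowledged.contains(executionAttemptId);
    }

    /** Discards the checkpoint; the real implementation would also release state handles. */
    void abort(Throwable cause) {
        notYetAcknowledged.clear();
        System.out.println("Aborted checkpoint " + checkpointId + ": " + cause.getMessage());
    }
}

final class CheckpointCoordinator {

    private final Map<Long, PendingCheckpoint> pendingCheckpoints = new HashMap<>();

    void registerPendingCheckpoint(long checkpointId, PendingCheckpoint checkpoint) {
        pendingCheckpoints.put(checkpointId, checkpoint);
    }

    /**
     * Fails every pending checkpoint that the given (failed) execution has not
     * yet acknowledged, so that a local task failure does not leave checkpoints
     * lingering until the checkpoint timeout fires. Per the change log above,
     * this would be invoked from ExecutionGraph#notifyExecutionChange when an
     * execution fails under fine-grained recovery.
     */
    void failUnacknowledgedPendingCheckpointsFor(ExecutionAttemptID executionAttemptId, Throwable cause) {
        Iterator<Map.Entry<Long, PendingCheckpoint>> it = pendingCheckpoints.entrySet().iterator();
        while (it.hasNext()) {
            PendingCheckpoint checkpoint = it.next().getValue();
            if (!checkpoint.isAcknowledgedBy(executionAttemptId)) {
                checkpoint.abort(cause);
                it.remove();
            }
        }
    }
}
```

Checkpoints already acknowledged by the failed task are deliberately left alone: they can still complete, so only the unacknowledged ones are discarded.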
> Fine Grained Recovery triggers checkpoint timeout failure
> ---------------------------------------------------------
>
>                 Key: FLINK-7844
>                 URL: https://issues.apache.org/jira/browse/FLINK-7844
>             Project: Flink
>          Issue Type: Bug
>          Components: State Backends, Checkpointing
>    Affects Versions: 1.4.0, 1.3.2
>            Reporter: Zhenzhong Xu
>            Assignee: Zhenzhong Xu
>         Attachments: screenshot-1.png
>
> Context:
> We are using the "individual" (fine-grained) failover recovery strategy for our embarrassingly parallel router use case. The topic has over 2000 partitions, and parallelism is set to ~180, dispatched to over 20 task managers with around 180 slots.
>
> Observations:
> We've noticed that after one task manager termination, the individual recovery happens correctly and the workload is re-dispatched to a newly available task manager instance. However, the checkpoint takes 10 minutes to eventually time out, preventing all other task managers from committing checkpoints.
> In a worst-case scenario, if the job gets restarted for other reasons (e.g. a job manager termination), this causes more messages to be re-processed/duplicated compared to a job without fine-grained recovery enabled.
> I suspect the uber checkpoint was waiting for a previous checkpoint initiated by the old task manager and was therefore taking a long time to time out.
>
> Two questions:
> 1. Is there a configuration that controls this checkpoint timeout?
> 2. When the Job Manager realizes that a Task Manager is gone and the workload has been re-dispatched, why does it still need to wait for the checkpoint initiated by the old task manager?
> Checkpoint screenshot in attachments.
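Regarding question 1: the checkpoint timeout is configurable per job via `CheckpointConfig#setCheckpointTimeout`. Its default is 600000 ms (10 minutes), which matches the observed 10-minute stall. A minimal example; the interval and timeout values here are arbitrary illustrations:

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointTimeoutExample {

    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Trigger a checkpoint every 60 seconds.
        env.enableCheckpointing(60_000L);

        // Declare a checkpoint as failed if it does not complete within
        // 2 minutes. The default is 600_000 ms (10 minutes), which is why
        // the lingering checkpoint described above took 10 minutes to time out.
        env.getCheckpointConfig().setCheckpointTimeout(120_000L);
    }
}
```

Regarding question 2: that is precisely what the pull request above changes. Instead of waiting for the timeout to fire, pending checkpoints that the failed execution has not acknowledged are failed immediately during fine-grained recovery.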