[ https://issues.apache.org/jira/browse/FLINK-9352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16493204#comment-16493204 ]
ASF GitHub Bot commented on FLINK-9352:
---------------------------------------

GitHub user yanghua opened a pull request:

    https://github.com/apache/flink/pull/6092

    [FLINK-9352] In Standalone checkpoint recover mode many jobs with same checkpoint interval cause IO pressure

    ## What is the purpose of the change

    *This pull request fixes a problem: in standalone checkpoint recovery mode, many jobs with the same checkpoint interval cause IO pressure*

    ## Brief change log

      - *Change the scheduler's initial delay from baseInterval to a random number between the minimum pause and the base interval*

    ## Verifying this change

    This change is a trivial rework / code cleanup without any test coverage.

    ## Does this pull request potentially affect one of the following parts:

      - Dependencies (does it add or upgrade a dependency): (yes / **no**)
      - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: (yes / **no**)
      - The serializers: (yes / **no** / don't know)
      - The runtime per-record code paths (performance sensitive): (yes / **no** / don't know)
      - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (yes / **no** / don't know)
      - The S3 file system connector: (yes / **no** / don't know)

    ## Documentation

      - Does this pull request introduce a new feature? (yes / **no**)
      - If yes, how is the feature documented? (not applicable / docs / JavaDocs / **not documented**)

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/yanghua/flink FLINK-9352

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/6092.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #6092

----
commit 1eb432833bf2dd23187194500b6e1c6523f30605
Author: yanghua <yanghua1127@...>
Date:   2018-05-29T07:59:48Z

    [FLINK-9352] In Standalone checkpoint recover mode many jobs with same checkpoint interval cause IO pressure

----

> In Standalone checkpoint recover mode many jobs with same checkpoint interval cause IO pressure
> -----------------------------------------------------------------------------------------------
>
>                 Key: FLINK-9352
>                 URL: https://issues.apache.org/jira/browse/FLINK-9352
>             Project: Flink
>          Issue Type: Improvement
>          Components: State Backends, Checkpointing
>   Affects Versions: 1.5.0, 1.4.2, 1.6.0
>            Reporter: vinoyang
>            Assignee: vinoyang
>            Priority: Major
>
> Currently, the periodic checkpoint coordinator's startCheckpointScheduler uses
> *baseInterval* as the initialDelay parameter. *baseInterval* is also the
> checkpoint interval.
> In standalone checkpoint mode, many jobs are configured with the same checkpoint
> interval. When all jobs are recovered (after a cluster restart or a JobManager
> leadership switch), their checkpoint periods tend to align: every job's
> CheckpointCoordinator starts and triggers at approximately the same time.
> This caused high IO cost within the same time period in our production
> scenario.
> I suggest exposing scheduleAtFixedRate's initial delay parameter as an API
> config, so that users can scatter checkpoints in this scenario.
>
> cc [~StephanEwen] [~Zentol]

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
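
For illustration only, the change described in the brief change log (a randomized initial delay for the periodic checkpoint trigger) could be sketched roughly as follows. This is not the code from the pull request: the class `ScatteredCheckpointStart` and both of its methods are hypothetical names, a plain `ScheduledExecutorService` stands in for the coordinator's timer, and the inclusive range [min pause, base interval] is an assumption based on the wording of the change log.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration; not the code from the pull request.
final class ScatteredCheckpointStart {

    // Picks an initial delay uniformly from [minPauseMillis, baseIntervalMillis].
    static long randomInitialDelay(long minPauseMillis, long baseIntervalMillis) {
        if (minPauseMillis < 0 || baseIntervalMillis <= 0 || baseIntervalMillis < minPauseMillis) {
            throw new IllegalArgumentException("expected 0 <= minPause <= baseInterval, baseInterval > 0");
        }
        // The upper bound of nextLong is exclusive, so add 1 to make baseInterval reachable.
        return ThreadLocalRandom.current().nextLong(minPauseMillis, baseIntervalMillis + 1);
    }

    // Schedules the periodic trigger with a randomized first run instead of
    // using baseInterval as the initial delay.
    static ScheduledFuture<?> startCheckpointScheduler(
            ScheduledExecutorService timer,
            Runnable trigger,
            long minPauseMillis,
            long baseIntervalMillis) {
        long initialDelay = randomInitialDelay(minPauseMillis, baseIntervalMillis);
        return timer.scheduleAtFixedRate(
                trigger, initialDelay, baseIntervalMillis, TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) throws InterruptedException {
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        // Small values so the demo finishes quickly; real intervals are much longer.
        startCheckpointScheduler(
                timer, () -> System.out.println("checkpoint triggered"), 100L, 300L);
        Thread.sleep(1000L);
        timer.shutdownNow();
    }
}
```

With many jobs recovered at the same moment, each job would draw its own first trigger time, so the periodic triggers that follow stay offset from one another rather than all firing together one base interval after recovery.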