vinoyang created FLINK-9352:
-------------------------------

             Summary: In Standalone checkpoint recover mode many jobs with same 
checkpoint interval cause IO pressure
                 Key: FLINK-9352
                 URL: https://issues.apache.org/jira/browse/FLINK-9352
             Project: Flink
          Issue Type: Bug
          Components: State Backends, Checkpointing
            Reporter: vinoyang
            Assignee: vinoyang


Currently, the CheckpointCoordinator's startCheckpointScheduler uses 
*baseInterval* as the initialDelay parameter when it schedules periodic 
checkpoints. *baseInterval* is also the checkpoint interval. 

In standalone checkpoint mode, many jobs are configured with the same 
checkpoint interval. When all jobs are recovered at once (after a cluster 
restart or a JobManager leadership switch), their checkpoint periods tend to 
align: every job's CheckpointCoordinator starts, and then triggers 
checkpoints, at approximately the same point in time.

In our production scenario, this caused high IO cost within the same time period.

I suggest exposing the initial delay parameter of scheduleAtFixedRate as an API 
config option, so that users can scatter checkpoints in this scenario.
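
To illustrate the idea, here is a minimal sketch using only 
java.util.concurrent. It is not Flink's actual code: the class 
ScatteredCheckpointScheduler and the methods randomInitialDelay and start are 
hypothetical names, and picking the delay at random from a range is just one 
possible way to scatter the jobs. The only point is that the initial delay is 
passed separately from the interval instead of reusing *baseInterval*.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch, not part of Flink: decouples the initial delay of the
// periodic trigger from the checkpoint interval.
public class ScatteredCheckpointScheduler {

    // Assumption: the user configures a [minDelayMs, maxDelayMs] range and each
    // job draws its own initial delay from it (both bounds inclusive).
    static long randomInitialDelay(long minDelayMs, long maxDelayMs) {
        if (minDelayMs < 0 || maxDelayMs < minDelayMs || maxDelayMs == Long.MAX_VALUE) {
            throw new IllegalArgumentException("invalid initial delay range");
        }
        return ThreadLocalRandom.current().nextLong(minDelayMs, maxDelayMs + 1);
    }

    // Schedules the trigger with an explicit initial delay; the period is still
    // the checkpoint interval (baseIntervalMs).
    static ScheduledFuture<?> start(
            ScheduledExecutorService timer,
            Runnable trigger,
            long initialDelayMs,
            long baseIntervalMs) {
        return timer.scheduleAtFixedRate(
                trigger, initialDelayMs, baseIntervalMs, TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) throws InterruptedException {
        ScheduledExecutorService timer = Executors.newScheduledThreadPool(1);
        long baseIntervalMs = 200;

        // Three "jobs" recovered at the same moment, all with the same interval.
        for (int job = 1; job <= 3; job++) {
            final int id = job;
            long initialDelayMs = randomInitialDelay(0, baseIntervalMs);
            System.out.println("job " + id + " initial delay: " + initialDelayMs + " ms");
            start(timer,
                    () -> System.out.println("job " + id + " triggers checkpoint"),
                    initialDelayMs,
                    baseIntervalMs);
        }

        Thread.sleep(1000);   // let a few rounds run
        timer.shutdownNow();  // stop all periodic triggers
    }
}
```

With a fixed initial delay equal to *baseInterval*, all three jobs above would 
fire together in every round. With per-job delays they keep the same period 
but stay offset from each other.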

 

cc [~StephanEwen] [~Zentol]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
