Default Restart Strategy Not Work With Checkpointing

Paul Lam Wed, 01 Aug 2018 05:08:05 -0700

Hi, 
I’m running a Flink 1.5.0 standalone cluster with `restart-strategy` set to 
`failure-rate`. The web frontend shows that the JobManager and the 
TaskManagers have picked up this configuration, but streaming jobs with 
checkpointing enabled still use the fixed delay strategy and ignore the 
configured default restart strategy (the user code does not override it).


I read the source code and found a possible explanation (though I'm not 
completely sure): the client generates the JobGraph without consulting 
flink-conf.yaml and sets the restart strategy to fixed delay if checkpointing 
is on. The server side (JobMaster) does read the default restart strategy from 
flink-conf.yaml, but gives the one in the JobGraph higher priority, so the 
configured default is always overridden by the fixed delay strategy. 
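
The suspected precedence can be sketched in plain Java. This is a simplified 
model of the behavior described above, not Flink code: the class name, the 
method names, and the use of plain strings for strategies are all made up for 
illustration, and the real logic may differ.

```java
import java.util.Optional;

/** Simplified, hypothetical model of the suspected behavior; not Flink code. */
final class RestartStrategyPrecedence {

    /**
     * Client side (assumption): builds the job-level setting without reading
     * flink-conf.yaml. With checkpointing on and nothing set in user code,
     * it falls back to fixed delay.
     */
    static Optional<String> clientJobStrategy(Optional<String> userCodeStrategy,
                                              boolean checkpointingEnabled) {
        if (userCodeStrategy.isPresent()) {
            return userCodeStrategy;
        }
        return checkpointingEnabled ? Optional.of("fixed-delay") : Optional.empty();
    }

    /**
     * Server side (assumption): a strategy carried by the job wins; the
     * cluster default from flink-conf.yaml is only used when the job has none.
     */
    static String effectiveStrategy(Optional<String> jobStrategy, String clusterDefault) {
        return jobStrategy.orElse(clusterDefault);
    }

    public static void main(String[] args) {
        String clusterDefault = "failure-rate";

        // Checkpointing on, no override in user code: the default is ignored.
        System.out.println(effectiveStrategy(
                clientJobStrategy(Optional.empty(), true), clusterDefault));
        // prints "fixed-delay"

        // Checkpointing off, no override in user code: the default applies.
        System.out.println(effectiveStrategy(
                clientJobStrategy(Optional.empty(), false), clusterDefault));
        // prints "failure-rate"
    }
}
```

In this model the cluster default can only take effect when the client leaves 
the job-level strategy unset, which never happens once checkpointing is 
enabled.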

If I understand correctly, this might be a bug. Is there any suggestion for 
working around it for now?

Best regards,
Paul Lam
