Hi, did you create a Jira issue for this? (I'm just getting up to speed after vacation so sorry if you already did this, I didn't yet read the Jira mail.)
Cheers, Aljoscah On Fri, 6 Jan 2017 at 19:08 Stephan Ewen <se...@apache.org> wrote: > I think you are right, enabling checkpointing should not override the > cluster settings per se. > > This is probably an unwanted artifact of the was that configuration > currently works: Setting explicitly set in the program trump the > cluster-defaults (in the config). Since activating checkpointing sets a > strategy in the ExecutionConfig (program), it overrides the cluster default. > > It is definitely not intended in that case. For that specific case, it > makes to simply leave the restart strategy "undefined" and use the "fixed > delay" one at runtime if none other is specified. > > Stephan > > > > > On Fri, Jan 6, 2017 at 6:44 PM, Shannon Carey <sca...@expedia.com> wrote: > > I think I figured it out: the problem is due to Flink's behavior when a > job has checkpointing enabled. > > When the job graph is created, if checkpointing is enabled but a restart > strategy hasn't been programmatically configured, Flink changes the job > graph's execution config to use the fixed delay restart strategy. That gets > serialized with the job graph. Then, when the JobManager deserializes the > execution config, it sees that there's a restart strategy configured for > the job and uses that instead of using the restart strategy that's > configured on the cluster. > > Clearly, the documentation definitely needs to be adjusted. Maybe I can > add some changes to https://github.com/apache/flink/pull/3059 > > However, should we also consider some implementation changes? Is it > intentional that enabling checkpoint overrides the restart strategy set on > the cluster, and that the only way to control the restart strategy on a > checkpointed job is to set it programmatically? If not, then would it be > reasonable to only set fixed-delay restart strategy if checkpointing is > enabled AND the cluster doesn't explicitly configure it? Flink would no > longer be use the execution config to control the strategy, but would > instead do it in the JobManager#submitJob(). > > -Shannon > > From: Shannon Carey <sca...@expedia.com> > Date: Thursday, January 5, 2017 at 1:50 PM > To: "user@flink.apache.org" <user@flink.apache.org> > Subject: failure-rate restart strategy not working? > > I recently updated my cluster with the following config: > > restart-strategy: failure-rate > restart-strategy.failure-rate.max-failures-per-interval: 3 > restart-strategy.failure-rate.failure-rate-interval: 5 min > restart-strategy.failure-rate.delay: 10 s > > I see the settings inside the JobManager web UI, as expected. I am not > setting the restart-strategy programmatically, but the job does have > checkpointing enabled. > > However, if I launch a job that (intentionally) fails every 10 seconds by > throwing a RuntimeException, it continues to restart beyond the limit of 3 > failures. > > Does anyone know why this might be happening? Any ideas of things I could > check? > > Thanks! > Shannon > > >