Re: failure-rate restart strategy not working?

Stephan Ewen Fri, 06 Jan 2017 10:09:11 -0800

I think you are right, enabling checkpointing should not override the
cluster settings per se.


This is probably an unwanted artifact of the was that configuration
currently works: Setting explicitly set in the program trump the
cluster-defaults (in the config). Since activating checkpointing sets a
strategy in the ExecutionConfig (program), it overrides the cluster default.

It is definitely not intended in that case. For that specific case, it
makes to simply leave the restart strategy "undefined" and use the "fixed
delay" one at runtime if none other is specified.

Stephan




On Fri, Jan 6, 2017 at 6:44 PM, Shannon Carey <sca...@expedia.com> wrote:

> I think I figured it out: the problem is due to Flink's behavior when a
> job has checkpointing enabled.
>
> When the job graph is created, if checkpointing is enabled but a restart
> strategy hasn't been programmatically configured, Flink changes the job
> graph's execution config to use the fixed delay restart strategy. That gets
> serialized with the job graph. Then, when the JobManager deserializes the
> execution config, it sees that there's a restart strategy configured for
> the job and uses that instead of using the restart strategy that's
> configured on the cluster.
>
> Clearly, the documentation definitely needs to be adjusted. Maybe I can
> add some changes to https://github.com/apache/flink/pull/3059
>
> However, should we also consider some implementation changes? Is it
> intentional that enabling checkpoint overrides the restart strategy set on
> the cluster, and that the only way to control the restart strategy on a
> checkpointed job is to set it programmatically? If not, then would it be
> reasonable to only set fixed-delay restart strategy if checkpointing is
> enabled AND the cluster doesn't explicitly configure it? Flink would no
> longer be use the execution config to control the strategy, but would
> instead do it in the JobManager#submitJob().
>
> -Shannon
>
> From: Shannon Carey <sca...@expedia.com>
> Date: Thursday, January 5, 2017 at 1:50 PM
> To: "user@flink.apache.org" <user@flink.apache.org>
> Subject: failure-rate restart strategy not working?
>
> I recently updated my cluster with the following config:
>
> restart-strategy: failure-rate
> restart-strategy.failure-rate.max-failures-per-interval: 3
> restart-strategy.failure-rate.failure-rate-interval: 5 min
> restart-strategy.failure-rate.delay: 10 s
>
> I see the settings inside the JobManager web UI, as expected. I am not
> setting the restart-strategy programmatically, but the job does have
> checkpointing enabled.
>
> However, if I launch a job that (intentionally) fails every 10 seconds by
> throwing a RuntimeException, it continues to restart beyond the limit of 3
> failures.
>
> Does anyone know why this might be happening? Any ideas of things I could
> check?
>
> Thanks!
> Shannon
>

Re: failure-rate restart strategy not working?

Reply via email to