Re: failure-rate restart strategy not working?

Aljoscha Krettek Mon, 09 Jan 2017 07:58:45 -0800

Hi,
did you create a Jira issue for this? (I'm just getting up to speed after
vacation, so sorry if you already did; I haven't read the Jira mail yet.)


Cheers,
Aljoscha

On Fri, 6 Jan 2017 at 19:08 Stephan Ewen <se...@apache.org> wrote:

> I think you are right, enabling checkpointing should not override the
> cluster settings per se.
>
> This is probably an unwanted artifact of the way that configuration
> currently works: Settings explicitly set in the program trump the
> cluster-defaults (in the config). Since activating checkpointing sets a
> strategy in the ExecutionConfig (program), it overrides the cluster default.
>
> It is definitely not intended in that case. For that specific case, it
> makes sense to simply leave the restart strategy "undefined" and use the
> "fixed delay" one at runtime if none other is specified.
>
> Stephan
>
>
>
>
> On Fri, Jan 6, 2017 at 6:44 PM, Shannon Carey <sca...@expedia.com> wrote:
>
> I think I figured it out: the problem is due to Flink's behavior when a
> job has checkpointing enabled.
>
> When the job graph is created, if checkpointing is enabled but a restart
> strategy hasn't been programmatically configured, Flink changes the job
> graph's execution config to use the fixed delay restart strategy. That gets
> serialized with the job graph. Then, when the JobManager deserializes the
> execution config, it sees that there's a restart strategy configured for
> the job and uses that instead of using the restart strategy that's
> configured on the cluster.
>
> Clearly, the documentation needs to be adjusted. Maybe I can
> add some changes to https://github.com/apache/flink/pull/3059
>
> However, should we also consider some implementation changes? Is it
> intentional that enabling checkpointing overrides the restart strategy set on
> the cluster, and that the only way to control the restart strategy on a
> checkpointed job is to set it programmatically? If not, then would it be
> reasonable to only set fixed-delay restart strategy if checkpointing is
> enabled AND the cluster doesn't explicitly configure it? Flink would no
> longer use the execution config to control the strategy, but would
> instead do it in JobManager#submitJob().
>
> -Shannon
>
> From: Shannon Carey <sca...@expedia.com>
> Date: Thursday, January 5, 2017 at 1:50 PM
> To: "user@flink.apache.org" <user@flink.apache.org>
> Subject: failure-rate restart strategy not working?
>
> I recently updated my cluster with the following config:
>
> restart-strategy: failure-rate
> restart-strategy.failure-rate.max-failures-per-interval: 3
> restart-strategy.failure-rate.failure-rate-interval: 5 min
> restart-strategy.failure-rate.delay: 10 s
>
> I see the settings inside the JobManager web UI, as expected. I am not
> setting the restart-strategy programmatically, but the job does have
> checkpointing enabled.
>
> However, if I launch a job that (intentionally) fails every 10 seconds by
> throwing a RuntimeException, it continues to restart beyond the limit of 3
> failures.
>
> Does anyone know why this might be happening? Any ideas of things I could
> check?
>
> Thanks!
> Shannon
>
>
>
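
Editorial note: for readers unsure what the failure-rate settings quoted in
the original message are expected to do, here is a minimal sketch of a
sliding-window failure-rate check. It is not Flink's implementation; the class
FailureRateSketch and its methods are hypothetical names, and the exact
boundary behavior in Flink may differ. It only illustrates the expectation
that a job failing every 10 seconds should stop being restarted once it
exceeds 3 failures within 5 minutes.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Illustrative sliding-window failure-rate check; hypothetical, not Flink code. */
public class FailureRateSketch {
    private final int maxFailuresPerInterval;
    private final long intervalMillis;
    // Timestamps of failures that are still inside the current window.
    private final Deque<Long> failureTimes = new ArrayDeque<>();

    FailureRateSketch(int maxFailuresPerInterval, long intervalMillis) {
        this.maxFailuresPerInterval = maxFailuresPerInterval;
        this.intervalMillis = intervalMillis;
    }

    /** Records a failure at nowMillis; returns true if a restart is still allowed. */
    boolean onFailure(long nowMillis) {
        // Forget failures that have fallen out of the window.
        while (!failureTimes.isEmpty()
                && nowMillis - failureTimes.peekFirst() > intervalMillis) {
            failureTimes.pollFirst();
        }
        failureTimes.addLast(nowMillis);
        return failureTimes.size() <= maxFailuresPerInterval;
    }

    /** Simulates a job failing at a fixed period and counts the restarts granted. */
    static int restartsBeforeGivingUp(int maxFailures, long intervalMillis,
                                      long failureEveryMillis, int maxSimulatedFailures) {
        FailureRateSketch strategy = new FailureRateSketch(maxFailures, intervalMillis);
        int restarts = 0;
        for (int i = 0; i < maxSimulatedFailures; i++) {
            if (!strategy.onFailure(i * failureEveryMillis)) {
                break; // limit exceeded: the job would fail permanently
            }
            restarts++;
        }
        return restarts;
    }

    public static void main(String[] args) {
        // Values from the thread: 3 failures per 5 min, job fails every 10 s.
        System.out.println(restartsBeforeGivingUp(3, 300_000L, 10_000L, 100));
    }
}
```

Under these assumptions the simulated job is restarted three times and gives
up on the fourth failure, which is the behavior Shannon expected but did not
observe.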

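Editorial note: the precedence issue discussed in this thread can be modeled
in a few lines. The sketch below is a toy model, not Flink code: the class
RestartStrategyPrecedence, the JobConfig type, and the strategy names as plain
strings are all hypothetical, and the "none" fallback for a job with neither
checkpointing nor any configured strategy is an assumption. resolveCurrent
mirrors the behavior Shannon describes (checkpointing writes fixed-delay into
the job's config, which then trumps the cluster default); resolveProposed
mirrors the suggestion to leave the strategy undefined and fall back to
fixed-delay only at runtime.

```java
import java.util.Optional;

/** Toy model of the restart-strategy precedence discussed here; not Flink code. */
public class RestartStrategyPrecedence {

    /** Hypothetical stand-in for the settings carried by a submitted job. */
    static final class JobConfig {
        final boolean checkpointingEnabled;
        final Optional<String> programStrategy; // set explicitly in the program, if at all

        JobConfig(boolean checkpointingEnabled, Optional<String> programStrategy) {
            this.checkpointingEnabled = checkpointingEnabled;
            this.programStrategy = programStrategy;
        }
    }

    /** Reported behavior: checkpointing stores fixed-delay in the job at graph creation. */
    static String resolveCurrent(JobConfig job, Optional<String> clusterStrategy) {
        Optional<String> serialized = job.programStrategy;
        if (!serialized.isPresent() && job.checkpointingEnabled) {
            serialized = Optional.of("fixed-delay"); // implicit, not chosen by the user
        }
        // Anything stored in the job trumps the cluster default.
        return serialized.orElse(clusterStrategy.orElse("none"));
    }

    /** Suggested behavior: keep the job's strategy undefined and decide at runtime. */
    static String resolveProposed(JobConfig job, Optional<String> clusterStrategy) {
        if (job.programStrategy.isPresent()) {
            return job.programStrategy.get();
        }
        if (clusterStrategy.isPresent()) {
            return clusterStrategy.get();
        }
        return job.checkpointingEnabled ? "fixed-delay" : "none";
    }

    public static void main(String[] args) {
        JobConfig job = new JobConfig(true, Optional.empty());
        Optional<String> cluster = Optional.of("failure-rate");
        System.out.println("current:  " + resolveCurrent(job, cluster));  // fixed-delay
        System.out.println("proposed: " + resolveProposed(job, cluster)); // failure-rate
    }
}
```

Until the behavior changes, the workaround noted in the thread is to set the
restart strategy programmatically on the job, since that is the one setting
that takes precedence in both variants.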