Yes, you are right, Zhu Zhu. Extending the RestartBackoffTimeStrategyFactoryLoader to also load custom RestartBackoffTimeStrategies sounds like a good improvement for the future.
@Ken Krugler <kkrugler_li...@transpac.com>, the old RestartStrategy interface did not provide the cause of the failure, unfortunately.

Cheers,
Till

On Wed, May 13, 2020 at 7:55 AM Zhu Zhu <reed...@gmail.com> wrote:

> Hi Ken,
>
> Custom restart-strategy was an experimental feature and has been
> deprecated since 1.10. [1]
> That's why you cannot find any documentation for it.
>
> The old RestartStrategy was deprecated and replaced by
> RestartBackoffTimeStrategy in 1.10
> (unless you are using the legacy scheduler, which was also deprecated).
> The new restart strategy, RestartBackoffTimeStrategy, is able to know
> the exact failure cause.
> However, the new restart strategy does not support customization at the
> moment.
> Your requirement sounds reasonable to me, and I think a custom (new)
> restart strategy could be something to support later.
>
> @Till Rohrmann <trohrm...@apache.org> @Gary Yao <g...@apache.org> what do
> you think?
>
> [1]
> https://lists.apache.org/thread.html/6ed95eb6a91168dba09901e158bc1b6f4b08f1e176db4641f79de765%40%3Cdev.flink.apache.org%3E
>
> Thanks,
> Zhu Zhu
>
> On Wed, May 13, 2020 at 7:34 AM Ken Krugler <kkrugler_li...@transpac.com> wrote:
>
>> Hi Till,
>>
>> Sorry, missed the key question… in the RestartStrategy.restart() method,
>> I don't see any good way to get at the underlying exception.
>>
>> I can cast the RestartCallback to an ExecutionGraphRestartCallback, but
>> I still need access to the private execGraph to be able to get at the
>> failure info. Is there some other way in the restart handler to get at
>> this?
>>
>> And yes, I meant to note you'd mentioned the required static method in
>> your email; I was asking about documentation for it.
>>
>> Thanks,
>>
>> — Ken
>>
>> ===============================================================
>> Sorry to resurface an ancient question, but is there a working example
>> anywhere of setting a custom restart strategy?
>>
>> Asking because I've been wandering through the Flink 1.9 code base for
>> a while, and the restart strategy implementation is… pretty tangled.
>>
>> From what I've been able to figure out, you have to provide a factory
>> class, something like this:
>>
>> Configuration config = new Configuration();
>> config.setString(ConfigConstants.RESTART_STRATEGY,
>>     MyRestartStrategyFactory.class.getCanonicalName());
>> StreamExecutionEnvironment env =
>>     StreamExecutionEnvironment.createLocalEnvironment(4, config);
>>
>> That factory class should extend RestartStrategyFactory, but it also
>> needs to implement a static method that looks like:
>>
>> public static MyRestartStrategyFactory createFactory(Configuration config) {
>>     return new MyRestartStrategyFactory();
>> }
>>
>> I wasn't able to find any documentation that mentions this particular
>> method being a requirement.
>>
>> Also, the documentation at
>> https://ci.apache.org/projects/flink/flink-docs-stable/ops/config.html#fault-tolerance
>> doesn't mention that you can set a custom class name for restart-strategy.
>>
>> Thanks,
>>
>> — Ken
>>
>>
>> On Nov 22, 2018, at 8:18 AM, Till Rohrmann <trohrm...@apache.org> wrote:
>>
>> Hi Kasif,
>>
>> I think in this situation it is best if you define your own custom
>> RestartStrategy by specifying a class which has a `RestartStrategyFactory
>> createFactory(Configuration configuration)` method, as `restart-strategy:
>> MyRestartStrategyFactoryFactory` in `flink-conf.yaml`.
>>
>> Cheers,
>> Till
>>
>> On Thu, Nov 22, 2018 at 7:18 AM Ali, Kasif <kasif....@gs.com> wrote:
>>
>>> Hello,
>>>
>>> Looking at the existing restart strategies, they are kind of generic.
>>> We have a requirement to restart the job only in case of specific
>>> exceptions/issues.
>>>
>>> What would be the best way to have a restart strategy which is based
>>> on a few rules, like looking at a particular type of exception or some
>>> extra condition checks which are application specific?
>>>
>>> Just some background on the specific issue which prompted this
>>> requirement: slots not getting released when the job finishes. In our
>>> applications, we keep track of jobs submitted with the amount of
>>> parallelism allotted to them. Once a job finishes, we assume that its
>>> slots are free and try to submit the next set of jobs, which at times
>>> fail with the error "not enough slots available".
>>>
>>> So we think a job restart can solve this issue, but we only want to
>>> restart if this particular situation is encountered.
>>>
>>> Please let us know if there are better ways to solve this problem
>>> other than a restart strategy.
>>>
>>> Thanks,
>>>
>>> Kasif
>>>
>>> ------------------------------
>>>
>>> Your Personal Data: We may collect and process information about you
>>> that may be subject to data protection laws. For more information
>>> about how we use and disclose your personal data, how we protect your
>>> information, our legal basis to use your information, your rights and
>>> who you can contact, please refer to: www.gs.com/privacy-notices
>>
>> --------------------------
>> Ken Krugler
>> http://www.scaleunlimited.com
>> custom big data solutions & training
>> Hadoop, Cascading, Cassandra & Solr
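[Editor's note] The thread ends without a concrete example of the exception-filtering behavior Kasif asked about. Below is a minimal standalone sketch of the idea: a strategy that is told the failure cause and only permits a restart when the cause chain contains a given exception type. The interface and class names here (`BackoffStrategy`, `ExceptionFilteringStrategy`) are hypothetical; they only mirror the shape of Flink's `RestartBackoffTimeStrategy` contract (`notifyFailure(cause)` / `canRestart()` / `getBackoffTime()`) and do not plug into an actual Flink scheduler.

```java
import java.util.concurrent.TimeoutException;

/**
 * Hypothetical stand-in mirroring the shape of Flink's
 * RestartBackoffTimeStrategy: the scheduler reports each failure via
 * notifyFailure(cause), then asks canRestart() and getBackoffTimeMillis().
 */
interface BackoffStrategy {
    void notifyFailure(Throwable cause);
    boolean canRestart();
    long getBackoffTimeMillis();
}

/**
 * Permits a restart only when the reported failure, or any exception in
 * its cause chain, is an instance of the configured restartable type.
 */
class ExceptionFilteringStrategy implements BackoffStrategy {
    private final Class<? extends Throwable> restartableType;
    private final long backoffMillis;
    private boolean lastFailureRestartable = false;

    ExceptionFilteringStrategy(Class<? extends Throwable> restartableType, long backoffMillis) {
        this.restartableType = restartableType;
        this.backoffMillis = backoffMillis;
    }

    @Override
    public void notifyFailure(Throwable cause) {
        lastFailureRestartable = false;
        // Walk the cause chain so wrapped exceptions are matched too.
        for (Throwable t = cause; t != null; t = t.getCause()) {
            if (restartableType.isInstance(t)) {
                lastFailureRestartable = true;
                return;
            }
        }
    }

    @Override
    public boolean canRestart() {
        return lastFailureRestartable;
    }

    @Override
    public long getBackoffTimeMillis() {
        return backoffMillis;
    }
}

public class Demo {
    public static void main(String[] args) {
        // Restart only on TimeoutException, with a 1s backoff.
        BackoffStrategy s = new ExceptionFilteringStrategy(TimeoutException.class, 1000L);

        s.notifyFailure(new RuntimeException(new TimeoutException("not enough slots")));
        System.out.println(s.canRestart());   // true: cause chain contains a TimeoutException

        s.notifyFailure(new IllegalStateException("unrelated failure"));
        System.out.println(s.canRestart());   // false: no matching cause
    }
}
```

In a real deployment the filtering would live inside whatever customization hook Flink ultimately exposes; as Zhu Zhu notes above, the new RestartBackoffTimeStrategy did not support custom implementations at the time of this thread.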