Re: Task Failure Strategy for Adaptive Scheduler

Talat Uyarer via user Tue, 18 Apr 2023 09:13:12 -0700

Hi David,

Yes We have multiple disjoint DAGs in one job. We want better CPU
utilization.  Open Source Flink has a scheduling issue with those types of
jobs. I made a fix on 1.13 with AS.  Now we are scheduling evenly for all
DAGs. However somehow when we get an exception on a DAG we dont want to
affect others and restart only whichever gets the exception. I believe the
Region Pipelined model is good for what the Default scheduler has.


Do you have anything in your mind that addresses restarting other than
Regioned Pipelines ?

Thanks



On Tue, Apr 18, 2023 at 12:19 AM David Morávek <david.mora...@gmail.com>
wrote:

> > Our DAG has multiple sources which are not connected to each other.
>
> To double-check, are you saying the job consists of multiple disjoint DAGs?
>
> >  Do you think somehow the adaptive scheduler supports region pipelines
> for streaming jobs ?
>
> It's doable but might not be straightforward since the AS recycles
> ExecutionGraph during restart. It has been a low priority so far because
> it's mainly valuable for batch jobs, but we might reconsider it if there
> are enough use cases.
>
> Best,
> D.
>
> On Sat, Apr 15, 2023 at 8:15 AM Talat Uyarer <tuya...@paloaltonetworks.com>
> wrote:
>
>> Thanks David and others.
>>
>> Our DAG has multiple sources which are not connected to each other. If
>> one of them fails, I believe Flink can restart a single region for
>> defaultscheduler. but it is not the same case for adaptive scheduler. Do
>> you think somehow the adaptive scheduler supports region pipelines for
>> streaming jobs ? If it can, local recovery makes sense at that time I
>> believe.
>>
>> Thanks
>>
>>
>> On Wed, Apr 12, 2023 at 2:15 AM David Morávek <david.mora...@gmail.com>
>> wrote:
>>
>>> Hi Talat,
>>>
>>> For most streaming pipelines, we have to restart the whole pipeline no
>>> matter the scheduler used because they're a single pipelined region. One
>>> limitation of AdaptiveScheduler is the lack of support for local recovery.
>>> This will be addressed in Flink 1.18 [1].
>>>
>>> [1] https://issues.apache.org/jira/browse/FLINK-21450
>>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__issues.apache.org_jira_browse_FLINK-2D21450&d=DwMFaQ&c=V9IgWpI5PvzTw83UyHGVSoW3Uc1MFWe5J8PTfkrzVSo&r=BkW1L6EF7ergAVYDXCo-3Vwkpy6qjsWAz7_GD7pAR8g&m=io1uqfos6mZy0_MjhX-EMn6Kv1O-J51WEdXoLFL2-JdSmUURlkVSK9Jo06K7PXbt&s=duMv5JJ1nSCwJNlA1eG3uaopaPpgwqrBfBnbegFEb7s&e=>
>>>
>>> Best,
>>> D.
>>>
>>> On Tue, Apr 11, 2023 at 4:35 AM Weihua Hu <huweihua....@gmail.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> AFAIK, the reactive mode always restarts the whole pipeline now.
>>>>
>>>> Best,
>>>> Weihua
>>>>
>>>>
>>>> On Tue, Apr 11, 2023 at 8:38 AM Talat Uyarer via user <
>>>> user@flink.apache.org> wrote:
>>>>
>>>>> Hi All,
>>>>>
>>>>> We use Flink 1.13 with reactive mode for our streaming jobs. When we
>>>>> have an issue/exception on our pipeline. Flink rescheduled all tasks. Is
>>>>> there any way to reschedule only task that had exceptions ?
>>>>>
>>>>> Thanks
>>>>>
>>>>

Re: Task Failure Strategy for Adaptive Scheduler

Reply via email to