Hello! Thank you very much for the feedback. I’ve been hoping to get more traction on this proposal, so it’s great to hear from another user of the feature.
I understand that there’s a lot of support for keeping a native task-level SLA feature, and I definitely agree with that sentiment. Our organization relies heavily on Airflow evaluating task-level SLAs (‘task_sla’) to keep track of which tasks in each DAG failed to succeed by an expected time. In the AIP I put together on the Confluence page, and in the Google docs, I have identified why the existing implementation of the task-level SLA feature can be problematic and is often misleading for Airflow users. The feature is also quite costly for the Airflow scheduler and dag_processor. In that sense, the discussion is not about whether or not these SLA features are important to users; it is a more technical question: can a task-level feature be supported in a first-class way as a core feature of Airflow, or should it be implemented by users themselves, for example as independent tasks leveraging Deferrable Operators? (A rough sketch of such a user-space approach follows the quoted message below.) My current thinking is that only DAG-level SLAs can be supported in a non-disruptive way by the scheduler, and that task-level SLAs should be handled outside of core Airflow infrastructure code. If you strongly believe otherwise, I think it would be helpful if you could propose an alternative technical solution that solves the existing problems with the task-level SLA feature.

> On Jun 13, 2023, at 1:10 PM, Ярослав Поскряков <yaroslavposkrya...@gmail.com> wrote:
>
> Mechanism of SLA
>
> Hi, I read the previous conversation regarding SLAs, and I think removing the ability to set an SLA at the task level would be a big mistake.
> That said, the proposed implementation of the task-level SLA will not work correctly.
>
> That's why I think we have to reconsider the mechanism of using SLAs.
>
> I believe we should cover three different cases in general.
>
> 1. It doesn't matter how long any specific task takes; what matters is the lag between the execution_date of the DAG and the success state of the task. We can call it dag_sla. This is similar to the current implementation of manage_slas.
>
> 2. It is important to understand and manage how long a specific task works. In my opinion, "working" is the period between the task's last start_date and its first SUCCESS state after that start_date. So, for example, for a task in the FAILED state we still have to check the SLA under that strategy. We can call it task_sla.
>
> 3. Sometimes we need to manage the time a task spends in the RUNNING state. We can call it time_limit_sla.
>
> These three types of SLA should cover all possible cases, so we would have three different SLA strategies.
>
> For dag_sla I think we can use this idea: https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvement+Proposals
>
> For task_sla and time_limit_sla I would prefer to stay with using SchedulerJob.
>
> Github: Yaro1
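To make the "independent tasks leveraging Deferrable Operators" idea above more concrete, here is a minimal, hypothetical sketch of a user-space task-level SLA check. It is not the AIP's proposal; the DAG id, task names, and the 2-hour deadline are illustrative assumptions. A DateTimeSensorAsync defers cheaply (via the triggerer) until the deadline, after which an ordinary task inspects the monitored task's state and alerts if it has not succeeded.

```python
# Hypothetical user-space task-level SLA check, assuming Airflow 2.4+.
# Names like `payload`, `sla_deadline`, and `check_payload_done` are
# illustrative only.
from datetime import timedelta

import pendulum
from airflow import DAG
from airflow.decorators import task
from airflow.sensors.date_time import DateTimeSensorAsync

with DAG(
    dag_id="payload_with_user_space_sla",
    start_date=pendulum.datetime(2023, 6, 1, tz="UTC"),
    schedule="@daily",
    catchup=False,
) as dag:

    @task
    def payload():
        # The task whose completion time we want to watch.
        ...

    # Defer (cheaply, in the triggerer) until the SLA deadline:
    # end of the data interval plus 2 hours in this example.
    sla_deadline = DateTimeSensorAsync(
        task_id="sla_deadline",
        target_time="{{ data_interval_end + macros.timedelta(hours=2) }}",
    )

    @task(trigger_rule="all_done")
    def check_payload_done(**context):
        # Once the deadline has passed, look up the monitored task's state
        # and alert (here: simply fail loudly) if it has not succeeded yet.
        dag_run = context["dag_run"]
        ti = dag_run.get_task_instance("payload")
        if ti is None or ti.state != "success":
            raise RuntimeError(
                "SLA missed: 'payload' did not succeed by the deadline"
            )

    payload()
    sla_deadline >> check_payload_done()
```

The point of the sketch is that nothing here needs scheduler support: the deadline watcher is just another (deferrable) task in the DAG, so it scales with the triggerer rather than adding per-task SLA bookkeeping to the scheduler loop.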