Re: [DISCUSS] Deadline Alert Callbacks

Elad Kalif Fri, 23 May 2025 07:30:58 -0700

Ramit, I think we also agreed that we need to improve the common notifiers
(at least Slack) with templates, so that users do not need to customize the
code on their side. The problem is explained in
https://github.com/apache/airflow/issues/35381 and was solved for SMTP only
in https://github.com/apache/airflow/pull/36226.

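To make the flow agreed on the dev call (quoted below) concrete, here is a
rough sketch, not the actual Airflow implementation. It assumes deadlines are
kept in a min-heap so that only the earliest one needs checking, and that a
callback counts as async when it is a coroutine function. `Deadline`,
`pop_missed`, `route` and `check_deadlines` are hypothetical names made up for
illustration; real Notifiers and the real scheduler, Triggerer and executor
interfaces will differ.

```python
import heapq
import inspect
from dataclasses import dataclass, field
from datetime import datetime
from typing import Callable, List, Tuple


@dataclass(order=True)
class Deadline:
    """Hypothetical deadline record; ordering uses only the due time."""
    due: datetime
    callback: Callable = field(compare=False)


def pop_missed(heap: List[Deadline], now: datetime) -> List[Deadline]:
    """Pop every deadline whose due time has passed, earliest first."""
    missed = []
    while heap and heap[0].due <= now:
        missed.append(heapq.heappop(heap))
    return missed


def route(callback: Callable) -> str:
    """Assumption: coroutine functions go to the triggerer, the rest to the executor."""
    return "triggerer" if inspect.iscoroutinefunction(callback) else "executor"


def check_deadlines(heap: List[Deadline], now: datetime) -> List[Tuple[str, Callable]]:
    """One scheduler-loop pass: decide where each callback runs, never run it here."""
    return [(route(d.callback), d.callback) for d in pop_missed(heap, now)]
```

The point of the sketch is that the scheduler loop only decides where a
callback should run. It never calls user code itself, which is the
security-model concern raised further down the thread.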

On Thu, May 22, 2025 at 11:57 PM Kataria, Ramit <ramit...@amazon.com.invalid>
wrote:

> Following today’s dev call discussion, the community aligned on proceeding
> with a variation of option 1. The check for deadline misses will happen in
> the scheduler main loop (not a separate or child process).
>
> For callbacks, the suggested practice will be to use async callbacks, and
> the existing concept of Notifiers will be the primary recommended callback
> type. Async callbacks like those will be sent to the existing Triggerer for
> execution.
>
> If the user chooses to provide a synchronous callback, then it will be
> sent to the default worker/executor. We will attempt to prioritize these if
> there is a queue that supports prioritization.
>
> On 2025-05-22, 8:18 AM, "Ash Berlin-Taylor" <a...@apache.org> wrote:
>
> Yeah, that’s how I understood it, and then the callback would go
> to a worker (or a triggerer) to run the user code.
>
>
> -a
>
>
> > On 22 May 2025, at 16:06, Jarek Potiuk <ja...@potiuk.com> wrote:
> >
> >> Option 1 as originally proposed in this thread only does the "is a
> >> callback required" check in scheduler -- not running the callback in
> >> scheduler
> >
> > Ah - then OK. I thought it also covered callback execution. Just checking
> > is fine in the scheduler.
> >
> > On Thu, May 22, 2025 at 4:55 PM Daniel Standish
> > <daniel.stand...@astronomer.io.invalid> wrote:
> >
> >> Option 1 as originally proposed in this thread only does the "is a
> >> callback required" check in scheduler -- not running the callback in
> >> scheduler.
> >>
> >> On Thu, May 22, 2025 at 7:22 AM Jarek Potiuk <ja...@potiuk.com> wrote:
> >>
> >>>> So I very strongly vote for Option 1, and if needed make the scheduler
> >>>> itself more resilient. The Airflow Scheduler _IS_ airflow. Let’s do
> >>>> what we need to in order to make it more stable, rather than working
> >>>> around a problem of our own making, whilst also making it
> >>>> operationally more complex to run.
> >>>
> >>> Hey Ash - I forgot to add. Option 1 is against our new security model.
> >>> This is essentially DAG author code executed in the scheduler. Ash - do
> >>> you think it is possible to avoid that? For DAG parsing it resulted in
> >>> a mandatory dag-processor command separated from the scheduler, so I am
> >>> not sure how we would solve the security issue here. Or maybe there is
> >>> another idea on how to solve it? That would be possible if we had
> >>> deadline callbacks defined in the plugins, but again - I think the idea
> >>> was to be able to provide callbacks by DAG authors (which IMHO is
> >>> synonymous with "we do not run it in scheduler").
> >>>
> >>> We could potentially run the callbacks in the Dag processor (which we
> >>> already did BTW), but I am not sure if this is what we want.
> >>>
> >>> J.
> >>>
> >>>
> >>> On Thu, May 22, 2025 at 3:40 PM Elad Kalif <elad...@apache.org> wrote:
> >>>
> >>>> My comment on the name is for the suggested component that runs the
> >>>> workload. It's not about the feature itself. I just suggest a more
> >>>> generic name so that, if the need arises, it would be easier to execute
> >>>> different kinds of workloads on it (like callbacks).
> >>>>
> >>>> As for reusing the Triggerer, I am not a fan of that. It serves a
> >>>> completely different purpose, and combining both cases may result in
> >>>> poor usage of auto scaling. I don't think alerts/callbacks/other "misc"
> >>>> should compete for the same resources as actual tasks.
> >>>>
> >>>> On Thu, 22 May 2025, 16:19, Jarek Potiuk <ja...@potiuk.com> wrote:
> >>>>
> >>>>> How about Option 3) making it part of triggerer.
> >>>>>
> >>>>> I think that goes in the direction we've been discussing in the past
> >>>>> where we have a "generic workload" that we can submit from any of the
> >>>>> other components and that will be executed in triggerer.
> >>>>>
> >>>>> * that would not add too much complexity - no extra process to manage
> >>>>> * triggerer is an obligatory part of installation now anyway
> >>>>> * usually machines today have more processors, and triggerer, with its
> >>>>>   event loop, does not seem to be too busy in terms of multi-processor
> >>>>>   usage (there are extra processes accessing the DB but still not much
> >>>>>   I think). It could fork another process to run just deadline checks.
> >>>>> * re multi-team - it's even easier, triggerer is already going to be
> >>>>>   "per-team".
> >>>>> * we could even rename triggerer to "generic workload processor" (well,
> >>>>>   a shorter name, but to indicate that it could process any kind of
> >>>>>   workload - not only deferred triggers).
> >>>>>
> >>>>> Re: comments from Elad:
> >>>>>
> >>>>> 1) Naming wise: I think we settled on the name already (looong
> >>>>> discussion, naming is hard) and I think the scope of it is really just
> >>>>> "deadlines" (we also wanted to distinguish it from SLA) - I like the
> >>>>> name for this particular callback type, but yes - I agree it should be
> >>>>> more generic, open for any future types of callbacks. If we go for
> >>>>> triggerer handling "generic workload" - that is IMHO "generic enough"
> >>>>> to handle any future workloads.
> >>>>>
> >>>>> 2) I believe this is something that could be handled by the callback.
> >>>>> The callback could have the option to submit a "cancel" request for
> >>>>> the task it is called back for (via task.sdk API) - but that should be
> >>>>> up to the one who writes the callback.
> >>>>>
> >>>>> J.
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> On Thu, May 22, 2025 at 10:03 AM Elad Kalif <elad...@apache.org>
> >>>>> wrote:
> >>>>>
> >>>>>> I prefer option 2 but I have questions.
> >>>>>> 1. Naming wise, maybe we should prefer a more generic name, as I am
> >>>>>> not sure it should be limited to deadlines (maybe it should be shared
> >>>>>> with executing callbacks?)
> >>>>>> 2. How do you plan to manage the queue of alerts? What happens if the
> >>>>>> process is unhealthy while workers continue to execute tasks?
> >>>>>>
> >>>>>> On Thu, May 22, 2025 at 12:56 AM Ryan Hatter
> >>>>>> <ryan.hat...@astronomer.io.invalid> wrote:
> >>>>>>
> >>>>>>> +1 for option 2, primarily because of:
> >>>>>>>
> >>>>>>>> It would be more robust and resilient, and therefore be able to run
> >>>>>>>> the callbacks *even in presence of certain kinds of issues like the
> >>>>>>>> scheduler being bogged-down*
> >>>>>>>
> >>>>>>>
> >>>>>>> On Wed, May 21, 2025 at 5:09 PM Kataria, Ramit
> >>>>>>> <ramit...@amazon.com.invalid> wrote:
> >>>>>>>
> >>>>>>>> Hi all,
> >>>>>>>>
> >>>>>>>> I’m working with Dennis on Deadline Alerts (AIP-86). I'd like to
> >>>>>>>> discuss implementation approaches for executing callbacks when
> >>>>>>>> Deadline Alerts are triggered. As you may know, the old SLA feature
> >>>>>>>> has been removed, and we're planning to introduce Deadline Alerts as
> >>>>>>>> a replacement in 3.1. When a deadline is missed, we need a mechanism
> >>>>>>>> to execute callbacks (which could be notifications or other
> >>>>>>>> actions).
> >>>>>>>>
> >>>>>>>> I’ve identified two main approaches:
> >>>>>>>>
> >>>>>>>> Option 1: Scheduler-based
> >>>>>>>> In this approach, the scheduler would check on a regular interval
> >>>>>>>> to see if the earliest deadline has passed and then queue the
> >>>>>>>> callback to run in an executor (local or remote). The executor
> >>>>>>>> would be specified when creating the deadline alert; if none is
> >>>>>>>> specified, the default executor would be used.
> >>>>>>>>
> >>>>>>>> Option 2: New DeadlineProcessor process
> >>>>>>>> In this approach, there would be a new process, similar to
> >>>>>>>> triggerer/dag-processor and completely independent from the
> >>>>>>>> scheduler, to check for deadlines on a regular interval and also run
> >>>>>>>> the callbacks without queueing them in another executor.
> >>>>>>>>
> >>>>>>>> Multi-team considerations: For multi-team later this year, option 2
> >>>>>>>> would be relatively simple to implement. However, for option 1, the
> >>>>>>>> callbacks would have to run on a remote executor since there would
> >>>>>>>> be no local executor.
> >>>>>>>>
> >>>>>>>> I recommend going with option 2 because:
> >>>>>>>>
> >>>>>>>> * It would be more robust and resilient, and therefore be able to
> >>>>>>>>   run the callbacks even in presence of certain kinds of issues like
> >>>>>>>>   the scheduler being bogged-down
> >>>>>>>> * It would also run the callbacks almost instantly instead of having
> >>>>>>>>   to wait for an executor (especially if there’s a long queue of
> >>>>>>>>   tasks or a cold-start delay)
> >>>>>>>>   * This could be mitigated by implementing a priority system where
> >>>>>>>>     the deadline callbacks are prioritized over regular tasks, but
> >>>>>>>>     this is a non-trivial problem with my current understanding of
> >>>>>>>>     Airflow’s architecture
> >>>>>>>> * It would avoid a potential slight increase in workload for the
> >>>>>>>>   scheduler
> >>>>>>>>   * The additional workload in the scheduler for option 1 would be
> >>>>>>>>     checking to see if the earliest deadline has passed on a regular
> >>>>>>>>     interval
> >>>>>>>>
> >>>>>>>> However, it would introduce another process for admins to deploy
> >>>>>>>> and manage, and also likely require more effort to implement,
> >>>>>>>> therefore taking longer to complete.
> >>>>>>>>
> >>>>>>>> So, I’d like to hear your thoughts on these approaches, anything I
> >>>>>>>> may have missed and if you agree/disagree with this direction. Thank
> >>>>>>>> you for your input!
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Best,
> >>>>>>>>
> >>>>>>>> Ramit Kataria
> >>>>>>>> SDE at AWS
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>