> Would it be possible to develop this out-of-tree for the time being?

Oh absolutely. I definitely do not want to add "more" to the Airflow 3
bandwagon.
I am even quite sure that neither I nor anyone involved in Airflow 3 will
be the one implementing it. It's more of a conceptual discussion - and an
attempt to present it as an interesting idea that someone could take a
closer look at.

Manikandan,

Great to hear from you; it's fantastic to hear from maintainers of other
projects. But for your information - we are now in the process of
completely rewriting parts of Airflow for Airflow 3, which means that we are
deep in the weeds and busy in various areas - and as Ash mentioned, we do
not want to "complicate" things by adding more **just now**.

And I really love the idea of starting something outside in parallel. There
are a number of people here who are not that deeply involved in Airflow 3
and they could take on the discussion - and maybe even work with Manikandan
directly on such an executor - somewhere on the side - just to explore more
of what Manikandan just started.

I'd be curious to see the result of such work, where someone with a deeper
understanding of Airflow (but not necessarily involved in the Airflow 3
work) could run a self-guided experiment and see what could be achieved,
even by developing an executor POC that would work for Airflow 2.

Maybe someone?


J.

On Tue, Oct 22, 2024 at 1:22 PM Ash Berlin-Taylor <a...@apache.org> wrote:

> This looks like it has some really cool features.
>
> > *TL;DR: I would love to start a discussion about creating (for Airflow
> > 3.x - it does not have to be Airflow 3.0) a new community executor based
> > on YuniKorn*
>
> To me, this caveat is the main point - as long as it’s not in 3.0 (and
> ideally, for me, not even in the repo for the next few months) - for two
> reasons:
>
> 1. AIP-72 is going to change the Executor interface somewhat, and we don’t
> know the exact details of how yet, so not having to worry about fixing up
> another executor and ensuring it works would be good, so as not to slow
> down development of 3.0; and
> 2. I’m slightly nervous about the extra support load of a new executor at
> this time. It’s probably not all that much on the Airflow side of things,
> but it is an unknown risk to me right now.
>
> Would it be possible to develop this out-of-tree for the time being?
>
> Thanks,
> Ash
>
> > On 18 Oct 2024, at 08:41, Shubham Raj <shubhamraj....@gmail.com> wrote:
> >
> > Hi Jarek, Amogh, and everyone,
> >
> > I wanted to share my thoughts on the proposal to integrate YuniKorn, and
> > I'm definitely on board with it! As others mentioned, adding YuniKorn as
> > another executor could really enhance our scheduling capabilities,
> > especially for the more complex scenarios that Celery and Kubernetes
> > executors struggle with.
> >
> > One of the standout features of YuniKorn is its hierarchical queueing and
> > resource quota management, which is fantastic for handling multi-tenant
> > environments. This will help us keep resource-heavy Airflow tasks from
> > bogging down shared clusters and ensure that resources are allocated
> > fairly across different services. Now, regarding gang scheduling: as I
> > understand it, Airflow operates on a sequential model because of its DAG
> > structure - tasks must wait for their dependencies to finish before they
> > can run. This might seem at odds with the idea of gang scheduling, but
> > there are definitely scenarios where it could be useful. For instance, if
> > we have several independent data processing tasks that need to share
> > resources, gang scheduling could help us optimize resource use and reduce
> > latency by allowing those tasks to be scheduled at the same time.
> >
> > Overall, I believe that integrating these YuniKorn features could really
> > boost Airflow’s capabilities, especially for complex workflows, or at
> > least in resource-constrained environments. Looking forward to hearing
> > everyone’s thoughts!
> >
> > Thanks & Regards,
> > Shubham
> >
> > On Fri, Oct 18, 2024 at 10:19 AM Amogh Desai <amoghdesai....@gmail.com>
> > wrote:
> >
> >> Hi Jarek, Everyone,
> >>
> >> Thanks for starting this discussion!
> >> I agree with everyone so far that this will be more of an additional
> >> executor rather than a replacement for
> >> anything we currently have.
> >>
> >> I had submitted a talk that mainly tried to explain how we can leverage
> >> some features of YuniKorn such as priority scheduling, multi-tenancy
> >> (per deployment, in terms of resources) and preemption. Not all of these
> >> features are fully implemented / integrated yet, but I had planned to
> >> explore them and share my findings if my session got selected. I was
> >> mainly exploring integration with hierarchical queues and resource
> >> quotas.
> >>
> >> To set the tone, we already have some examples running in our cluster
> >> deployments. We use Airflow in Kubernetes with the K8sExecutor, where we
> >> share space between Airflow jobs and other data engineering workloads.
> >>
> >> Through the integration with YuniKorn, we are able to achieve a few
> >> things:
> >> 1. Priority Scheduling
> >> We’ve set priorities for different services running in our cluster. For
> >> example, let's say both Airflow jobs and Spark jobs run in a cluster. We
> >> give Spark drivers the same priority as Airflow workers, which ensures
> >> that Airflow workers get priority over Spark executors. This way, Airflow
> >> schedules won’t be missed, and Spark jobs are not negatively impacted
> >> because they can still run with fewer executors.
> >>
> >> 2. Resource Quotas
> >> We also link Airflow namespaces (where the workers and the core services
> >> run) with resource quotas to prevent a malformed or resource-heavy
> >> Airflow task in a faulty DAG from taking over the entire K8s cluster.
> >> This is important since we have both Airflow and other data engineering
> >> workloads running together.
> >>
> >> I had a chat with some folks from the YuniKorn team, and apart from
> >> this, I think a few other features of YuniKorn, such as gang scheduling
> >> and preemption, could be beneficial to Airflow:
> >> 1. Gang Scheduling
> >> Airflow DAGs generally have a pattern where tasks are dependent on each
> >> other - so let's say task1 -> task2 -> task3 ...
> >> So even though there are many tasks, there's just one DAG process, which
> >> could benefit from gang scheduling if the whole task set can be
> >> considered a single app. For those of you who aren't too familiar with
> >> gang scheduling, it can be thought of as waiting for all your friends to
> >> join you for a game rather than waiting for them one by one (the easiest
> >> example I could think of).
> >>
> >> 2. Preemption
> >> We can approach preemption from different angles depending on the use
> >> case, like preempting the entire app instead of using per-request
> >> preemption, OR not preempting a task if it has a dependent task, because
> >> preemption is expensive.
> >>
> >> Overall, I believe the community would benefit from this integration,
> >> and I think the YuniKorn team will support it as well.
> >>
> >> Thanks & Regards,
> >> Amogh Desai
> >>
> >>
> >> On Thu, Oct 17, 2024 at 11:06 PM Jarek Potiuk <ja...@potiuk.com> wrote:
> >>
> >>>> As Jens said "K8sExecutor++".
> >>>> Just to be precise, I don't believe that this can be a replacement
> >>>> for Celery Executor (at least at first glance).
> >>>
> >>> Yes. Fully agree. My bad framing from the initial message.
> >>>
> >>>> I also believe that for this to be effective, this will need some
> >>>> dedicated work including additional information about the task.
> >>>
> >>> Oh absolutely. For me it's more of a (once we agree it's a good
> >>> direction) "let's keep it as something that **might** eventually happen,
> >>> but not in 3.0" thing. This is really an "if we hear about more cases
> >>> that it might solve, let's see if we need any changes in the current
> >>> Airflow 3 work to enable it or make it easier" kind of thing. More like
> >>> making mental space for this to happen when we are discussing other
> >>> things. The last thing I want to do is add more substantial work to our
> >>> 3.0 efforts.
> >>>
> >>>> I am very curious for Amogh to chime in on this :)
> >>>
> >>> Knowing that there was a talk in preparation, me too :D
> >>>
> >>>> The biggest decision is whether this is a community managed executor
> >>>> or if we can find stakeholders to create this outside of Airflow (those
> >>>> stakeholders could be some of us from the community).
> >>>
> >>> That's an excellent point, Niko. Yes, it could be done outside. It
> >>> could be done by the YuniKorn people (unlikely - they likely have more
> >>> work than they can handle) or by one of the stakeholders (at least
> >>> initially), and published, released and battle-tested by them, and
> >>> eventually contributed to the community. I think this is a very good
> >>> pattern for Open Source, where commercial users can reap the benefits of
> >>> their investment as "first movers" while paying the price for "teething
> >>> problems" - but then later contribute back to the community. A company
> >>> starting with C and ending with a comes to mind immediately as an
> >>> obvious candidate, if you ask me.
> >>>
> >>> J.
> >>>
> >>>
> >>> On Thu, Oct 17, 2024 at 7:19 PM Oliveira, Niko <oniko...@amazon.com.invalid>
> >>> wrote:
> >>>
> >>>> I love the idea. Generally it is quite easy now to add new executors
> >>>> and there is no harm in having more options. Honestly, I don't think we
> >>>> need to justify it as a replacement for anything.
> >>>>
> >>>> The biggest decision is whether this is a community managed executor
> >>>> or if we can find stakeholders to create this outside of Airflow (those
> >>>> stakeholders could be some of us from the community).
> >>>>
> >>>> Cheers,
> >>>> Niko
> >>>>
> >>>> ________________________________
> >>>> From: Vikram Koka <vik...@astronomer.io.INVALID>
> >>>> Sent: Wednesday, October 16, 2024 4:13:27 PM
> >>>> To: dev@airflow.apache.org
> >>>> Subject: RE: [EXT] [DISCUSS] Create community "Apache YuniKorn" executor ?
> >>>>
> >>>> I am supportive of this in the long term (i.e. post-3.0) as an
> >>>> additional Executor similar to the Kubernetes Executor.
> >>>> As Jens said "K8sExecutor++".
> >>>>
> >>>> Just to be precise, I don't believe that this can be a replacement for
> >>>> Celery Executor (at least at first glance).
> >>>>
> >>>> I also believe that for this to be effective, this will need some
> >>>> dedicated work including additional information about the task.
> >>>> I am very curious for Amogh to chime in on this :)
> >>>>
> >>>>
> >>>>
> >>>> On Tue, Oct 15, 2024 at 1:58 PM Jarek Potiuk <ja...@potiuk.com>
> >>>> wrote:
> >>>>
> >>>>> Yeah - it was a bit of a dramatisation when I recalled the Celery
> >>>>> "replacement" ;). And yes, it's not really an "alternative" to Celery;
> >>>>> Celery is here to stay for short tasks.
> >>>>>
> >>>>> Almost by definition, it is meant to run heavier tasks (for example
> >>>>> batch inference), where multiple tasks running in parallel share the
> >>>>> same GPU - because that's what we want to optimize.
> >>>>>
> >>>>> And yes - it provides features that the K8S executor does not: gang
> >>>>> scheduling and sophisticated preemption logic.
> >>>>>
> >>>>> J.
> >>>>>
> >>>>> On Tue, Oct 15, 2024 at 8:40 PM Jens Scheffler <j_scheff...@gmx.de.invalid>
> >>>>> wrote:
> >>>>>
> >>>>>> Hi Jarek,
> >>>>>>
> >>>>>> Scanning, but not reading, the full docs, I understand that YuniKorn
> >>>>>> is a specialized, more advanced K8sExecutor - all workloads also run
> >>>>>> in Pods?
> >>>>>>
> >>>>>> If this understanding is right, then it might be a K8sExecutor++ or
> >>>>>> could replace it... but Celery usually plays very well if you have
> >>>>>> very small, high-frequency tasks. I don't know if I misinterpret the
> >>>>>> docs... but would it also scale down to very small
> >>>>>> PythonOperator/@task-decorated tasks with a few lines of code?
> >>>>>>
> >>>>>> Jens
> >>>>>>
> >>>>>> On 15.10.24 12:55, Jarek Potiuk wrote:
> >>>>>>> Hello here,
> >>>>>>>
> >>>>>>> *TL;DR: I would love to start a discussion about creating (for
> >>>>>>> Airflow 3.x - it does not have to be Airflow 3.0) a new community
> >>>>>>> executor based on YuniKorn*
> >>>>>>>
> >>>>>>> You might remember my point about "replacing Celery Executor" when I
> >>>>>>> raised the Airflow 3 question. I never actually "meant" to replace
> >>>>>>> (and remove) Celery Executor; rather, I was on a quest to see if we
> >>>>>>> have a viable alternative.
> >>>>>>>
> >>>>>>> And I think we have one with Apache YuniKorn.
> >>>>>>> https://yunikorn.apache.org/
> >>>>>>>
> >>>>>>> While it is not a direct replacement (so I'd say it should be an
> >>>>>>> additional executor), I think YuniKorn can provide us with a number
> >>>>>>> of features that we currently cannot give to our users. From the
> >>>>>>> discussions I had and the talk I saw at Community Over Code in
> >>>>>>> Denver, I believe it might make Airflow more capable, especially in
> >>>>>>> the "optimization wars" context that I wrote about in
> >>>>>>> https://lists.apache.org/thread/1mp6jcfvx67zd3jjt9w2hlj0c5ysbh8r
> >>>>>>>
> >>>>>>> It seems like quite a good fit for the "Inference" use case that we
> >>>>>>> want to support for Airflow 3.
> >>>>>>>
> >>>>>>> At Community Over Code, I attended a talk by Apple engineers named
> >>>>>>> "Maximizing GPU Utilization: Apache YuniKorn Preemption" (and had
> >>>>>>> quite a nice follow-up discussion), and had a very long discussion
> >>>>>>> with Cloudera people who have been using YuniKorn for years to
> >>>>>>> optimize their workloads.
> >>>>>>>
> >>>>>>> The presentation was not recorded, but I will try to get the slides
> >>>>>>> and send them your way.
> >>>>>>>
> >>>>>>> I think we should take a close look at it, because it seems to have
> >>>>>>> saved a ton of implementation effort for the Apple team running
> >>>>>>> batch inference for their multi-tenant internal environment - which
> >>>>>>> I think is precisely what you want to do.
> >>>>>>>
> >>>>>>> YuniKorn (https://yunikorn.apache.org/) is an "app-aware" scheduler
> >>>>>>> that has a number of queue / capacity management models and policies
> >>>>>>> that allow controlling various applications competing for GPUs from
> >>>>>>> a common pool.
> >>>>>>>
> >>>>>>> They mention things like:
> >>>>>>>
> >>>>>>> * Gang scheduling (with gang scheduling preemption) for workloads
> >>>>>>>   requiring a minimum number of workers
> >>>>>>> * Support for latency-sensitive workloads
> >>>>>>> * Resource quota management - things like priorities of execution
> >>>>>>> * YuniKorn preemption - with guaranteed capacity and preemption when
> >>>>>>>   needed - which improves utilisation
> >>>>>>> * Preemption that minimizes preemption cost (Pod-level preemption
> >>>>>>>   rather than application-level preemption) - very customizable
> >>>>>>>   preemption with opt-in/opt-out, queues, resource weights, fencing,
> >>>>>>>   FIFO/LIFO sorting, etc.
> >>>>>>> * Runs in the cloud and on-premise
> >>>>>>>
> >>>>>>> The talk described quite a few scenarios of preemption/utilization/
> >>>>>>> guaranteed resources etc. They also outlined the new features
> >>>>>>> YuniKorn is working on (intra-queue preemption etc.) and what could
> >>>>>>> be done in the future.
> >>>>>>>
> >>>>>>>
> >>>>>>> Coincidentally - Amogh Desai and a friend submitted a talk for
> >>>>>>> Airflow Summit:
> >>>>>>>
> >>>>>>> "A Step Towards Multi-Tenant Airflow Using Apache YuniKorn"
> >>>>>>>
> >>>>>>> It did not make it into the Summit (Amogh's other talk did) - but I
> >>>>>>> think back then we had not realized the potential of utilising
> >>>>>>> YuniKorn to optimize workflows managed by Airflow.
> >>>>>>>
> >>>>>>> But we seem to have people in the community who know more about the
> >>>>>>> YuniKorn <> Airflow relationship (Amogh :) ) and could probably
> >>>>>>> comment and add some "from the trenches" experience to the
> >>>>>>> discussion.
> >>>>>>>
> >>>>>>> Here is the description of the talk that Amogh submitted:
> >>>>>>>
> >>>>>>> Multi-tenant Airflow is hard, and there have been novel approaches in
> >>>>>>> the recent past to close this gap. A key obstacle in multi-tenant
> >>>>>>> Airflow is the management of cluster resources. This is crucial to
> >>>>>>> prevent one malformed workload from hijacking an entire cluster. It
> >>>>>>> is also vital to prevent users and groups from monopolizing
> >>>>>>> resources in a shared cluster with their workloads.
> >>>>>>>
> >>>>>>> To tackle these challenges, we turn to Apache YuniKorn, a K8s
> >>>>>>> scheduler catering to all kinds of workloads. We leverage YuniKorn’s
> >>>>>>> hierarchical queues in conjunction with resource quotas to establish
> >>>>>>> multi-tenancy both at the shared namespace level and within the
> >>>>>>> individual namespaces where Airflow is deployed.
> >>>>>>>
> >>>>>>> YuniKorn also introduces Airflow to a new dimension of preemption.
> >>>>>>> Now, Airflow workers can preempt resources from lower-priority jobs,
> >>>>>>> ensuring critical schedules in our data pipelines are met without
> >>>>>>> compromise.
> >>>>>>>
> >>>>>>> Join us for a discussion on integrating Airflow with YuniKorn,
> >>>>>>> unraveling solutions to these multi-tenancy challenges. We will also
> >>>>>>> share our past experiences scaling Airflow and the steps we have
> >>>>>>> taken to handle real-world production challenges in equitable
> >>>>>>> multi-tenant K8s clusters.
> >>>>>>>
> >>>>>>> I would love to hear what you think about it. I know we are deep into
> >>>>>>> the Airflow 3.0 implementation - but this can be discussed and
> >>>>>>> implemented independently, and maybe it's a good idea to start on it
> >>>>>>> sooner rather than later if we see that it has good potential.
> >>>>>>>
> >>>>>>> J.
> >>>>>>>
> >>>>>>
> >>>>>>
> >> ---------------------------------------------------------------------
> >>>>>> To unsubscribe, e-mail: dev-unsubscr...@airflow.apache.org
> >>>>>> For additional commands, e-mail: dev-h...@airflow.apache.org
> >>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
>