Great to hear from you, Mani.

I am interested in collaborating with you on this one.
It seems like a promising initial demo, though I have yet to catch up on the specifics.


Thanks & Regards,
Amogh Desai


On Tue, Oct 22, 2024 at 8:56 PM Jarek Potiuk <ja...@potiuk.com> wrote:

> > Would it be possible to develop this out-of-tree for the time being?
>
> Oh absolutely. I definitely do not want to add "more" on the Airflow 3
> band-wagon.
> I am even quite sure I will not be the one implementing it, nor anyone
> involved in Airflow 3. It's more of a conceptual discussion - and an
> attempt to make it as an interesting idea someone could take a closer look
> at.
>
> Manikandan,
>
> Great to hear from you, it's fantastic to hear from maintainers of other
> projects. But for your information -  we are now in the process of
> completely rewriting part of Airflow 3, which means that we are deep down
> and busy in various parts - and as Ash mentioned, we do not want to
> "complicate" things by adding more **just now**.
>
> And I really love the idea of starting something outside in parallel. There
> are a number of people here who are not that deeply involved in Airflow 3
> and they could take on the discussion - and maybe even work with Manikandan
> directly on such an executor - somewhere on the side - just to explore more
> of what Manikandan just started.
>
> I'd be curious to see the result of such a work where someone with deeper
> Airflow understanding (but not necessarily involved in Airflow 3 work)
> could make some self-guided experiment and look at what could be achieved
> even by developing an executor POC that would work for Airflow 2 ?
>
> Maybe someone ?
>
>
> J.
>
> On Tue, Oct 22, 2024 at 1:22 PM Ash Berlin-Taylor <a...@apache.org> wrote:
>
> > This looks like it has some really cool features.
> >
> > > *Tl;DR; I would love to start discussion about creating (for Airflow
> 3.x
> > -
> > it does not have to be Airflow 3.0) a new community executor based on
> > YuniKorn*
> >
> > I think this caveat to me is the main point, as long as it’s not in 3.0
> > (and ideally for me not even in repo for the next few months) for two
> > reasons:
> >
> > 1. AIP 72 is going to change the Executor interface somewhat, and we
> don’t
> > know the exact details of how yet, so having to not worry about another
> > executor to fix up and ensure works would be good to not slow down
> > development of 3.0; and
> > 2. I’m slightly nervous about the extra support load of a new executor at
> > this time. It’s probably not all that much on Airflow side of things, but
> > this is just an unknown risk to me right now.
> >
> > Would it be possible to develop this out-of-tree for the time being?
> >
> > Thanks,
> > Ash
> >
> > > On 18 Oct 2024, at 08:41, Shubham Raj <shubhamraj....@gmail.com>
> wrote:
> > >
> > > Hi Jarek, Amogh, and everyone,
> > >
> > > I wanted to share my thoughts on the proposal to integrate YuniKorn,
> and
> > > I'm definitely on board with it! As others mentioned, adding YuniKorn
> as
> > > another executor could really enhance our scheduling capabilities,
> > > especially for the more complex scenarios that Celery and Kubernetes
> > > executors struggle with.
> > >
> > > One of the standout features of YuniKorn is its hierarchical queueing
> and
> > > resource quota management, which is fantastic for handling multi-tenant
> > > environments. This will help us keep resource-heavy Airflow tasks from
> > > bogging down shared clusters and ensure that resources are allocated
> > fairly
> > > across different services. Now, regarding gang scheduling as per my
> > > understanding, I think it’s interesting to note that Airflow operates
> on
> > a
> > > sequential model because of its DAG structure: tasks must wait for
> their
> > > dependencies to finish before they can run. This might seem at odds
> with
> > > the idea of gang scheduling, but there are definitely scenarios where
> it
> > > could be useful. For instance, if we have several independent data
> > > processing tasks that need to share resources, gang scheduling could
> help
> > > us optimize resource use and reduce latency by allowing those tasks to
> be
> > > scheduled at the same time.
> > >
> > > Overall, I believe that integrating these YuniKorn features could
> really
> > > boost Airflow’s capabilities, especially for complex workflows or
> at least
> > > in resource-constrained environments. Looking forward to hearing
> > everyone’s
> > > thoughts!
> > >
> > > Thanks & Regards,
> > > Shubham
> > >
> > > On Fri, Oct 18, 2024 at 10:19 AM Amogh Desai <amoghdesai....@gmail.com
> >
> > > wrote:
> > >
> > >> Hi Jarek, Everyone,
> > >>
> > >> Thanks for starting this discussion!
> > >> I agree with everyone so far that this will be more of an additional
> > >> executor rather than a replacement for
> > >> anything we currently have.
> > >>
> > >> I had submitted a talk that was mainly trying to explain how we
> > can
> > >> leverage some features of Yunikorn
> > >> such as priority scheduling, multi tenancy (per deployment in terms of
> > >> resources) and preemption.
> > >> Not all of these features are fully implemented / integrated yet, but
> I
> > had
> > >> planned to explore them and share my
> > >> findings if my session got selected. I was trying to explore mainly
> > around
> > >> integration with hierarchical queues
> > >> and resource quotas.
> > >>
> > >> To set a tone, we already have some examples running in our cluster
> > >> deployments. We use Airflow in Kubernetes
> > >> with the K8sExecutor, where we share space to run Airflow jobs and
> other
> > >> data engineering workloads.
> > >>
> > >> Via the integration with Yunikorn, we are able to achieve a few
> things:
> > >> 1. Priority Scheduling
> > >> We’ve set priorities for different services running in our cluster.
> For
> > >> example, let's say, both Airflow jobs and Spark jobs
> > >> run in a cluster. We prioritize Spark Drivers equally with Airflow
> > workers,
> > >> which ensures that Airflow workers get more
> > >> priority over Spark Executors. This way, Airflow schedules won’t be
> > missed,
> > >> and it doesn’t negatively impact
> > >> spark jobs because they can still run with fewer executors.
> > >>
> > >> 2. Resource Quotas: We also link Airflow namespaces (where the workers
> > and
> > >> the core services run) with resource quotas
> > >> to prevent a malformed or a resource heavy Airflow task from taking
> over
> > >> the entire K8s cluster with a faulty DAG. This is
> > >> important since we have both Airflow and other data engineering
> > workloads
> > >> running together.
> > >>
> > >> I had a chat with some folks from the Yunikorn team and apart from
> > this, I
> > >> think a few other features of Yunikorn such as
> > >> gang scheduling, preemption, etc. could be beneficial to Airflow:
> > >> 1. Gang Scheduling
> > >> Airflow DAGs generally have a pattern where tasks are dependent on
> each
> > >> other - so let's say task1 -> task2 -> task3 ...
> > >> So even though there are many tasks, there's just one DAG process. If
> > >> the whole task set can be considered a single app, it could benefit
> > >> from gang scheduling. For those of you who
> > >> aren't too familiar with gang scheduling, gang scheduling can be
> > thought of
> > >> as waiting for all your friends to join you
> > >> for a game rather than waiting for them one by one (easiest example I
> > could
> > >> think of).
> > >>
> > >> 2. Preemption
> > >> We can think of different angles to preemption based on the use cases.
> > Like
> > >> preempting the entire app instead of using a
> > >> per request preemption OR not preempting a task if it has a dependent
> > task
> > >> because preemption is expensive.
> > >>
> > >> Overall, I believe the community would benefit from this integration,
> > and I
> > >> think the Yunikorn team will support it as well.
> > >>
> > >> Thanks & Regards,
> > >> Amogh Desai
> > >>
> > >>
> > >> On Thu, Oct 17, 2024 at 11:06 PM Jarek Potiuk <ja...@potiuk.com>
> wrote:
> > >>
> > >>>> As Jens said "K8sExecutor++".
> > >>>> Just to be precise, I don't believe that this can be a replacement
> for
> > >>> Celery Executor (at least at first glance).
> > >>>
> > >>> Yes. Fully agree. My bad framing from the initial message.
> > >>>
> > >>>> I also believe that for this to be effective, this will need some
> > >>> dedicated work including additional information about the task.
> > >>>
> > >>> Oh absolutely. For me it's more of a (when we agree it's a good
> > >> direction)
> > >>> - let's keep it as something that **might** eventually happen and not
> > in
> > >>> 3.0. This is really "if we hear more cases that it might solve, let's
> > see
> > >>> if we need any changes in current Airflow 3 work to enable it or make
> > it
> > >>> easier." kinda thing. More like making a mental space for this to
> > happen
> > >>> when we are discussing other things. Last thing I want to do is to
> add
> > >> more
> > >>> substantial work for our 3.0 efforts.
> > >>>
> > >>>> I am very curious for Amogh to chime in on this :)
> > >>>
> > >>> Knowing that there was a talk in-preparation, me too :D
> > >>>
> > >>>> The biggest decision is whether this is a community managed executor
> > or
> > >>> if we can find stakeholders to create this outside of Airflow (those
> > >>> stakeholders could be some of us from the community).
> > >>>
> > >>> That's an excellent point Niko. Yes. It could be done outside. It
> could
> > >> be
> > >>> done by Yunikorn people (unlikely - they likely have more work than
> > they
> > >>> can handle) or one of the stakeholders (at least initially) - and
> > >> published
> > >>> and released and battle-tested by them and eventually contributed to
> > the
> > >>> community. This is I think a very good pattern for Open Source, where
> > >>> commercial users might reap the benefits of their investment as
> "first
> > >>> movers" while paying the price for "teething problems" -  but then
> > later
> > >>> contributing back to the community. A company starting with C and
> > ending
> > >>> with a comes to my mind immediately as an obvious candidate if you
> ask
> > >> me.
> > >>>
> > >>> J.
> > >>>
> > >>>
> > >>> On Thu, Oct 17, 2024 at 7:19 PM Oliveira, Niko
> > >> <oniko...@amazon.com.invalid
> > >>>>
> > >>> wrote:
> > >>>
> > >>>> I love the idea. Generally it is quite easy now to add new executors
> > >> and
> > >>>> there is no harm in having more options. I don't think we need to
> > >> justify
> > >>>> it as a replacement of anything honestly.
> > >>>>
> > >>>> The biggest decision is whether this is a community managed executor
> > or
> > >>> if
> > >>>> we can find stakeholders to create this outside of Airflow (those
> > >>>> stakeholders could be some of us from the community).
> > >>>>
> > >>>> Cheers,
> > >>>> Niko
> > >>>>
> > >>>> ________________________________
> > >>>> From: Vikram Koka <vik...@astronomer.io.INVALID>
> > >>>> Sent: Wednesday, October 16, 2024 4:13:27 PM
> > >>>> To: dev@airflow.apache.org
> > >>>> Subject: RE: [EXT] [DISCUSS] Create community "Apache YuniKorn"
> > >> executor
> > >>> ?
> > >>>>
> > >>>> I am supportive of this in the long term (i.e. post-3.0) as an
> > >> additional
> > >>>> Executor similar to the Kubernetes Executor.
> > >>>> As Jens said "K8sExecutor++".
> > >>>>
> > >>>> Just to be precise, I don't believe that this can be a replacement
> for
> > >>>> Celery Executor (at least at first glance).
> > >>>>
> > >>>> I also believe that for this to be effective, this will need some
> > >>> dedicated
> > >>>> work including additional information about the task.
> > >>>> I am very curious for Amogh to chime in on this :)
> > >>>>
> > >>>>
> > >>>>
> > >>>> On Tue, Oct 15, 2024 at 1:58 PM Jarek Potiuk <ja...@potiuk.com>
> > wrote:
> > >>>>
> > >>>>> Yeah -  it was a bit of dramatisation when I recalled the Celery
> > >>>>> "replacement" ;) . And yes it's not really "alternative" to Celery,
> > >>>> Celery
> > >>>>> is there to stay for short tasks.
> > >>>>>
> > >>>>> Almost by definition it is meant to run more heavy tasks (for
> example
> > >>>> batch
> > >>>>> inference) where multiple tasks running in parallel share the same
> > >> GPU
> > >>>> for
> > >>>>> example - because that's what we want to optimize.
> > >>>>>
> > >>>>> And yes - it provides features that K8S executor does not - gang
> > >>>>> scheduling, and sophisticated preemption logic.
> > >>>>>
> > >>>>> J.
> > >>>>>
> > >>>>> On Tue, Oct 15, 2024 at 8:40 PM Jens Scheffler
> > >>>> <j_scheff...@gmx.de.invalid
> > >>>>>>
> > >>>>> wrote:
> > >>>>>
> > >>>>>> Hi Jarek,
> > >>>>>>
> > >>>>>> scanning but not reading the full docs I understand that YuniKorn
> > >> is
> > >>> a
> > >>>>>> specialized, more advanced K8sExecutor - all workload also runs in
> > >>>> PODs?
> > >>>>>>
> > >>>>>> If this is the right understanding then it might be a
> K8sExecutor++
> > >>> or
> > >>>>>> could replace this... but Celery usually works very well if
> > >> you
> > >>>>>> have very small and high-frequency tasks. Don't know if I
> > >>> mis-interpret
> > >>>>>> the docs... but would it be scaling down to very small
> > >>>>>> PythonOperator/@task decorated tasks with a few lines of code as
> > >>> well?
> > >>>>>>
> > >>>>>> Jens
> > >>>>>>
> > >>>>>> On 15.10.24 12:55, Jarek Potiuk wrote:
> > >>>>>>> Hello here,
> > >>>>>>>
> > >>>>>>> *Tl;DR; I would love to start discussion about creating (for
> > >>> Airflow
> > >>>>> 3.x
> > >>>>>> -
> > >>>>>>> it does not have to be Airflow 3.0) a new community executor
> > >> based
> > >>> on
> > >>>>>>> YuniKorn*
> > >>>>>>>
> > >>>>>>> You might remember my point "replacing Celery Executor" when I
> > >>> raised
> > >>>>> the
> > >>>>>>> Airflow 3 question. I never actually "meant" to replace (and
> > >>> remove)
> > >>>>>> Celery
> > >>>>>>> Executor, but I was more in a quest to see if we have a viable
> > >>>>>> alternative.
> > >>>>>>>
> > >>>>>>> And I think we have one with Apache YuniKorn.
> > >>>>>> https://yunikorn.apache.org/
> > >>>>>>>
> > >>>>>>> While it is not a direct replacement (so I'd say it should be an
> > >>>>>> additional
> > >>>>>>> executor), I think Yunikorn can provide us with a number of
> > >>> features
> > >>>>> that
> > >>>>>>> we currently cannot give to our users and from the discussions I
> > >>> had
> > >>>>> and
> > >>>>>>> talk I saw at the Community Over Code in Denver, I believe it
> > >> might
> > >>>> be
> > >>>>>>> something that might make Airflow also more capable especially in
> > >>> the
> > >>>>>>> "optimization wars" context that I wrote about in
> > >>>>>>> https://lists.apache.org/thread/1mp6jcfvx67zd3jjt9w2hlj0c5ysbh8r
> > >>>>>>>
> > >>>>>>> It seems like quite a good fit for the "Inference" use case that
> > >> we
> > >>>>> want
> > >>>>>> to
> > >>>>>>> support for Airflow 3.
> > >>>>>>>
> > >>>>>>> At the Community Over Code I attended a talk (and had quite nice
> > >>>>>> follow-up
> > >>>>>>> discussion) from Apple engineers - named: "Maximizing GPU
> > >>>> Utilization:
> > >>>>>>> Apache YuniKorn Preemption" and had a very long discussion with
> > >>>>> Cloudera
> > >>>>>>> people who have been using YuniKorn for years to optimize their
> > >>> workloads.
> > >>>>>>>
> > >>>>>>> The presentation is not recorded, but I will try to get slides
> > >> and
> > >>>> send
> > >>>>>> it
> > >>>>>>> your way.
> > >>>>>>>
> > >>>>>>> I think we should take a close look at it  - because it seems to
> > >>>> save a
> > >>>>>> ton
> > >>>>>>> of implementation effort for the Apple team running Batch
> > >> inference
> > >>>> for
> > >>>>>>> their multi-tenant internal environment - which I think is
> > >>> precisely
> > >>>>> what
> > >>>>>>> you want to do.
> > >>>>>>>
> > >>>>>>> YuniKorn (https://yunikorn.apache.org/) is an "app-aware"
> > >>> scheduler
> > >>>>> that
> > >>>>>>> has a number of queue / capacity management models, policies that
> > >>>> allow
> > >>>>>>> controlling various applications - competing for GPUs from a
> > >> common
> > >>>>> pool.
> > >>>>>>>
> > >>>>>>> They mention things like:
> > >>>>>>>
> > >>>>>>> * Gang Scheduling / with gang scheduling preemption where there
> > >> are
> > >>>>>>> workloads requiring minimum number of workers
> > >>>>>>> * Supports Latency sensitive workloads
> > >>>>>>> * Resource quota management - things like priorities of execution
> > >>>>>>> * YuniKorn preemption - with guaranteed capacity and preemption
> > >>> when
> > >>>>>> needed
> > >>>>>>> - which improves the utilisation
> > >>>>>>> * Preemption that minimizes preemption cost (Pod level preemption
> > >>>>> rather
> > >>>>>>> than application level preemption) - very customizable preemption
> > >>>> with
> > >>>>>>> opt-in/opt-out, queues, resource weights, fencing, supporting
> > >>>> fifo/lifo
> > >>>>>>> sorting etc.
> > >>>>>>> * Runs in Cloud and on-premise
> > >>>>>>>
> > >>>>>>> The talk described quite a few scenarios of
> > >> preemption/utilization/
> > >>>>>>> guaranteed resources etc. They also outlined which new features YuniKorn
> > >>>>>>> is working on (intra-queue preemption etc.) and what future things can
> > >>> be
> > >>>>>> done.
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> Coincidentally - Amogh Desai with a friend submitted a talk for
> > >>>> Airflow
> > >>>>>>> Summit:
> > >>>>>>>
> > >>>>>>> "A Step Towards Multi-Tenant Airflow Using Apache YuniKorn"
> > >>>>>>>
> > >>>>>>> Which did not make it to the Summit (other talk of Amogh did) -
> > >>> but I
> > >>>>>> think
> > >>>>>>> back then we had not realized the potential of utilising
> > >>>>> YuniKorn
> > >>>>>> to
> > >>>>>>> optimize workflows managed by Airflow.
> > >>>>>>>
> > >>>>>>> But we seem to have people in the community who know more about
> > >>>>> YuniKorn
> > >>>>>> <>
> > >>>>>>> Airflow relation (Amogh :) ) and could probably comment and add
> > >>> some
> > >>>>>> "from
> > >>>>>>> the trenches" experience to the discussion.
> > >>>>>>>
> > >>>>>>> Here is the description of the talk that Amogh submitted:
> > >>>>>>>
> > >>>>>>> Multi-tenant Airflow is hard and there have been novel approaches
> > >>> in
> > >>>>> the
> > >>>>>>> recent past to converge this gap. A key obstacle in multi-tenant
> > >>>>> Airflow
> > >>>>>> is
> > >>>>>>> the management of cluster resources. This is crucial to avoid one
> > >>>>>> malformed
> > >>>>>>> workload from hijacking an entire cluster. It is also vital to
> > >>>> restrict
> > >>>>>>> users and groups from monopolizing resources in a shared cluster
> > >>>> using
> > >>>>>>> their workloads.
> > >>>>>>>
> > >>>>>>> To tackle these challenges, we turn to Apache YuniKorn, a K8s
> > >>>> scheduler
> > >>>>>>> catering to all kinds of workloads. We leverage YuniKorn’s
> > >>> hierarchical
> > >>>>>> queues
> > >>>>>>> in conjunction with resource quotas to establish multi-tenancy at
> > >>>> both
> > >>>>>> the
> > >>>>>>> shared namespace level and within individual namespaces where
> > >>> Airflow
> > >>>>> is
> > >>>>>>> deployed.
> > >>>>>>>
> > >>>>>>> YuniKorn also introduces Airflow to a new dimension of
> > >> preemption.
> > >>>> Now,
> > >>>>>>> Airflow workers can preempt resources from lower-priority jobs,
> > >>>>> ensuring
> > >>>>>>> critical schedules in our data pipelines are met without
> > >>> compromise.
> > >>>>>>>
> > >>>>>>> Join us for a discussion on integrating Airflow with YuniKorn,
> > >>>>> unraveling
> > >>>>>>> solutions to these multi-tenancy challenges. We will also share
> > >> our
> > >>>>> past
> > >>>>>>> experiences while scaling Airflow and the steps we have taken to
> > >>>> handle
> > >>>>>>> real world production challenges in equitable multi-tenant K8s
> > >>>>> clusters.
> > >>>>>>>
> > >>>>>>> I would love to hear what you think about it. I know we are deep
> > >>> into
> > >>>>>>> Airflow 3.0 implementation - but that one can be
> > >>>> discussed/implemented
> > >>>>>>> independently and maybe it's a good idea to start doing it
> > >> earlier
> > >>>> than
> > >>>>>>> later if we see that it has good potential.
> > >>>>>>>
> > >>>>>>> J.
> > >>>>>>>
> > >>>>>>
> > >>>>>>
> > >> ---------------------------------------------------------------------
> > >>>>>> To unsubscribe, e-mail: dev-unsubscr...@airflow.apache.org
> > >>>>>> For additional commands, e-mail: dev-h...@airflow.apache.org
> > >>>>>>
> > >>>>>>
> > >>>>>
> > >>>>
> > >>>
> > >>
> >
> >
> >
> >
>
