I would be glad to see some early stages  :)

J.

On Fri, Oct 25, 2024 at 3:17 PM Amogh Desai <amoghdesai....@gmail.com>
wrote:

> Great to hear from you, Mani.
>
> I am interested in collaborating with you on this one.
> Seems like a promising initial demo, yet to catch up on the specifics.
>
>
> Thanks & Regards,
> Amogh Desai
>
>
> On Tue, Oct 22, 2024 at 8:56 PM Jarek Potiuk <ja...@potiuk.com> wrote:
>
> > > Would it be possible to develop this out-of-tree for the time being?
> >
> > Oh absolutely. I definitely do not want to add "more" on the Airflow 3
> > band-wagon.
> > I am even quite sure I will not be the one implementing it, nor anyone
> > involved in Airflow 3. It's more of a conceptual discussion - and an
> > attempt to present it as an interesting idea someone could take a closer
> > look at.
> >
> > Manikandan,
> >
> > Great to hear from you, it's fantastic to hear from maintainers of other
> > projects. But for your information -  we are now in the process of
> > completely rewriting part of Airflow 3, which means that we are deep down
> > and busy in various parts - and as Ash mentioned, we do not want to
> > "complicate" things by adding more **just now**.
> >
> > And I really love the idea of starting something outside in parallel.
> > There are a number of people here who are not that deeply involved in
> > Airflow 3 and they could take on the discussion - and maybe even work
> > with Manikandan directly on such an executor - somewhere on the side -
> > just to explore more of what Manikandan just started.
> >
> > I'd be curious to see the result of such work, where someone with deeper
> > Airflow understanding (but not necessarily involved in Airflow 3 work)
> > could run a self-guided experiment and see what could be achieved
> > even by developing an executor POC that would work for Airflow 2?
> >
> > Maybe someone ?
> >
> >
> > J.
> >
> > On Tue, Oct 22, 2024 at 1:22 PM Ash Berlin-Taylor <a...@apache.org>
> wrote:
> >
> > > This looks like it has some really cool features.
> > >
> > > > *Tl;DR; I would love to start discussion about creating (for Airflow
> > > > 3.x - it does not have to be Airflow 3.0) a new community executor
> > > > based on YuniKorn*
> > >
> > > I think this caveat to me is the main point, as long as it’s not in 3.0
> > > (and ideally for me not even in repo for the next few months) for two
> > > reasons:
> > >
> > > 1. AIP 72 is going to change the Executor interface somewhat, and we
> > > don’t know the exact details of how yet, so not having to worry about
> > > another executor to fix up and keep working would help us not slow
> > > down development of 3.0; and
> > > 2. I’m slightly nervous about the extra support load of a new executor
> > > at this time. It’s probably not all that much on the Airflow side of
> > > things, but this is just an unknown risk to me right now.
> > >
> > > Would it be possible to develop this out-of-tree for the time being?
> > >
> > > Thanks,
> > > Ash
> > >
> > > > On 18 Oct 2024, at 08:41, Shubham Raj <shubhamraj....@gmail.com>
> > wrote:
> > > >
> > > > Hi Jarek, Amogh, and everyone,
> > > >
> > > > I wanted to share my thoughts on the proposal to integrate YuniKorn,
> > > > and I'm definitely on board with it! As others mentioned, adding
> > > > YuniKorn as another executor could really enhance our scheduling
> > > > capabilities, especially for the more complex scenarios that Celery
> > > > and Kubernetes executors struggle with.
> > > >
> > > > One of the standout features of YuniKorn is its hierarchical queueing
> > > > and resource quota management, which is fantastic for handling
> > > > multi-tenant environments. This will help us keep resource-heavy
> > > > Airflow tasks from bogging down shared clusters and ensure that
> > > > resources are allocated fairly across different services.
> > > >
> > > > Now, regarding gang scheduling: as per my understanding, Airflow
> > > > operates on a sequential model because of its DAG structure, where
> > > > tasks must wait for their dependencies to finish before they can run.
> > > > This might seem at odds with the idea of gang scheduling, but there
> > > > are definitely scenarios where it could be useful. For instance, if
> > > > we have several independent data processing tasks that need to share
> > > > resources, gang scheduling could help us optimize resource use and
> > > > reduce latency by allowing those tasks to be scheduled at the same
> > > > time.
> > > >
> > > > Overall, I believe that integrating these YuniKorn features could
> > > > really boost Airflow’s capabilities, especially for complex workflows
> > > > or at least in resource-constrained environments. Looking forward to
> > > > hearing everyone’s thoughts!
> > > >
> > > > Thanks & Regards,
> > > > Shubham
> > > >
> > > > On Fri, Oct 18, 2024 at 10:19 AM Amogh Desai <amoghdesai....@gmail.com>
> > > > wrote:
> > > >
> > > >> Hi Jarek, Everyone,
> > > >>
> > > >> Thanks for starting this discussion!
> > > >> I agree with everyone so far that this will be more of an additional
> > > >> executor rather than a replacement for
> > > >> anything we currently have.
> > > >>
> > > > >> I had submitted a talk that was mainly trying to explain how we can
> > > > >> leverage some features of Yunikorn such as priority scheduling,
> > > > >> multi tenancy (per deployment in terms of resources) and preemption.
> > > > >> Not all of these features are fully implemented / integrated yet, but
> > > > >> I had planned to explore them and share my findings if my session got
> > > > >> selected. I was mainly trying to explore integration with
> > > > >> hierarchical queues and resource quotas.
> > > >>
> > > > >> To set the tone, we already have some examples running in our cluster
> > > > >> deployments. We use Airflow in Kubernetes with the K8sExecutor, where
> > > > >> we share space to run Airflow jobs and other data engineering
> > > > >> workloads.
> > > >>
> > > > >> Via the integration with Yunikorn, we are able to achieve a few
> > > > >> things:
> > > > >> 1. Priority Scheduling
> > > > >> We’ve set priorities for different services running in our cluster.
> > > > >> For example, let's say both Airflow jobs and Spark jobs run in a
> > > > >> cluster. We prioritize Spark Drivers equally with Airflow workers,
> > > > >> which ensures that Airflow workers get higher priority than Spark
> > > > >> Executors. This way, Airflow schedules won’t be missed, and it
> > > > >> doesn’t negatively impact Spark jobs because they can still run with
> > > > >> fewer executors.
> > > >>
> > > > >> 2. Resource Quotas: We also link Airflow namespaces (where the workers
> > > > >> and the core services run) with resource quotas to prevent a malformed
> > > > >> or resource-heavy Airflow task from taking over the entire K8s cluster
> > > > >> with a faulty DAG. This is important since we have both Airflow and
> > > > >> other data engineering workloads running together.
> > > >>
> > > > >> I had a chat with some folks from the Yunikorn team and apart from
> > > > >> this, I think a few other features of Yunikorn such as gang
> > > > >> scheduling, preemption, etc. could be beneficial to Airflow:
> > > > >> 1. Gang Scheduling
> > > > >> Airflow DAGs generally have a pattern where tasks are dependent on
> > > > >> each other - so let's say task1 -> task2 -> task3 ...
> > > > >> So even though there are so many tasks, there's just one DAG process.
> > > > >> This could benefit from gang scheduling if the whole task set can be
> > > > >> considered a single app. For those of you who aren't too familiar
> > > > >> with gang scheduling, it can be thought of as waiting for all your
> > > > >> friends to join you for a game rather than waiting for them one by
> > > > >> one (easiest example I could think of).
> > > >>
> > > > >> 2. Preemption
> > > > >> We can think of different angles to preemption based on the use
> > > > >> cases, such as preempting the entire app instead of using
> > > > >> per-request preemption, OR not preempting a task if it has a
> > > > >> dependent task, because preemption is expensive.
> > > >>
> > > > >> Overall, I believe the community would benefit from this integration,
> > > > >> and I think the Yunikorn team will support it as well.
> > > >>
> > > >> Thanks & Regards,
> > > >> Amogh Desai
> > > >>
> > > >>
> > > >> On Thu, Oct 17, 2024 at 11:06 PM Jarek Potiuk <ja...@potiuk.com>
> > wrote:
> > > >>
> > > > >>>> As Jens said "K8sExecutor++".
> > > > >>>> Just to be precise, I don't believe that this can be a replacement
> > > > >>>> for Celery Executor (at least at first glance).
> > > > >>>
> > > > >>> Yes. Fully agree. My bad framing from the initial message.
> > > > >>>
> > > > >>>> I also believe that for this to be effective, this will need some
> > > > >>>> dedicated work including additional information about the task.
> > > >>>
> > > > >>> Oh absolutely. For me it's more of a (when we agree it's a good
> > > > >>> direction) - let's keep it as something that **might** eventually
> > > > >>> happen and not in 3.0. This is really an "if we hear more cases that it
> > > > >>> might solve, let's see if we need any changes in current Airflow 3 work
> > > > >>> to enable it or make it easier" kinda thing. More like making a mental
> > > > >>> space for this to happen when we are discussing other things. Last
> > > > >>> thing I want to do is to add more substantial work for our 3.0 efforts.
> > > >>>
> > > >>>> I am very curious for Amogh to chime in on this :)
> > > >>>
> > > >>> Knowing that there was a talk in-preparation, me too :D
> > > >>>
> > > > >>>> The biggest decision is whether this is a community managed executor
> > > > >>>> or if we can find stakeholders to create this outside of Airflow
> > > > >>>> (those stakeholders could be some of us from the community).
> > > >>>
> > > > >>> That's an excellent point Niko. Yes. It could be done outside. It could
> > > > >>> be done by Yunikorn people (unlikely - they likely have more work than
> > > > >>> they can handle) or one of the stakeholders (at least initially) - and
> > > > >>> published and released and battle-tested by them and eventually
> > > > >>> contributed to the community. This is I think a very good pattern for
> > > > >>> Open Source, where commercial users might reap the benefits of their
> > > > >>> investment as "first movers" while paying the price for "teething
> > > > >>> problems" - but then later contributing back to the community. A
> > > > >>> company starting with C and ending with "a" comes to my mind
> > > > >>> immediately as an obvious candidate if you ask me.
> > > >>>
> > > >>> J.
> > > >>>
> > > >>>
> > > >>> On Thu, Oct 17, 2024 at 7:19 PM Oliveira, Niko
> > > >> <oniko...@amazon.com.invalid
> > > >>>>
> > > >>> wrote:
> > > >>>
> > > > >>>> I love the idea. Generally it is quite easy now to add new executors
> > > > >>>> and there is no harm in having more options. I don't think we need to
> > > > >>>> justify it as a replacement of anything honestly.
> > > > >>>>
> > > > >>>> The biggest decision is whether this is a community managed executor
> > > > >>>> or if we can find stakeholders to create this outside of Airflow
> > > > >>>> (those stakeholders could be some of us from the community).
> > > >>>>
> > > >>>> Cheers,
> > > >>>> Niko
> > > >>>>
> > > >>>> ________________________________
> > > >>>> From: Vikram Koka <vik...@astronomer.io.INVALID>
> > > >>>> Sent: Wednesday, October 16, 2024 4:13:27 PM
> > > >>>> To: dev@airflow.apache.org
> > > > >>>> Subject: RE: [EXT] [DISCUSS] Create community "Apache YuniKorn" executor?
> > > >>>>
> > > > >>>> I am supportive of this in the long term (i.e. post-3.0) as an
> > > > >>>> additional Executor similar to the Kubernetes Executor.
> > > > >>>> As Jens said "K8sExecutor++".
> > > > >>>>
> > > > >>>> Just to be precise, I don't believe that this can be a replacement
> > > > >>>> for Celery Executor (at least at first glance).
> > > > >>>>
> > > > >>>> I also believe that for this to be effective, this will need some
> > > > >>>> dedicated work including additional information about the task.
> > > >>>> I am very curious for Amogh to chime in on this :)
> > > >>>>
> > > >>>>
> > > >>>>
> > > >>>> On Tue, Oct 15, 2024 at 1:58 PM Jarek Potiuk <ja...@potiuk.com>
> > > wrote:
> > > >>>>
> > > > >>>>> Yeah - it was a bit of dramatisation when I recalled the Celery
> > > > >>>>> "replacement" ;) . And yes it's not really an "alternative" to
> > > > >>>>> Celery; Celery is here to stay for short tasks.
> > > > >>>>>
> > > > >>>>> Almost by definition it is meant to run heavier tasks (for example
> > > > >>>>> batch inference) where multiple tasks running in parallel share the
> > > > >>>>> same GPU for example - because that's what we want to optimize.
> > > >>>>>
> > > >>>>> And yes - it provides features that K8S executor does not - gang
> > > >>>>> scheduling, and sophisticated preemption logic.
> > > >>>>>
> > > >>>>> J.
> > > >>>>>
> > > >>>>> On Tue, Oct 15, 2024 at 8:40 PM Jens Scheffler
> > > >>>> <j_scheff...@gmx.de.invalid
> > > >>>>>>
> > > >>>>> wrote:
> > > >>>>>
> > > >>>>>> Hi Jarek,
> > > >>>>>>
> > > > >>>>>> scanning but not reading the full docs I understand that YuniKorn is
> > > > >>>>>> a specialized, more advanced K8sExecutor - all workload also runs in
> > > > >>>>>> PODs?
> > > > >>>>>>
> > > > >>>>>> If this is the right understanding then it might be a K8sExecutor++
> > > > >>>>>> or could replace this... but Celery usually plays very well if you
> > > > >>>>>> have very small and high-frequency tasks. Don't know if I
> > > > >>>>>> mis-interpret the docs... but would it scale down to very small
> > > > >>>>>> PythonOperator/@task decorated tasks with a few lines of code as
> > > > >>>>>> well?
> > > >>>>>>
> > > >>>>>> Jens
> > > >>>>>>
> > > >>>>>> On 15.10.24 12:55, Jarek Potiuk wrote:
> > > >>>>>>> Hello here,
> > > >>>>>>>
> > > > >>>>>>> *Tl;DR; I would love to start discussion about creating (for
> > > > >>>>>>> Airflow 3.x - it does not have to be Airflow 3.0) a new community
> > > > >>>>>>> executor based on YuniKorn*
> > > >>>>>>>
> > > > >>>>>>> You might remember my point "replacing Celery Executor" when I
> > > > >>>>>>> raised the Airflow 3 question. I never actually "meant" to replace
> > > > >>>>>>> (and remove) Celery Executor, but I was more on a quest to see if
> > > > >>>>>>> we have a viable alternative.
> > > >>>>>>>
> > > > >>>>>>> And I think we have one with Apache YuniKorn:
> > > > >>>>>>> https://yunikorn.apache.org/
> > > >>>>>>>
> > > > >>>>>>> While it is not a direct replacement (so I'd say it should be an
> > > > >>>>>>> additional executor), I think Yunikorn can provide us with a number
> > > > >>>>>>> of features that we currently cannot give to our users and from the
> > > > >>>>>>> discussions I had and a talk I saw at the Community Over Code in
> > > > >>>>>>> Denver, I believe it might be something that makes Airflow more
> > > > >>>>>>> capable, especially in the "optimization wars" context that I wrote
> > > > >>>>>>> about in
> > > > >>>>>>> https://lists.apache.org/thread/1mp6jcfvx67zd3jjt9w2hlj0c5ysbh8r
> > > >>>>>>>
> > > > >>>>>>> It seems like quite a good fit for the "Inference" use case that we
> > > > >>>>>>> want to support for Airflow 3.
> > > >>>>>>>
> > > > >>>>>>> At the Community Over Code I attended a talk (and had quite a nice
> > > > >>>>>>> follow-up discussion) from Apple engineers - named: "Maximizing GPU
> > > > >>>>>>> Utilization: Apache YuniKorn Preemption" and had a very long
> > > > >>>>>>> discussion with Cloudera people who have been using YuniKorn for
> > > > >>>>>>> years to optimize their workloads.
> > > >>>>>>>
> > > > >>>>>>> The presentation is not recorded, but I will try to get slides and
> > > > >>>>>>> send them your way.
> > > >>>>>>>
> > > > >>>>>>> I think we should take a close look at it - because it seems to
> > > > >>>>>>> save a ton of implementation effort for the Apple team running
> > > > >>>>>>> Batch inference for their multi-tenant internal environment - which
> > > > >>>>>>> I think is precisely what you want to do.
> > > >>>>>>>
> > > > >>>>>>> YuniKorn (https://yunikorn.apache.org/) is an "app-aware" scheduler
> > > > >>>>>>> that has a number of queue / capacity management models and
> > > > >>>>>>> policies that allow controlling various applications - competing
> > > > >>>>>>> for GPUs from a common pool.
> > > >>>>>>>
> > > >>>>>>> They mention things like:
> > > >>>>>>>
> > > > >>>>>>> * Gang Scheduling / with gang scheduling preemption where there are
> > > > >>>>>>> workloads requiring minimum number of workers
> > > > >>>>>>> * Supports Latency sensitive workloads
> > > > >>>>>>> * Resource quota management - things like priorities of execution
> > > > >>>>>>> * YuniKorn preemption - with guaranteed capacity and preemption
> > > > >>>>>>> when needed - which improves the utilisation
> > > > >>>>>>> * Preemption that minimizes preemption cost (Pod level preemption
> > > > >>>>>>> rather than application level preemption) - very customizable
> > > > >>>>>>> preemption with opt-in/opt-out, queues, resource weights, fencing,
> > > > >>>>>>> supporting fifo/lifo sorting etc.
> > > > >>>>>>> * Runs in Cloud and on-premise
> > > >>>>>>>
> > > > >>>>>>> The talk described quite a few scenarios of preemption/utilization/
> > > > >>>>>>> guaranteed resources etc. They also outlined which new features
> > > > >>>>>>> YuniKorn is working on (intra-queue preemption etc.) and what
> > > > >>>>>>> could be done in the future.
> > > >>>>>>>
> > > >>>>>>>
> > > > >>>>>>> Coincidentally - Amogh Desai with a friend submitted a talk for
> > > > >>>>>>> Airflow Summit:
> > > >>>>>>>
> > > >>>>>>> "A Step Towards Multi-Tenant Airflow Using Apache YuniKorn"
> > > >>>>>>>
> > > > >>>>>>> Which did not make it to the Summit (another talk of Amogh's did) -
> > > > >>>>>>> but I think back then we had not realized the potential of
> > > > >>>>>>> utilising YuniKorn to optimize workflows managed by Airflow.
> > > >>>>>>>
> > > > >>>>>>> But we seem to have people in the community who know more about the
> > > > >>>>>>> YuniKorn <> Airflow relation (Amogh :) ) and could probably comment
> > > > >>>>>>> and add some "from the trenches" experience to the discussion.
> > > > >>>>>>>
> > > > >>>>>>> Here is the description of the talk that Amogh submitted:
> > > >>>>>>>
> > > > >>>>>>> Multi-tenant Airflow is hard and there have been novel approaches
> > > > >>>>>>> in the recent past to converge this gap. A key obstacle in
> > > > >>>>>>> multi-tenant Airflow is the management of cluster resources. This
> > > > >>>>>>> is crucial to avoid one malformed workload from hijacking an entire
> > > > >>>>>>> cluster. It is also vital to restrict users and groups from
> > > > >>>>>>> monopolizing resources in a shared cluster using their workloads.
> > > > >>>>>>>
> > > > >>>>>>> To tackle these challenges, we turn to Apache YuniKorn, a K8s
> > > > >>>>>>> scheduler catering all kinds of workloads. We leverage YuniKorn’s
> > > > >>>>>>> hierarchical queues in conjunction with resource quotas to
> > > > >>>>>>> establish multi-tenancy at both the shared namespace level and
> > > > >>>>>>> within individual namespaces where Airflow is deployed.
> > > > >>>>>>>
> > > > >>>>>>> YuniKorn also introduces Airflow to a new dimension of preemption.
> > > > >>>>>>> Now, Airflow workers can preempt resources from lower-priority
> > > > >>>>>>> jobs, ensuring critical schedules in our data pipelines are met
> > > > >>>>>>> without compromise.
> > > > >>>>>>>
> > > > >>>>>>> Join us for a discussion on integrating Airflow with YuniKorn,
> > > > >>>>>>> unraveling solutions to these multi-tenancy challenges. We will
> > > > >>>>>>> also share our past experiences while scaling Airflow and the steps
> > > > >>>>>>> we have taken to handle real world production challenges in
> > > > >>>>>>> equitable multi-tenant K8s clusters.
> > > >>>>>>>
> > > > >>>>>>> I would love to hear what you think about it. I know we are deep
> > > > >>>>>>> into Airflow 3.0 implementation - but that one can be
> > > > >>>>>>> discussed/implemented independently and maybe it's a good idea to
> > > > >>>>>>> start doing it earlier rather than later if we see that it has good
> > > > >>>>>>> potential.
> > > >>>>>>>
> > > >>>>>>> J.
> > > >>>>>>>
> > > >>>>>>
> > > >>>>>>
> > > >>
> ---------------------------------------------------------------------
> > > >>>>>> To unsubscribe, e-mail: dev-unsubscr...@airflow.apache.org
> > > >>>>>> For additional commands, e-mail: dev-h...@airflow.apache.org
> > > >>>>>>
> > > >>>>>>
> > > >>>>>
> > > >>>>
> > > >>>
> > > >>
> > >
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: dev-unsubscr...@airflow.apache.org
> > > For additional commands, e-mail: dev-h...@airflow.apache.org
> > >
> > >
> >
>
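
The "wait for all your friends before starting the game" description of
gang scheduling quoted above can be illustrated with a small sketch. This
is not YuniKorn's API or algorithm, only a toy all-or-nothing admission
loop; the names Gang and admit, and the CPU numbers, are hypothetical and
chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Gang:
    """A group of tasks that is only useful if all members start together."""
    name: str
    cpus_per_member: int
    min_members: int


def admit(gangs, free_cpus):
    """Admit gangs in order; a gang is placed only if ALL its members fit.

    Returns the names of admitted gangs and the CPUs left over.
    A gang that does not fit as a whole gets nothing (no partial placement).
    """
    admitted = []
    for gang in gangs:
        needed = gang.cpus_per_member * gang.min_members
        if needed <= free_cpus:
            free_cpus -= needed
            admitted.append(gang.name)
    return admitted, free_cpus


# Hypothetical workloads competing for a 14-CPU pool.
gangs = [
    Gang("inference", cpus_per_member=4, min_members=3),  # needs 12
    Gang("etl", cpus_per_member=2, min_members=2),        # needs 4
    Gang("report", cpus_per_member=1, min_members=1),     # needs 1
]
admitted, left = admit(gangs, free_cpus=14)
print(admitted, left)
```

Here the "etl" gang is skipped entirely rather than started with one of
its two members, which is the property that distinguishes gang scheduling
from placing pods one by one.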
