Re: [DISCUSS] Create community "Apache YuniKorn" executor ?

Vikram Koka Wed, 16 Oct 2024 16:13:50 -0700

I am supportive of this in the long term (i.e. post-3.0) as an additional
Executor similar to the Kubernetes Executor.
As Jens said "K8sExecutor++".


Just to be precise, I don't believe that this can be a replacement for
Celery Executor (at least at first glance).

I also believe that for this to be effective, this will need some dedicated
work including additional information about the task.
I am very curious for Amogh to chime in on this :)



On Tue, Oct 15, 2024 at 1:58 PM Jarek Potiuk <ja...@potiuk.com> wrote:

> Yeah - it was a bit of dramatisation when I recalled the Celery
> "replacement" ;). And yes, it's not really an "alternative" to Celery;
> Celery is here to stay for short tasks.
>
> Almost by definition it is meant to run heavier tasks (for example batch
> inference) where multiple tasks running in parallel share the same GPU -
> because that's what we want to optimize.
>
> And yes - it provides features that the K8S executor does not: gang
> scheduling and sophisticated preemption logic.
>
> J.
>
> On Tue, Oct 15, 2024 at 8:40 PM Jens Scheffler <j_scheff...@gmx.de.invalid>
> wrote:
>
> > Hi Jarek,
> >
> > Having scanned (but not fully read) the docs, I understand that YuniKorn
> > is a specialized, more advanced K8sExecutor - does all workload also run
> > in Pods?
> >
> > If this is the right understanding then it might be a K8sExecutor++ or
> > could replace it... but Celery usually performs very well if you have
> > very small and high-frequency tasks. I don't know if I am misinterpreting
> > the docs... but would it also scale down to very small
> > PythonOperator/@task decorated tasks with a few lines of code?
> >
> > Jens
> >
> > On 15.10.24 12:55, Jarek Potiuk wrote:
> > > Hello here,
> > >
> > > *TL;DR: I would love to start a discussion about creating (for Airflow
> > > 3.x - it does not have to be Airflow 3.0) a new community executor based
> > > on YuniKorn*
> > >
> > > You might remember my point about "replacing Celery Executor" when I
> > > raised the Airflow 3 question. I never actually "meant" to replace (and
> > > remove) Celery Executor, but I was more on a quest to see if we have a
> > > viable alternative.
> > >
> > > And I think we have one with Apache YuniKorn:
> > > https://yunikorn.apache.org/
> > >
> > > While it is not a direct replacement (so I'd say it should be an
> > > additional executor), I think YuniKorn can provide us with a number of
> > > features that we currently cannot give to our users. From the
> > > discussions I had and the talk I saw at Community Over Code in Denver, I
> > > believe it might also make Airflow more capable, especially in the
> > > "optimization wars" context that I wrote about in
> > > https://lists.apache.org/thread/1mp6jcfvx67zd3jjt9w2hlj0c5ysbh8r
> > >
> > > It seems like quite a good fit for the "Inference" use case that we want
> > > to support for Airflow 3.
> > >
> > > At Community Over Code I attended a talk from Apple engineers (and had
> > > quite a nice follow-up discussion), named "Maximizing GPU Utilization:
> > > Apache YuniKorn Preemption", and had a very long discussion with
> > > Cloudera people who have been using YuniKorn for years to optimize their
> > > workloads.
> > >
> > > The presentation was not recorded, but I will try to get the slides and
> > > send them your way.
> > >
> > > I think we should take a close look at it - because it seems to save a
> > > ton of implementation effort for the Apple team running batch inference
> > > for their multi-tenant internal environment - which I think is precisely
> > > what you want to do.
> > >
> > > YuniKorn (https://yunikorn.apache.org/) is an "app-aware" scheduler
> > > that has a number of queue / capacity management models and policies
> > > that allow controlling various applications competing for GPUs from a
> > > common pool.
> > >
> > > They mention things like:
> > >
> > > * Gang scheduling / gang scheduling with preemption, where there are
> > > workloads requiring a minimum number of workers
> > > * Support for latency-sensitive workloads
> > > * Resource quota management - things like priorities of execution
> > > * YuniKorn preemption - with guaranteed capacity and preemption when
> > > needed - which improves utilisation
> > > * Preemption that minimizes preemption cost (Pod-level preemption rather
> > > than application-level preemption) - very customizable preemption with
> > > opt-in/opt-out, queues, resource weights, fencing, support for FIFO/LIFO
> > > sorting etc.
> > > * Runs in the cloud and on-premise
> > >
> > > The talk described quite a few scenarios of preemption/utilization/
> > > guaranteed resources etc. They also outlined which new features YuniKorn
> > > is working on (intra-queue preemption etc.) and what could be done in the
> > > future.
> > >
> > >
> > > Coincidentally - Amogh Desai with a friend submitted a talk for Airflow
> > > Summit:
> > >
> > > "A Step Towards Multi-Tenant Airflow Using Apache YuniKorn"
> > >
> > > Which did not make it to the Summit (another talk of Amogh's did) - but I
> > > think back then we had not realized the potential of utilising YuniKorn
> > > to optimize workflows managed by Airflow.
> > >
> > > But we seem to have people in the community who know more about the
> > > YuniKorn <> Airflow relation (Amogh :) ) and could probably comment and
> > > add some "from the trenches" experience to the discussion.
> > >
> > > Here is the description of the talk that Amogh submitted:
> > >
> > > Multi-tenant Airflow is hard and there have been novel approaches in the
> > > recent past to bridge this gap. A key obstacle in multi-tenant Airflow
> > > is the management of cluster resources. This is crucial to prevent one
> > > malformed workload from hijacking an entire cluster. It is also vital to
> > > restrict users and groups from monopolizing resources in a shared
> > > cluster using their workloads.
> > >
> > > To tackle these challenges, we turn to Apache YuniKorn, a K8s scheduler
> > > catering to all kinds of workloads. We leverage YuniKorn’s hierarchical
> > > queues in conjunction with resource quotas to establish multi-tenancy at
> > > both the shared namespace level and within individual namespaces where
> > > Airflow is deployed.
> > >
> > > YuniKorn also introduces Airflow to a new dimension of preemption. Now,
> > > Airflow workers can preempt resources from lower-priority jobs, ensuring
> > > critical schedules in our data pipelines are met without compromise.
> > >
> > > Join us for a discussion on integrating Airflow with YuniKorn,
> > > unraveling solutions to these multi-tenancy challenges. We will also
> > > share our past experiences while scaling Airflow and the steps we have
> > > taken to handle real world production challenges in equitable
> > > multi-tenant K8s clusters.
> > >
> > > I would love to hear what you think about it. I know we are deep into
> > > Airflow 3.0 implementation - but this one can be discussed/implemented
> > > independently, and maybe it's a good idea to start sooner rather than
> > > later if we see that it has good potential.
> > >
> > > J.
> > >
> >
>
