I am supportive of this in the long term (i.e. post-3.0) as an additional Executor similar to the Kubernetes Executor. As Jens said "K8sExecutor++".
Just to be precise, I don't believe that this can be a replacement for the Celery Executor (at least at first glance). I also believe that for this to be effective, it will need some dedicated work, including additional information about the task.

I am very curious for Amogh to chime in on this :)

On Tue, Oct 15, 2024 at 1:58 PM Jarek Potiuk <ja...@potiuk.com> wrote:

> Yeah - it was a bit of dramatisation when I recalled the Celery
> "replacement" ;) . And yes, it's not really an "alternative" to Celery;
> Celery is here to stay for short tasks.
>
> Almost by definition it is meant to run heavier tasks (for example batch
> inference) where multiple tasks running in parallel share the same GPU,
> for example - because that's what we want to optimize.
>
> And yes - it provides features that the K8S executor does not - gang
> scheduling and sophisticated preemption logic.
>
> J.
>
> On Tue, Oct 15, 2024 at 8:40 PM Jens Scheffler <j_scheff...@gmx.de.invalid>
> wrote:
>
> > Hi Jarek,
> >
> > scanning but not reading the full docs I understand that YuniKorn is a
> > specialized, more advanced K8sExecutor - all workload also runs in PODs?
> >
> > If this is the right understanding then it might be a K8sExecutor++ or
> > could replace this... but Celery usually plays very well if you
> > have very small and high-frequency tasks. Don't know if I misinterpret
> > the docs... but would it scale down to very small
> > PythonOperator/@task decorated tasks with a few lines of code as well?
> >
> > Jens
> >
> > On 15.10.24 12:55, Jarek Potiuk wrote:
> > > Hello here,
> > >
> > > *TL;DR: I would love to start a discussion about creating (for Airflow
> > > 3.x - it does not have to be Airflow 3.0) a new community executor
> > > based on YuniKorn*
> > >
> > > You might remember my point "replacing Celery Executor" when I raised
> > > the Airflow 3 question.
> > > I never actually "meant" to replace (and remove) the Celery
> > > Executor, but I was more on a quest to see if we have a viable
> > > alternative.
> > >
> > > And I think we have one with Apache YuniKorn:
> > > https://yunikorn.apache.org/
> > >
> > > While it is not a direct replacement (so I'd say it should be an
> > > additional executor), I think YuniKorn can provide us with a number of
> > > features that we currently cannot give to our users, and from the
> > > discussions I had and the talk I saw at Community Over Code in Denver,
> > > I believe it might make Airflow more capable, especially in the
> > > "optimization wars" context that I wrote about in
> > > https://lists.apache.org/thread/1mp6jcfvx67zd3jjt9w2hlj0c5ysbh8r
> > >
> > > It seems like quite a good fit for the "Inference" use case that we
> > > want to support for Airflow 3.
> > >
> > > At Community Over Code I attended a talk (and had quite a nice
> > > follow-up discussion) from Apple engineers, named "Maximizing GPU
> > > Utilization: Apache YuniKorn Preemption", and had a very long
> > > discussion with Cloudera people who have been using YuniKorn for years
> > > to optimize their workloads.
> > >
> > > The presentation was not recorded, but I will try to get the slides
> > > and send them your way.
> > >
> > > I think we should take a close look at it - because it seems to save a
> > > ton of implementation effort for the Apple team running batch
> > > inference for their multi-tenant internal environment - which I think
> > > is precisely what you want to do.
> > >
> > > YuniKorn (https://yunikorn.apache.org/) is an "app-aware" scheduler
> > > that has a number of queue / capacity management models and policies
> > > that allow controlling various applications competing for GPUs from a
> > > common pool.
> > >
> > > They mention things like:
> > >
> > > * Gang scheduling / gang scheduling preemption for workloads
> > >   requiring a minimum number of workers
> > > * Support for latency-sensitive workloads
> > > * Resource quota management - things like priorities of execution
> > > * YuniKorn preemption - with guaranteed capacity and preemption when
> > >   needed - which improves utilisation
> > > * Preemption that minimizes preemption cost (Pod-level preemption
> > >   rather than application-level preemption) - very customizable
> > >   preemption with opt-in/opt-out, queues, resource weights, fencing,
> > >   supporting fifo/lifo sorting etc.
> > > * Runs in cloud and on-premise
> > >
> > > The talk described quite a few scenarios of preemption/utilization/
> > > guaranteed resources etc. They also outlined which new features
> > > YuniKorn is working on (intra-queue preemption etc.) and what future
> > > things can be done.
> > >
> > > Coincidentally - Amogh Desai with a friend submitted a talk for
> > > Airflow Summit:
> > >
> > > "A Step Towards Multi-Tenant Airflow Using Apache YuniKorn"
> > >
> > > Which did not make it to the Summit (another talk of Amogh's did) -
> > > but I think back then we had not realized the potential of utilising
> > > YuniKorn to optimize workflows managed by Airflow.
> > >
> > > But we seem to have people in the community who know more about the
> > > YuniKorn <> Airflow relation (Amogh :) ) and could probably comment
> > > and add some "from the trenches" experience to the discussion.
> > >
> > > Here is the description of the talk that Amogh submitted:
> > >
> > > Multi-tenant Airflow is hard and there have been novel approaches in
> > > the recent past to converge this gap. A key obstacle in multi-tenant
> > > Airflow is the management of cluster resources. This is crucial to
> > > avoid one malformed workload from hijacking an entire cluster.
> > > It is also vital to restrict users and groups from monopolizing
> > > resources in a shared cluster using their workloads.
> > >
> > > To tackle these challenges, we turn to Apache YuniKorn, a K8s
> > > scheduler catering all kinds of workloads. We leverage YuniKorn’s
> > > hierarchical queues in conjunction with resource quotas to establish
> > > multi-tenancy at both the shared namespace level and within
> > > individual namespaces where Airflow is deployed.
> > >
> > > YuniKorn also introduces Airflow to a new dimension of preemption.
> > > Now, Airflow workers can preempt resources from lower-priority jobs,
> > > ensuring critical schedules in our data pipelines are met without
> > > compromise.
> > >
> > > Join us for a discussion on integrating Airflow with YuniKorn,
> > > unraveling solutions to these multi-tenancy challenges. We will also
> > > share our past experiences while scaling Airflow and the steps we
> > > have taken to handle real world production challenges in equitable
> > > multi-tenant K8s clusters.
> > >
> > > I would love to hear what you think about it. I know we are deep into
> > > the Airflow 3.0 implementation - but this one can be
> > > discussed/implemented independently, and maybe it's a good idea to
> > > start sooner rather than later if we see that it has good potential.
> > >
> > > J.