Great to hear from you, Mani. I am interested in collaborating with you on this one. It seems like a promising initial demo, though I have yet to catch up on the specifics.

Thanks & Regards,
Amogh Desai

On Tue, Oct 22, 2024 at 8:56 PM Jarek Potiuk <ja...@potiuk.com> wrote:

> > Would it be possible to develop this out-of-tree for the time being?
>
> Oh absolutely. I definitely do not want to add "more" on the Airflow 3 band-wagon. I am even quite sure I will not be the one implementing it, nor anyone involved in Airflow 3. It's more of a conceptual discussion - and an attempt to present it as an interesting idea someone could take a closer look at.
>
> Manikandan,
>
> Great to hear from you, it's fantastic to hear from maintainers of other projects. But for your information - we are now in the process of completely rewriting part of Airflow 3, which means that we are deep down and busy in various parts - and as Ash mentioned, we do not want to "complicate" things by adding more **just now**.
>
> And I really love the idea of starting something outside in parallel. There are a number of people here who are not that deeply involved in Airflow 3 and they could take on the discussion - and maybe even work with Manikandan directly on such an executor - somewhere on the side - just to explore more of what Manikandan just started.
>
> I'd be curious to see the result of such work, where someone with deeper Airflow understanding (but not necessarily involved in Airflow 3 work) could make some self-guided experiment and look at what could be achieved even by developing an executor POC that would work for Airflow 2?
>
> Maybe someone?
>
> J.
>
> On Tue, Oct 22, 2024 at 1:22 PM Ash Berlin-Taylor <a...@apache.org> wrote:
>
> > This looks like it has some really cool features.
> >
> > > *Tl;DR; I would love to start discussion about creating (for Airflow 3.x - it does not have to be Airflow 3.0) a new community executor based on YuniKorn*
> >
> > I think this caveat to me is the main point, as long as it’s not in 3.0 (and ideally for me not even in repo for the next few months) for two reasons:
> >
> > 1. AIP 72 is going to change the Executor interface somewhat, and we don’t know the exact details of how yet, so not having to worry about another executor to fix up and keep working would be good, to not slow down development of 3.0; and
> > 2. I’m slightly nervous about the extra support load of a new executor at this time. It’s probably not all that much on the Airflow side of things, but this is just an unknown risk to me right now.
> >
> > Would it be possible to develop this out-of-tree for the time being?
> >
> > Thanks,
> > Ash
> >
> > > On 18 Oct 2024, at 08:41, Shubham Raj <shubhamraj....@gmail.com> wrote:
> > >
> > > Hi Jarek, Amogh, and everyone,
> > >
> > > I wanted to share my thoughts on the proposal to integrate YuniKorn, and I'm definitely on board with it! As others mentioned, adding YuniKorn as another executor could really enhance our scheduling capabilities, especially for the more complex scenarios that Celery and Kubernetes executors struggle with.
> > >
> > > One of the standout features of YuniKorn is its hierarchical queueing and resource quota management, which is fantastic for handling multi-tenant environments. This will help us keep resource-heavy Airflow tasks from bogging down shared clusters and ensure that resources are allocated fairly across different services.
> > > Now, regarding gang scheduling: as I understand it, Airflow operates on a sequential model because of its DAG structure, so tasks must wait for their dependencies to finish before they can run. This might seem at odds with the idea of gang scheduling, but there are definitely scenarios where it could be useful. For instance, if we have several independent data processing tasks that need to share resources, gang scheduling could help us optimize resource use and reduce latency by allowing those tasks to be scheduled at the same time.
> > >
> > > Overall, I believe that integrating these YuniKorn features could really boost Airflow’s capabilities, especially for complex workflows or at least in resource-constrained environments. Looking forward to hearing everyone’s thoughts!
> > >
> > > Thanks & Regards,
> > > Shubham
> > >
> > > On Fri, Oct 18, 2024 at 10:19 AM Amogh Desai <amoghdesai....@gmail.com> wrote:
> > >
> > >> Hi Jarek, Everyone,
> > >>
> > >> Thanks for starting this discussion! I agree with everyone so far that this will be more of an additional executor rather than a replacement for anything we currently have.
> > >>
> > >> I had submitted a talk that was mainly trying to explain how we can leverage some features of YuniKorn such as priority scheduling, multi-tenancy (per deployment in terms of resources) and preemption. Not all of these features are fully implemented / integrated yet, but I had planned to explore them and share my findings if my session got selected. I was trying to explore mainly around integration with hierarchical queues and resource quotas.
> > >>
> > >> To set a tone, we already have some examples running in our cluster deployments.
> > >> We use Airflow in Kubernetes with the K8sExecutor, where we share space to run Airflow jobs and other data engineering workloads.
> > >>
> > >> Via the integration with YuniKorn, we are able to achieve a few things:
> > >>
> > >> 1. Priority Scheduling
> > >> We’ve set priorities for different services running in our cluster. For example, let's say both Airflow jobs and Spark jobs run in a cluster. We prioritize Spark Drivers equally with Airflow workers, which ensures that Airflow workers get more priority over Spark Executors. This way, Airflow schedules won’t be missed, and it doesn’t negatively impact Spark jobs because they can still run with fewer executors.
> > >>
> > >> 2. Resource Quotas
> > >> We also link Airflow namespaces (where the workers and the core services run) with resource quotas to prevent a malformed or resource-heavy Airflow task from taking over the entire K8s cluster with a faulty DAG. This is important since we have both Airflow and other data engineering workloads running together.
> > >>
> > >> I had a chat with some folks from the YuniKorn team and apart from this, I think a few other features of YuniKorn such as gang scheduling, preemption, etc. could be beneficial to Airflow:
> > >>
> > >> 1. Gang Scheduling
> > >> Airflow DAGs generally have a pattern where tasks are dependent on each other - so let's say task1 -> task2 -> task3 ... So even though there are so many tasks, there's just one DAG process. This could benefit from gang scheduling if the whole task set can be considered as a single app.
> > >> For those of you who aren't too familiar with gang scheduling, it can be thought of as waiting for all your friends to join you for a game rather than waiting for them one by one (easiest example I could think of).
> > >>
> > >> 2. Preemption
> > >> We can think of different angles to preemption based on the use cases, like preempting the entire app instead of using per-request preemption, OR not preempting a task if it has a dependent task, because preemption is expensive.
> > >>
> > >> Overall, I believe the community would benefit from this integration, and I think the YuniKorn team will support it as well.
> > >>
> > >> Thanks & Regards,
> > >> Amogh Desai
> > >>
> > >>
> > >> On Thu, Oct 17, 2024 at 11:06 PM Jarek Potiuk <ja...@potiuk.com> wrote:
> > >>
> > >>>> As Jens said "K8sExecutor++".
> > >>>> Just to be precise, I don't believe that this can be a replacement for Celery Executor (at least at first glance).
> > >>>
> > >>> Yes. Fully agree. My bad framing from the initial message.
> > >>>
> > >>>> I also believe that for this to be effective, this will need some dedicated work including additional information about the task.
> > >>>
> > >>> Oh absolutely. For me it's more of a (when we agree it's a good direction) - let's keep it as something that **might** eventually happen and not in 3.0. This is really an "if we hear more cases that it might solve, let's see if we need any changes in current Airflow 3 work to enable it or make it easier" kinda thing. More like making a mental space for this to happen when we are discussing other things. The last thing I want to do is to add more substantial work for our 3.0 efforts.
> > >>>
> > >>>> I am very curious for Amogh to chime in on this :)
> > >>>
> > >>> Knowing that there was a talk in-preparation, me too :D
> > >>>
> > >>>> The biggest decision is whether this is a community managed executor or if we can find stakeholders to create this outside of Airflow (those stakeholders could be some of us from the community).
> > >>>
> > >>> That's an excellent point Niko. Yes. It could be done outside. It could be done by Yunikorn people (unlikely - they likely have more work than they can handle) or one of the stakeholders (at least initially) - and published and released and battle-tested by them and eventually contributed to the community. This is I think a very good pattern for Open Source, where commercial users might reap the benefits of their investment as "first movers" while paying the price for "teething problems" - but then later contributing back to the community. A company starting with C and ending with a comes to my mind immediately as an obvious candidate if you ask me.
> > >>>
> > >>> J.
> > >>>
> > >>>
> > >>> On Thu, Oct 17, 2024 at 7:19 PM Oliveira, Niko <oniko...@amazon.com.invalid> wrote:
> > >>>
> > >>>> I love the idea. Generally it is quite easy now to add new executors and there is no harm in having more options. I don't think we need to justify it as a replacement of anything honestly.
> > >>>>
> > >>>> The biggest decision is whether this is a community managed executor or if we can find stakeholders to create this outside of Airflow (those stakeholders could be some of us from the community).
> > >>>>
> > >>>> Cheers,
> > >>>> Niko
> > >>>>
> > >>>> ________________________________
> > >>>> From: Vikram Koka <vik...@astronomer.io.INVALID>
> > >>>> Sent: Wednesday, October 16, 2024 4:13:27 PM
> > >>>> To: dev@airflow.apache.org
> > >>>> Subject: RE: [EXT] [DISCUSS] Create community "Apache YuniKorn" executor ?
> > >>>>
> > >>>> CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe.
> > >>>>
> > >>>> I am supportive of this in the long term (i.e. post-3.0) as an additional Executor similar to the Kubernetes Executor.
> > >>>> As Jens said "K8sExecutor++".
> > >>>>
> > >>>> Just to be precise, I don't believe that this can be a replacement for Celery Executor (at least at first glance).
> > >>>>
> > >>>> I also believe that for this to be effective, this will need some dedicated work including additional information about the task.
> > >>>> I am very curious for Amogh to chime in on this :)
> > >>>>
> > >>>>
> > >>>> On Tue, Oct 15, 2024 at 1:58 PM Jarek Potiuk <ja...@potiuk.com> wrote:
> > >>>>
> > >>>>> Yeah - it was a bit of dramatisation when I recalled the Celery "replacement" ;) . And yes it's not really "alternative" to Celery, Celery is there to stay for short tasks.
> > >>>>>
> > >>>>> Almost by definition it is meant to run heavier tasks (for example batch inference) where multiple tasks running in parallel share the same GPU, for example - because that's what we want to optimize.
> > >>>>>
> > >>>>> And yes - it provides features that K8S executor does not - gang scheduling, and sophisticated preemption logic.
> > >>>>>
> > >>>>> J.
> > >>>>>
> > >>>>> On Tue, Oct 15, 2024 at 8:40 PM Jens Scheffler <j_scheff...@gmx.de.invalid> wrote:
> > >>>>>
> > >>>>>> Hi Jarek,
> > >>>>>>
> > >>>>>> scanning but not reading the full docs I understand that YuniKorn is a specialized, more advanced K8sExecutor - all workload also runs in PODs?
> > >>>>>>
> > >>>>>> If this is the right understanding then it might be a K8sExecutor++ or could replace this... but Celery usually plays very well if you have very small and high-frequency tasks. Don't know if I mis-interpret the docs... but would it be scaling down to very small PythonOperator/@task decorated tasks with a few lines of code as well?
> > >>>>>>
> > >>>>>> Jens
> > >>>>>>
> > >>>>>> On 15.10.24 12:55, Jarek Potiuk wrote:
> > >>>>>>> Hello here,
> > >>>>>>>
> > >>>>>>> *Tl;DR; I would love to start discussion about creating (for Airflow 3.x - it does not have to be Airflow 3.0) a new community executor based on YuniKorn*
> > >>>>>>>
> > >>>>>>> You might remember my point "replacing Celery Executor" when I raised the Airflow 3 question. I never actually "meant" to replace (and remove) Celery Executor, but I was more in a quest to see if we have a viable alternative.
> > >>>>>>>
> > >>>>>>> And I think we have one with Apache YuniKorn.
> > >>>>>>> https://yunikorn.apache.org/
> > >>>>>>>
> > >>>>>>> While it is not a direct replacement (so I'd say it should be an additional executor), I think YuniKorn can provide us with a number of features that we currently cannot give to our users. From the discussions I had and the talk I saw at the Community Over Code in Denver, I believe it might make Airflow more capable, especially in the "optimization wars" context that I wrote about in https://lists.apache.org/thread/1mp6jcfvx67zd3jjt9w2hlj0c5ysbh8r
> > >>>>>>>
> > >>>>>>> It seems like quite a good fit for the "Inference" use case that we want to support for Airflow 3.
> > >>>>>>>
> > >>>>>>> At the Community Over Code I attended a talk (and had quite a nice follow-up discussion) from Apple engineers - named: "Maximizing GPU Utilization: Apache YuniKorn Preemption" and had a very long discussion with Cloudera people who have been using YuniKorn for years to optimize their workloads.
> > >>>>>>>
> > >>>>>>> The presentation is not recorded, but I will try to get slides and send them your way.
> > >>>>>>>
> > >>>>>>> I think we should take a close look at it - because it seems to save a ton of implementation effort for the Apple team running Batch inference for their multi-tenant internal environment - which I think is precisely what you want to do.
> > >>>>>>>
> > >>>>>>> YuniKorn (https://yunikorn.apache.org/) is an "app-aware" scheduler that has a number of queue / capacity management models and policies that allow controlling various applications competing for GPUs from a common pool.
> > >>>>>>>
> > >>>>>>> They mention things like:
> > >>>>>>>
> > >>>>>>> * Gang Scheduling / with gang scheduling preemption where there are workloads requiring minimum number of workers
> > >>>>>>> * Supports Latency sensitive workloads
> > >>>>>>> * Resource quota management - things like priorities of execution
> > >>>>>>> * YuniKorn preemption - with guaranteed capacity and preemption when needed - which improves the utilisation
> > >>>>>>> * Preemption that minimizes preemption cost (Pod level preemption rather than application level preemption) - very customizable preemption with opt-in/opt-out, queues, resource weights, fencing, supporting fifo/lifo sorting etc.
> > >>>>>>> * Runs in Cloud and on-premise
> > >>>>>>>
> > >>>>>>> The talk described quite a few scenarios of preemption/utilization/guaranteed resources etc. They also outlined what new features YuniKorn is working on (intra-queue preemption etc.) and what future things can be done.
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> Coincidentally - Amogh Desai with a friend submitted a talk for Airflow Summit:
> > >>>>>>>
> > >>>>>>> "A Step Towards Multi-Tenant Airflow Using Apache YuniKorn"
> > >>>>>>>
> > >>>>>>> Which did not make it to the Summit (other talk of Amogh did) - but I think back then we did not realize the potential of utilising YuniKorn to optimize workflows managed by Airflow.
> > >>>>>>>
> > >>>>>>> But we seem to have people in the community who know more about YuniKorn <> Airflow relation (Amogh :) ) and could probably comment and add some "from the trenches" experience to the discussion.
> > >>>>>>>
> > >>>>>>> Here is the description of the talk that Amogh submitted:
> > >>>>>>>
> > >>>>>>> Multi-tenant Airflow is hard and there have been novel approaches in the recent past to converge this gap. A key obstacle in multi-tenant Airflow is the management of cluster resources. This is crucial to avoid one malformed workload from hijacking an entire cluster. It is also vital to restrict users and groups from monopolizing resources in a shared cluster using their workloads.
> > >>>>>>>
> > >>>>>>> To tackle these challenges, we turn to Apache YuniKorn, a K8s scheduler catering to all kinds of workloads. We leverage YuniKorn’s hierarchical queues in conjunction with resource quotas to establish multi-tenancy at both the shared namespace level and within individual namespaces where Airflow is deployed.
> > >>>>>>>
> > >>>>>>> YuniKorn also introduces Airflow to a new dimension of preemption. Now, Airflow workers can preempt resources from lower-priority jobs, ensuring critical schedules in our data pipelines are met without compromise.
> > >>>>>>>
> > >>>>>>> Join us for a discussion on integrating Airflow with YuniKorn, unraveling solutions to these multi-tenancy challenges.
> > >>>>>>> We will also share our past experiences while scaling Airflow and the steps we have taken to handle real world production challenges in equitable multi-tenant K8s clusters.
> > >>>>>>>
> > >>>>>>> I would love to hear what you think about it. I know we are deep into Airflow 3.0 implementation - but that one can be discussed/implemented independently and maybe it's a good idea to start doing it earlier than later if we see that it has good potential.
> > >>>>>>>
> > >>>>>>> J.
> > >>>>>>>
> > >>>>>>
> > >>>>>> ---------------------------------------------------------------------
> > >>>>>> To unsubscribe, e-mail: dev-unsubscr...@airflow.apache.org
> > >>>>>> For additional commands, e-mail: dev-h...@airflow.apache.org