I would be glad to see some early stages :) J.
On Fri, Oct 25, 2024 at 3:17 PM Amogh Desai <amoghdesai....@gmail.com> wrote:

> Great to hear from you, Mani.
>
> I am interested in collaborating with you on this one.
> It seems like a promising initial demo, though I have yet to catch up on
> the specifics.
>
> Thanks & Regards,
> Amogh Desai
>
> On Tue, Oct 22, 2024 at 8:56 PM Jarek Potiuk <ja...@potiuk.com> wrote:
>
> > > Would it be possible to develop this out-of-tree for the time being?
> >
> > Oh absolutely. I definitely do not want to add "more" on the Airflow 3
> > band-wagon.
> > I am even quite sure I will not be the one implementing it, nor anyone
> > involved in Airflow 3. It's more of a conceptual discussion - and an
> > attempt to present it as an interesting idea someone could take a closer
> > look at.
> >
> > Manikandan,
> >
> > Great to hear from you, it's fantastic to hear from maintainers of other
> > projects. But for your information - we are now in the process of
> > completely rewriting part of Airflow 3, which means that we are deep down
> > and busy in various parts - and as Ash mentioned, we do not want to
> > "complicate" things by adding more **just now**.
> >
> > And I really love the idea of starting something outside in parallel.
> > There are a number of people here who are not that deeply involved in
> > Airflow 3 and they could take on the discussion - and maybe even work with
> > Manikandan directly on such an executor - somewhere on the side - just to
> > explore more of what Manikandan just started.
> >
> > I'd be curious to see the result of such work, where someone with deeper
> > Airflow understanding (but not necessarily involved in Airflow 3 work)
> > could make some self-guided experiment and look at what could be achieved
> > even by developing an executor POC that would work for Airflow 2?
> >
> > Maybe someone?
> >
> > J.
> >
> > On Tue, Oct 22, 2024 at 1:22 PM Ash Berlin-Taylor <a...@apache.org> wrote:
> >
> > > This looks like it has some really cool features.
> > >
> > > > *Tl;DR; I would love to start discussion about creating (for Airflow
> > > > 3.x - it does not have to be Airflow 3.0) a new community executor
> > > > based on YuniKorn*
> > >
> > > To me this caveat is the main point, as long as it’s not in 3.0
> > > (and ideally for me not even in repo for the next few months) for two
> > > reasons:
> > >
> > > 1. AIP 72 is going to change the Executor interface somewhat, and we
> > > don’t know the exact details of how yet, so not having to worry about
> > > another executor to fix up and keep working would be good, so as to not
> > > slow down development of 3.0; and
> > > 2. I’m slightly nervous about the extra support load of a new executor
> > > at this time. It’s probably not all that much on the Airflow side of
> > > things, but this is just an unknown risk to me right now.
> > >
> > > Would it be possible to develop this out-of-tree for the time being?
> > >
> > > Thanks,
> > > Ash
> > >
> > > > On 18 Oct 2024, at 08:41, Shubham Raj <shubhamraj....@gmail.com> wrote:
> > > >
> > > > Hi Jarek, Amogh, and everyone,
> > > >
> > > > I wanted to share my thoughts on the proposal to integrate YuniKorn,
> > > > and I'm definitely on board with it! As others mentioned, adding
> > > > YuniKorn as another executor could really enhance our scheduling
> > > > capabilities, especially for the more complex scenarios that Celery and
> > > > Kubernetes executors struggle with.
> > > >
> > > > One of the standout features of YuniKorn is its hierarchical queueing
> > > > and resource quota management, which is fantastic for handling
> > > > multi-tenant environments. This will help us keep resource-heavy Airflow
> > > > tasks from bogging down shared clusters and ensure that resources are
> > > > allocated fairly across different services.
> > > > Now, regarding gang scheduling, as per my understanding, I think it’s
> > > > interesting to note that Airflow operates on a sequential model because
> > > > of its DAG structure: tasks must wait for their dependencies to finish
> > > > before they can run. This might seem at odds with the idea of gang
> > > > scheduling, but there are definitely scenarios where it could be useful.
> > > > For instance, if we have several independent data processing tasks that
> > > > need to share resources, gang scheduling could help us optimize resource
> > > > use and reduce latency by allowing those tasks to be scheduled at the
> > > > same time.
> > > >
> > > > Overall, I believe that integrating these YuniKorn features could
> > > > really boost Airflow’s capabilities, especially for complex workflows or
> > > > at least in resource-constrained environments. Looking forward to
> > > > hearing everyone’s thoughts!
> > > >
> > > > Thanks & Regards,
> > > > Shubham
> > > >
> > > > On Fri, Oct 18, 2024 at 10:19 AM Amogh Desai <amoghdesai....@gmail.com> wrote:
> > > >
> > > >> Hi Jarek, Everyone,
> > > >>
> > > >> Thanks for starting this discussion!
> > > >> I agree with everyone so far that this will be more of an additional
> > > >> executor rather than a replacement for anything we currently have.
> > > >>
> > > >> I had submitted a talk that was mainly trying to explain how we can
> > > >> leverage some features of YuniKorn such as priority scheduling,
> > > >> multi-tenancy (per deployment in terms of resources) and preemption.
> > > >> Not all of these features are fully implemented / integrated yet, but I
> > > >> had planned to explore them and share my findings if my session got
> > > >> selected. I was trying to explore mainly around integration with
> > > >> hierarchical queues and resource quotas.
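
To make the hierarchical queues and resource quotas mentioned above a bit more concrete, here is a minimal Python sketch of a quota check that walks up a queue tree. It is only an illustration of the idea, not YuniKorn's implementation or configuration format: the `Queue` class, the `try_allocate` method, the queue names and the CPU numbers are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterator, Optional


@dataclass
class Queue:
    """One node in a hypothetical queue tree, e.g. root -> root.airflow."""

    name: str
    max_cpu: int  # quota of this queue, in CPU cores (illustrative unit)
    parent: Optional["Queue"] = None
    used_cpu: int = 0

    def chain(self) -> Iterator["Queue"]:
        """Yield this queue and then every ancestor up to the root."""
        node: Optional[Queue] = self
        while node is not None:
            yield node
            node = node.parent

    def try_allocate(self, cpu: int) -> bool:
        """Admit a request only if every queue on the path stays within quota."""
        path = list(self.chain())
        if any(q.used_cpu + cpu > q.max_cpu for q in path):
            return False
        for q in path:
            q.used_cpu += cpu
        return True


# Hypothetical layout: Airflow and Spark share one cluster-wide root quota.
root = Queue("root", max_cpu=100)
airflow = Queue("root.airflow", max_cpu=40, parent=root)
spark = Queue("root.spark", max_cpu=80, parent=root)
```

With this layout a runaway DAG can use at most 40 cores through `airflow.try_allocate(...)`, while the root quota still caps the combined usage of both tenants.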
> > > >>
> > > >> To set a tone, we already have some examples running in our cluster
> > > >> deployments. We use Airflow in Kubernetes with the K8sExecutor, where we
> > > >> share space to run Airflow jobs and other data engineering workloads.
> > > >>
> > > >> Via the integration with YuniKorn, we are able to achieve a few things:
> > > >> 1. Priority Scheduling
> > > >> We’ve set priorities for different services running in our cluster. For
> > > >> example, let's say both Airflow jobs and Spark jobs run in a cluster. We
> > > >> prioritize Spark Drivers equally with Airflow workers, which ensures
> > > >> that Airflow workers get higher priority than Spark Executors. This way,
> > > >> Airflow schedules won’t be missed, and it doesn’t negatively impact
> > > >> Spark jobs because they can still run with fewer executors.
> > > >>
> > > >> 2. Resource Quotas
> > > >> We also link Airflow namespaces (where the workers and the core
> > > >> services run) with resource quotas to prevent a malformed or
> > > >> resource-heavy Airflow task from taking over the entire K8s cluster with
> > > >> a faulty DAG. This is important since we have both Airflow and other
> > > >> data engineering workloads running together.
> > > >>
> > > >> I had a chat with some folks from the YuniKorn team and apart from
> > > >> this, I think a few other features of YuniKorn such as gang scheduling,
> > > >> preemption, etc. could be beneficial to Airflow:
> > > >> 1. Gang Scheduling
> > > >> Airflow DAGs generally have a pattern where tasks are dependent on each
> > > >> other - so let's say task1 -> task2 -> task3 ...
> > > >> So even though there are so many tasks, there's just one DAG process.
> > > >> So this could benefit from gang scheduling if the whole task set can be
> > > >> considered as a single app.
> > > >> For those of you who aren't too familiar with gang scheduling, it can
> > > >> be thought of as waiting for all your friends to join you for a game
> > > >> rather than waiting for them one by one (easiest example I could think
> > > >> of).
> > > >>
> > > >> 2. Preemption
> > > >> We can think of different angles to preemption based on the use cases,
> > > >> like preempting the entire app instead of using per-request preemption,
> > > >> OR not preempting a task if it has a dependent task, because preemption
> > > >> is expensive.
> > > >>
> > > >> Overall, I believe the community would benefit from this integration,
> > > >> and I think the YuniKorn team will support it as well.
> > > >>
> > > >> Thanks & Regards,
> > > >> Amogh Desai
> > > >>
> > > >> On Thu, Oct 17, 2024 at 11:06 PM Jarek Potiuk <ja...@potiuk.com> wrote:
> > > >>
> > > >>>> As Jens said "K8sExecutor++".
> > > >>>> Just to be precise, I don't believe that this can be a replacement
> > > >>>> for Celery Executor (at least at first glance).
> > > >>>
> > > >>> Yes. Fully agree. My bad framing from the initial message.
> > > >>>
> > > >>>> I also believe that for this to be effective, this will need some
> > > >>>> dedicated work including additional information about the task.
> > > >>>
> > > >>> Oh absolutely. For me it's more of a (when we agree it's a good
> > > >>> direction) - let's keep it as something that **might** eventually
> > > >>> happen and not in 3.0. This is really "if we hear more cases that it
> > > >>> might solve, let's see if we need any changes in current Airflow 3 work
> > > >>> to enable it or make it easier." kinda thing. More like making a mental
> > > >>> space for this to happen when we are discussing other things. Last
> > > >>> thing I want to do is to add more substantial work for our 3.0 efforts.
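
The "wait for all your friends" analogy for gang scheduling used earlier in this message can be sketched as all-or-nothing admission. This is a simplified illustration under stated assumptions, not how YuniKorn implements it (no placeholders, timeouts or queues are modelled): the `Gang` class, the `schedule_gangs` function and the CPU-only resource model are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Gang:
    """A group of tasks that only makes sense to start together."""

    name: str
    min_members: int      # number of members that must start at the same time
    cpu_per_member: int   # CPU cores each member needs (illustrative unit)


def schedule_gangs(gangs: List[Gang], free_cpu: int) -> Tuple[List[str], List[str], int]:
    """All-or-nothing admission: a gang starts only if every member fits.

    Returns (started gang names, waiting gang names, CPU still free).
    """
    started: List[str] = []
    waiting: List[str] = []
    for gang in gangs:
        needed = gang.min_members * gang.cpu_per_member
        if needed <= free_cpu:
            free_cpu -= needed          # reserve capacity for the whole gang
            started.append(gang.name)
        else:
            waiting.append(gang.name)   # never start a partial gang
    return started, waiting, free_cpu
```

The point of the design is that a gang which cannot fully fit holds no resources at all, so it cannot block other work while waiting for its remaining members.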
> > > >>>
> > > >>>> I am very curious for Amogh to chime in on this :)
> > > >>>
> > > >>> Knowing that there was a talk in-preparation, me too :D
> > > >>>
> > > >>>> The biggest decision is whether this is a community managed executor
> > > >>>> or if we can find stakeholders to create this outside of Airflow
> > > >>>> (those stakeholders could be some of us from the community).
> > > >>>
> > > >>> That's an excellent point Niko. Yes. It could be done outside. It
> > > >>> could be done by YuniKorn people (unlikely - they likely have more work
> > > >>> than they can handle) or one of the stakeholders (at least initially) -
> > > >>> and published and released and battle-tested by them and eventually
> > > >>> contributed to the community. This is I think a very good pattern for
> > > >>> Open Source, where commercial users might reap the benefits of their
> > > >>> investment as "first movers" while paying the price for "teething
> > > >>> problems" - but then later contributing back to the community. A
> > > >>> company starting with C and ending with a comes to my mind immediately
> > > >>> as an obvious candidate if you ask me.
> > > >>>
> > > >>> J.
> > > >>>
> > > >>> On Thu, Oct 17, 2024 at 7:19 PM Oliveira, Niko <oniko...@amazon.com.invalid> wrote:
> > > >>>
> > > >>>> I love the idea. Generally it is quite easy now to add new executors
> > > >>>> and there is no harm in having more options. I don't think we need to
> > > >>>> justify it as a replacement of anything honestly.
> > > >>>>
> > > >>>> The biggest decision is whether this is a community managed executor
> > > >>>> or if we can find stakeholders to create this outside of Airflow
> > > >>>> (those stakeholders could be some of us from the community).
> > > >>>>
> > > >>>> Cheers,
> > > >>>> Niko
> > > >>>>
> > > >>>> ________________________________
> > > >>>> From: Vikram Koka <vik...@astronomer.io.INVALID>
> > > >>>> Sent: Wednesday, October 16, 2024 4:13:27 PM
> > > >>>> To: dev@airflow.apache.org
> > > >>>> Subject: RE: [EXT] [DISCUSS] Create community "Apache YuniKorn" executor ?
> > > >>>>
> > > >>>> I am supportive of this in the long term (i.e. post-3.0) as an
> > > >>>> additional Executor similar to the Kubernetes Executor.
> > > >>>> As Jens said "K8sExecutor++".
> > > >>>>
> > > >>>> Just to be precise, I don't believe that this can be a replacement
> > > >>>> for Celery Executor (at least at first glance).
> > > >>>>
> > > >>>> I also believe that for this to be effective, this will need some
> > > >>>> dedicated work including additional information about the task.
> > > >>>> I am very curious for Amogh to chime in on this :)
> > > >>>>
> > > >>>> On Tue, Oct 15, 2024 at 1:58 PM Jarek Potiuk <ja...@potiuk.com> wrote:
> > > >>>>
> > > >>>>> Yeah - it was a bit of dramatisation when I recalled the Celery
> > > >>>>> "replacement" ;) . And yes it's not really "alternative" to Celery,
> > > >>>>> Celery is there to stay for short tasks.
> > > >>>>>
> > > >>>>> Almost by definition it is meant to run heavier tasks (for example
> > > >>>>> batch inference) where multiple tasks running in parallel share the
> > > >>>>> same GPU, for example - because that's what we want to optimize.
> > > >>>>>
> > > >>>>> And yes - it provides features that the K8S executor does not - gang
> > > >>>>> scheduling and sophisticated preemption logic.
> > > >>>>>
> > > >>>>> J.
> > > >>>>>
> > > >>>>> On Tue, Oct 15, 2024 at 8:40 PM Jens Scheffler <j_scheff...@gmx.de.invalid> wrote:
> > > >>>>>
> > > >>>>>> Hi Jarek,
> > > >>>>>>
> > > >>>>>> scanning but not reading the full docs I understand that YuniKorn is
> > > >>>>>> a specialized, more advanced K8sExecutor - all workload also runs in
> > > >>>>>> PODs?
> > > >>>>>>
> > > >>>>>> If this is the right understanding then it might be a K8sExecutor++
> > > >>>>>> or could replace this... but Celery usually plays very well if you
> > > >>>>>> have very small and high-frequency tasks. Don't know if I
> > > >>>>>> misinterpret the docs... but would it scale down to very small
> > > >>>>>> PythonOperator/@task decorated tasks with a few lines of code as well?
> > > >>>>>>
> > > >>>>>> Jens
> > > >>>>>>
> > > >>>>>> On 15.10.24 12:55, Jarek Potiuk wrote:
> > > >>>>>>> Hello here,
> > > >>>>>>>
> > > >>>>>>> *Tl;DR; I would love to start discussion about creating (for
> > > >>>>>>> Airflow 3.x - it does not have to be Airflow 3.0) a new community
> > > >>>>>>> executor based on YuniKorn*
> > > >>>>>>>
> > > >>>>>>> You might remember my point "replacing Celery Executor" when I
> > > >>>>>>> raised the Airflow 3 question.
> > > >>>>>>> I never actually "meant" to replace (and remove) Celery Executor,
> > > >>>>>>> but I was more on a quest to see if we have a viable alternative.
> > > >>>>>>>
> > > >>>>>>> And I think we have one with Apache YuniKorn.
> > > >>>>>>> https://yunikorn.apache.org/
> > > >>>>>>>
> > > >>>>>>> While it is not a direct replacement (so I'd say it should be an
> > > >>>>>>> additional executor), I think YuniKorn can provide us with a number
> > > >>>>>>> of features that we currently cannot give to our users, and from the
> > > >>>>>>> discussions I had and the talk I saw at the Community Over Code in
> > > >>>>>>> Denver, I believe it might also make Airflow more capable,
> > > >>>>>>> especially in the "optimization wars" context that I wrote about in
> > > >>>>>>> https://lists.apache.org/thread/1mp6jcfvx67zd3jjt9w2hlj0c5ysbh8r
> > > >>>>>>>
> > > >>>>>>> It seems like quite a good fit for the "Inference" use case that we
> > > >>>>>>> want to support for Airflow 3.
> > > >>>>>>>
> > > >>>>>>> At the Community Over Code I attended a talk (and had quite a nice
> > > >>>>>>> follow-up discussion) from Apple engineers - named: "Maximizing GPU
> > > >>>>>>> Utilization: Apache YuniKorn Preemption" and had a very long
> > > >>>>>>> discussion with Cloudera people who have been using YuniKorn for
> > > >>>>>>> years to optimize their workloads.
> > > >>>>>>>
> > > >>>>>>> The presentation is not recorded, but I will try to get the slides
> > > >>>>>>> and send them your way.
> > > >>>>>>>
> > > >>>>>>> I think we should take a close look at it - because it seems to
> > > >>>>>>> save a ton of implementation effort for the Apple team running Batch
> > > >>>>>>> inference for their multi-tenant internal environment - which I
> > > >>>>>>> think is precisely what you want to do.
> > > >>>>>>>
> > > >>>>>>> YuniKorn (https://yunikorn.apache.org/) is an "app-aware" scheduler
> > > >>>>>>> that has a number of queue / capacity management models and policies
> > > >>>>>>> that allow controlling various applications competing for GPUs from
> > > >>>>>>> a common pool.
> > > >>>>>>>
> > > >>>>>>> They mention things like:
> > > >>>>>>>
> > > >>>>>>> * Gang Scheduling / with gang scheduling preemption where there are
> > > >>>>>>> workloads requiring a minimum number of workers
> > > >>>>>>> * Supports latency-sensitive workloads
> > > >>>>>>> * Resource quota management - things like priorities of execution
> > > >>>>>>> * YuniKorn preemption - with guaranteed capacity and preemption when
> > > >>>>>>> needed - which improves the utilisation
> > > >>>>>>> * Preemption that minimizes preemption cost (Pod level preemption
> > > >>>>>>> rather than application level preemption) - very customizable
> > > >>>>>>> preemption with opt-in/opt-out, queues, resource weights, fencing,
> > > >>>>>>> supporting fifo/lifo sorting etc.
> > > >>>>>>> * Runs in Cloud and on-premise
> > > >>>>>>>
> > > >>>>>>> The talk described quite a few scenarios of preemption/utilization/
> > > >>>>>>> guaranteed resources etc. They also outlined which new features
> > > >>>>>>> YuniKorn is working on (intra-queue preemption etc.) and what could
> > > >>>>>>> be done in the future.
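
As a rough illustration of the "Pod level preemption that minimizes preemption cost" bullet above, the Python sketch below picks which running pods to evict so that a higher-priority pod can start. It is an assumption-laden toy, not YuniKorn's algorithm: the `Pod` class, the `pick_victims` function, the priority values and the rule "lowest priority first, then largest pod first" are all hypothetical choices made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Pod:
    name: str
    priority: int  # higher number means more important
    cpu: int       # CPU cores held by the pod (illustrative unit)


def pick_victims(running: List[Pod], incoming_priority: int, cpu_needed: int) -> List[Pod]:
    """Choose pods to evict for an incoming pod, or [] if that is impossible.

    Only pods with strictly lower priority are candidates. Candidates are
    tried lowest priority first and, within one priority, largest first,
    so that fewer pods have to be evicted.
    """
    candidates = sorted(
        (p for p in running if p.priority < incoming_priority),
        key=lambda p: (p.priority, -p.cpu),
    )
    victims: List[Pod] = []
    freed = 0
    for pod in candidates:
        if freed >= cpu_needed:
            break
        victims.append(pod)
        freed += pod.cpu
    # Evict nothing unless the eviction actually frees enough capacity.
    return victims if freed >= cpu_needed else []
```

With a Spark driver at the same priority as an incoming Airflow worker and Spark executors at a lower one, only executors are ever returned as victims, which mirrors the priority setup described elsewhere in this thread.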
> > > >>>>>>>
> > > >>>>>>> Coincidentally - Amogh Desai with a friend submitted a talk for
> > > >>>>>>> Airflow Summit:
> > > >>>>>>>
> > > >>>>>>> "A Step Towards Multi-Tenant Airflow Using Apache YuniKorn"
> > > >>>>>>>
> > > >>>>>>> It did not make it to the Summit (another talk of Amogh's did) -
> > > >>>>>>> but I think back then we had not realized the potential of utilising
> > > >>>>>>> YuniKorn to optimize workflows managed by Airflow.
> > > >>>>>>>
> > > >>>>>>> But we seem to have people in the community who know more about the
> > > >>>>>>> YuniKorn <> Airflow relation (Amogh :) ) and could probably comment
> > > >>>>>>> and add some "from the trenches" experience to the discussion.
> > > >>>>>>>
> > > >>>>>>> Here is the description of the talk that Amogh submitted:
> > > >>>>>>>
> > > >>>>>>> Multi-tenant Airflow is hard and there have been novel approaches in
> > > >>>>>>> the recent past to converge this gap. A key obstacle in multi-tenant
> > > >>>>>>> Airflow is the management of cluster resources. This is crucial to
> > > >>>>>>> avoid one malformed workload from hijacking an entire cluster. It is
> > > >>>>>>> also vital to restrict users and groups from monopolizing resources
> > > >>>>>>> in a shared cluster using their workloads.
> > > >>>>>>>
> > > >>>>>>> To tackle these challenges, we turn to Apache YuniKorn, a K8s
> > > >>>>>>> scheduler catering all kinds of workloads. We leverage YuniKorn’s
> > > >>>>>>> hierarchical queues in conjunction with resource quotas to establish
> > > >>>>>>> multi-tenancy at both the shared namespace level and within
> > > >>>>>>> individual namespaces where Airflow is deployed.
> > > >>>>>>>
> > > >>>>>>> YuniKorn also introduces Airflow to a new dimension of preemption.
> > > >>>>>>> Now, Airflow workers can preempt resources from lower-priority jobs,
> > > >>>>>>> ensuring critical schedules in our data pipelines are met without
> > > >>>>>>> compromise.
> > > >>>>>>>
> > > >>>>>>> Join us for a discussion on integrating Airflow with YuniKorn,
> > > >>>>>>> unraveling solutions to these multi-tenancy challenges. We will also
> > > >>>>>>> share our past experiences while scaling Airflow and the steps we
> > > >>>>>>> have taken to handle real world production challenges in equitable
> > > >>>>>>> multi-tenant K8s clusters.
> > > >>>>>>>
> > > >>>>>>> I would love to hear what you think about it. I know we are deep
> > > >>>>>>> into Airflow 3.0 implementation - but that one can be
> > > >>>>>>> discussed/implemented independently and maybe it's a good idea to
> > > >>>>>>> start doing it sooner rather than later if we see that it has good
> > > >>>>>>> potential.
> > > >>>>>>>
> > > >>>>>>> J.
> > > >>>>>>>
> > > >>>>>>
> > > >>>>>> ---------------------------------------------------------------------
> > > >>>>>> To unsubscribe, e-mail: dev-unsubscr...@airflow.apache.org
> > > >>>>>> For additional commands, e-mail: dev-h...@airflow.apache.org