If it's developed outside of the main tree - and generally outside of the community repos - it does not have to have an AIP.
You could do it in your personal account or with your employer etc. - and promote it in the usual ways (https://airflow.apache.org/ecosystem/) - and then, when we see it working, we can discuss adopting it into the community (if you are good with it). At that point we can think about whether it needs an AIP, or can be adopted "as is", or maybe adopted after some initial discussion where we agree on any changes needed. This is also a faster way to get things up and running and to experiment, because you can - single-handedly, or with Manikandan or whoever else is willing to help - make faster decisions and take whatever shortcuts you want. Only at the stage of "hey, this is worthwhile to be adopted by the community and we wish to pay the price of maintaining it here" would any AIP / discussions / voting etc. be engaged.

It's the same as, for example, Cosmos - which was (and still is) solely developed, managed, released, promoted etc. by Astronomer. We had some thinking and chats (but not even a public discussion) about whether we would like to have it in the community. That would require a) Astronomer deciding to propose donating it at some point in time, and b) a community decision to accept the donation (or some modified variant of it). So far none of those questions have even been asked formally :) - and I guess it's not the right time to ask them yet - but currently it's developed solely by Astronomer, outside of the community, and they can do whatever they want.

You can still, of course, consider asking for help and getting some feedback (I, for one, am super happy to provide it) - but this is purely up to you and whoever will be "owning" the code :)

J.

On Wed, Oct 30, 2024 at 11:13 AM Amogh Desai <amoghdesai....@gmail.com> wrote:

> Right!
>
> Let me try and analyse the impact here and try to come up with a plan on how we can expand on this area. As Ash mentioned earlier, it doesn't have to be a committed item, but this is something that might call for an AIP(?) and can be worked on outside the main tree?
>
> Thanks & Regards,
> Amogh Desai
>
> On Tue, Oct 29, 2024 at 7:36 PM Jarek Potiuk <ja...@potiuk.com> wrote:
>
> > This all looks really good - it sounds like something that we could do in the K8S executor only, and likely even make it compatible with Airflow 2 and release independently.
> >
> > On Tue, Oct 29, 2024 at 1:30 PM Amogh Desai <amoghdesai....@gmail.com> wrote:
> >
> > > > As I understand it - if I read it correctly - it's mostly a deployment issue. We don't even have to have a YuniKorn Executor - we can use the K8S Executor and it will work out of the box, with scheduling controlled by YuniKorn, but then we need to find a way to configure the behaviour of tasks and DAGs (likely via pod annotations, maybe?). That would mean that it's mostly documentation on "How can I leverage YuniKorn with Airflow", plus maybe a helm chart modification to install YuniKorn as an option?
> > > >
> > > > And then we likely need to add a little bit of metadata and some mapping of "task" or "dag" or "task group" properties to open up more capabilities of YuniKorn scheduling?
> > > >
> > > > Do I understand correctly?
> > >
> > > You mostly summed it up, but a few things.
> > >
> > > Yes, we can open up Yunikorn to schedule Airflow workloads by doing basically nothing, or at most very little manual work.
> > >
> > > But to really enable Yunikorn at full power, we will have to make some changes to the Airflow codebase.
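The label-injection idea under discussion can be sketched concretely. A hedged illustration: the `applicationId` and `queue` pod labels below are the scheduling hints YuniKorn's k8shim recognizes, but the helper names and the idea of wiring them into worker pods (e.g. through the KubernetesExecutor's pod override mechanism) are assumptions of this sketch, not an existing Airflow API. The pod spec is modeled as a plain dict standing in for a `V1Pod`.

```python
# Hypothetical sketch, not an existing Airflow feature: build the pod labels
# that YuniKorn's shim recognizes ("applicationId" groups pods into one
# application, "queue" picks the target queue), then merge them into a
# worker pod spec (a plain dict here, standing in for a V1Pod override).

def yunikorn_labels(dag_id: str, run_id: str, queue: str) -> dict:
    """Scheduling hints for one DAG run, treated as one YuniKorn application."""
    return {
        "applicationId": f"airflow-{dag_id}-{run_id}",
        "queue": queue,
    }

def inject_labels(pod_spec: dict, labels: dict) -> dict:
    """Return a copy of the pod spec with the scheduling labels merged in."""
    merged = dict(pod_spec)
    metadata = dict(merged.get("metadata", {}))
    metadata["labels"] = {**metadata.get("labels", {}), **labels}
    merged["metadata"] = metadata
    return merged

if __name__ == "__main__":
    worker = {"metadata": {"name": "worker-1", "labels": {"component": "worker"}}}
    print(inject_labels(worker, yunikorn_labels("example_etl", "manual_01", "root.sandbox")))
```

An executor-side change (or a policy in the admission controller) would be the place to apply something like this to every worker pod, which is roughly the "codebase changes" being discussed here.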
> > > A few things off the top of my head:
> > >
> > > The admission controller will take care of the applicationId, scheduler name etc., but from an initial read, if we want things like "schedule DAGs to a certain queue only" or something of that sort, we will need some labels to be injected - or, even a level above, get the KPO to add some labels, like a queue. OR, if we could specify the queue for every operator by extending the BaseOperator, that would be cool too.
> > >
> > > I personally think that if we could extend the KubernetesExecutor into a YunikornExecutor (naming doesn't matter to me), we could handle things like installing Yunikorn along with Airflow by making changes to the helm charts - make it come up with the scheduler, admission controller, etc. We would be able to make code changes in Airflow by controlling the internal logic with the executor type, instead of leaving it all to the end user (I mean options like the label injection, or labelling all the tasks of a group as an application, to adhere to Jarek's thought).
> > >
> > > Manikandan, feel free to add anything more from the Yunikorn side in case I have misinterpreted or just generally missed something :)
> > >
> > > Thanks & Regards,
> > > Amogh Desai
> > >
> > > On Tue, Oct 29, 2024 at 1:28 PM Jarek Potiuk <ja...@potiuk.com> wrote:
> > >
> > > > This is cool.
> > > >
> > > > As I understand it - if I read it correctly - it's mostly a deployment issue. We don't even have to have a YuniKorn Executor - we can use the K8S Executor and it will work out of the box, with scheduling controlled by YuniKorn, but then we need to find a way to configure the behaviour of tasks and DAGs (likely via pod annotations, maybe?). That would mean that it's mostly documentation on "How can I leverage YuniKorn with Airflow", plus maybe a helm chart modification to install YuniKorn as an option?
> > > >
> > > > And then we likely need to add a little bit of metadata and some mapping of "task" or "dag" or "task group" properties to open up more capabilities of YuniKorn scheduling?
> > > >
> > > > Do I understand correctly?
> > > >
> > > > > 1. Yunikorn treats applications at the DAG level, not at the task level, which is great. Due to this, we can try to leverage the gang scheduling abilities of Yunikorn.
> > > >
> > > > This is great. I was wondering if we could also allow the application at the "Task Group" level. I find it a really interesting feature to be able to treat a "Task Group" as an entity we could handle as an "application" - this way you could treat the "Task Group" as a "schedulable entity" and, for example, set preemption properties for all tasks in the same task group. Or gang scheduling for the task group ("only schedule tasks in the task group when there are enough resources for the whole task group"). Or - and this is something that I think of as a "holy grail" of scheduling in the context of optimising machine learning workflows: "make sure that all the tasks in a group are scheduled on the same node and use the same local hardware resources", plus, if any of them fail, retry the whole group - also on the same instance. (I think this is partially possible with some node affinity setup - but I would love it if we were able to set a property on a task group effectively meaning "execute all tasks in the group on the same hardware" - so a slightly higher abstraction - and have YuniKorn handle all the preemption and scheduling optimisations for that.)
> > > >
> > > > > 2. With the admission controller running, even the older DAGs will be able to benefit from the Yunikorn scheduling abilities without the need to make changes to the DAGs. This means that the same DAG will run with the default scheduler (K8s default) as well as Yunikorn if need be!
> > > >
> > > > Fantastic!
> > > >
> > > > > 3. As Mani mentioned, preemption capabilities can be explored due to this as well.
> > > > >
> > > > > I am happy to work on this effort and looking forward to it.
> > > >
> > > > Yeah, that would be cool - also see above. I think if we are able to have some "light touch" integration with Yunikorn, where we could handle a "Task Group" as a schedulable entity, plus have some higher-level abstractions / properties on it that map onto some "scheduling behaviour" (preemption / gang scheduling), and document it, that would be a great and easy way of expanding Airflow's capabilities - especially for ML workflows.
> > > >
> > > > J.
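The "task group as a schedulable entity" idea could, at the pod level, lean on YuniKorn's gang-scheduling annotations. A rough sketch: the annotation keys follow YuniKorn's documented gang-scheduling interface (`yunikorn.apache.org/task-groups` declares the gang, `yunikorn.apache.org/task-group-name` tags each member pod), while the mapping from an Airflow task group to a gang is purely hypothetical - nothing in Airflow produces it today.

```python
import json

# Hypothetical mapping of an Airflow task group onto a YuniKorn "gang":
# the gang definition says "hold off scheduling until minMember pods, each
# needing minResource, can all fit" - i.e. "only schedule tasks in the task
# group when there are enough resources for the whole task group".

def gang_annotations(group_id: str, min_member: int, cpu: str, memory: str) -> dict:
    """Pod annotations declaring one task group as a YuniKorn gang."""
    task_groups = [{
        "name": group_id,
        "minMember": min_member,                      # gang size: all tasks in the group
        "minResource": {"cpu": cpu, "memory": memory},  # per-member resources
    }]
    return {
        "yunikorn.apache.org/task-group-name": group_id,
        "yunikorn.apache.org/task-groups": json.dumps(task_groups),
    }

if __name__ == "__main__":
    # A 3-task group that should only start once 3 CPUs / 6Gi are reservable in total.
    print(gang_annotations("feature_engineering", 3, "1", "2Gi"))
```

The "same node / same hardware" part of the idea would still need node affinity or similar on top; these annotations only cover the "all or nothing" admission side.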
> > > > On Tue, Oct 29, 2024 at 8:10 AM Amogh Desai <amoghdesai....@gmail.com> wrote:
> > > >
> > > > > Building upon the POC done by Manikandan, I tried my hand at an experiment too. I mainly wanted to experiment with the Yunikorn admission controller, with the aim of making no changes to my older DAGs.
> > > > >
> > > > > Deployed a setup that looks like this:
> > > > >
> > > > > - Deployed Yunikorn in a kind cluster with the default configurations. The default configuration launches the Yunikorn scheduler as well as an admission controller, which watches for a `yunikorn-configs` configmap that can define queues, partitions, placement rules etc.
> > > > > - Deployed Airflow using helm charts in the same kind cluster, while specifying the executor as KubernetesExecutor.
> > > > >
> > > > > Wanted to test whether Yunikorn can take over the scheduling of Airflow workers. Created some queues using the config present here:
> > > > > https://github.com/apache/yunikorn-k8shim/blob/master/deployments/examples/namespace/queues.yaml
> > > > >
> > > > > Tried running the Airflow K8s executor dag <https://github.com/apache/airflow/blob/main/airflow/example_dags/example_kubernetes_executor.py> without any changes to the DAG. I was able to run the DAG successfully.
> > > > >
> > > > > Results
> > > > >
> > > > > 1. The task pods get scheduled by Yunikorn instead of the default K8s scheduler.
> > > > > 2. I was able to observe a single application run for the Airflow DAG in the Yunikorn UI.
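The queue setup referenced above follows YuniKorn's hierarchical partition/queue layout. As a rough, hand-written illustration of the shape of such a config (the queue names and quota values below are invented for illustration, not taken from the linked `queues.yaml`):

```python
# Illustration only: the nested structure that YuniKorn's `yunikorn-configs`
# configmap expresses in YAML, built here as plain Python data. Each queue
# can carry "guaranteed" (preemption floor) and "max" (hard ceiling) quotas.

def queue(name: str, guaranteed_vcore: str, max_vcore: str) -> dict:
    """One leaf queue with invented resource quotas."""
    return {
        "name": name,
        "resources": {
            "guaranteed": {"vcore": guaranteed_vcore},
            "max": {"vcore": max_vcore},
        },
    }

queues_config = {
    "partitions": [{
        "name": "default",
        "queues": [{
            "name": "root",
            "queues": [
                queue("airflow-workers", "4", "8"),
                queue("batch-inference", "8", "16"),
            ],
        }],
    }],
}

if __name__ == "__main__":
    leaves = queues_config["partitions"][0]["queues"][0]["queues"]
    print([q["name"] for q in leaves])
```

In a real deployment this hierarchy would be serialized to YAML and placed in the `yunikorn-configs` configmap that the admission controller watches.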
> > > > > Observations
> > > > >
> > > > > 1. Yunikorn treats applications at the DAG level, not at the task level, which is great. Due to this, we can try to leverage the gang scheduling abilities of Yunikorn.
> > > > > 2. With the admission controller running, even the older DAGs will be able to benefit from the Yunikorn scheduling abilities without the need to make changes to the DAGs. This means that the same DAG will run with the default scheduler (K8s default) as well as Yunikorn if need be!
> > > > > 3. As Mani mentioned, preemption capabilities can be explored due to this as well.
> > > > >
> > > > > I am happy to work on this effort and looking forward to it.
> > > > >
> > > > > Thanks & Regards,
> > > > > Amogh Desai
> > > > >
> > > > > On Tue, Oct 15, 2024 at 4:26 PM Jarek Potiuk <ja...@potiuk.com> wrote:
> > > > >
> > > > > > Hello here,
> > > > > >
> > > > > > *Tl;DR: I would love to start a discussion about creating (for Airflow 3.x - it does not have to be Airflow 3.0) a new community executor based on YuniKorn.*
> > > > > >
> > > > > > You might remember my point about "replacing the Celery Executor" when I raised the Airflow 3 question. I never actually "meant" to replace (and remove) the Celery Executor; I was more on a quest to see whether we have a viable alternative.
> > > > > >
> > > > > > And I think we have one with Apache YuniKorn: https://yunikorn.apache.org/
> > > > > >
> > > > > > While it is not a direct replacement (so I'd say it should be an additional executor), I think YuniKorn can provide us with a number of features that we currently cannot give to our users, and from the discussions I had and the talk I saw at Community Over Code in Denver, I believe it might be something that makes Airflow more capable, especially in the "optimization wars" context that I wrote about in https://lists.apache.org/thread/1mp6jcfvx67zd3jjt9w2hlj0c5ysbh8r
> > > > > >
> > > > > > It seems like quite a good fit for the "inference" use case that we want to support for Airflow 3.
> > > > > >
> > > > > > At Community Over Code I attended a talk from Apple engineers (and had quite a nice follow-up discussion) named "Maximizing GPU Utilization: Apache YuniKorn Preemption", and I had a very long discussion with Cloudera people who have been using YuniKorn for years to optimize their workloads.
> > > > > >
> > > > > > The presentation was not recorded, but I will try to get the slides and send them your way.
> > > > > >
> > > > > > I think we should take a close look at it - because it seems to have saved a ton of implementation effort for the Apple team running batch inference in their multi-tenant internal environment - which I think is precisely what you want to do.
> > > > > >
> > > > > > YuniKorn (https://yunikorn.apache.org/) is an "app-aware" scheduler that has a number of queue / capacity management models and policies that allow controlling various applications competing for GPUs from a common pool.
> > > > > > They mention things like:
> > > > > >
> > > > > > * Gang scheduling (with gang scheduling preemption) where workloads require a minimum number of workers
> > > > > > * Support for latency-sensitive workloads
> > > > > > * Resource quota management - things like priorities of execution
> > > > > > * YuniKorn preemption - with guaranteed capacity and preemption when needed - which improves utilisation
> > > > > > * Preemption that minimizes preemption cost (pod-level preemption rather than application-level preemption) - very customizable preemption with opt-in/opt-out, queues, resource weights, fencing, FIFO/LIFO sorting etc.
> > > > > > * Runs in the cloud and on-premise
> > > > > >
> > > > > > The talk described quite a few scenarios of preemption / utilization / guaranteed resources etc. They also outlined what new features YuniKorn is working on (intra-queue preemption etc.) and what can be done in the future.
> > > > > >
> > > > > > Coincidentally, Amogh Desai and a friend submitted a talk for the Airflow Summit:
> > > > > >
> > > > > > "A Step Towards Multi-Tenant Airflow Using Apache YuniKorn"
> > > > > >
> > > > > > It did not make it to the Summit (another talk of Amogh's did) - but I think back then we had not realized the potential of utilising YuniKorn to optimize workflows managed by Airflow.
> > > > > >
> > > > > > But we seem to have people in the community who know more about the YuniKorn <> Airflow relation (Amogh :) ) and could probably comment and add some "from the trenches" experience to the discussion.
> > > > > >
> > > > > > Here is the description of the talk that Amogh submitted:
> > > > > >
> > > > > > Multi-tenant Airflow is hard, and there have been novel approaches in the recent past to close this gap. A key obstacle in multi-tenant Airflow is the management of cluster resources. This is crucial to prevent one malformed workload from hijacking an entire cluster. It is also vital to restrict users and groups from monopolizing resources in a shared cluster with their workloads.
> > > > > >
> > > > > > To tackle these challenges, we turn to Apache YuniKorn, a K8s scheduler catering to all kinds of workloads. We leverage YuniKorn's hierarchical queues in conjunction with resource quotas to establish multi-tenancy both at the shared namespace level and within the individual namespaces where Airflow is deployed.
> > > > > >
> > > > > > YuniKorn also introduces Airflow to a new dimension of preemption. Now, Airflow workers can preempt resources from lower-priority jobs, ensuring critical schedules in our data pipelines are met without compromise.
> > > > > >
> > > > > > Join us for a discussion on integrating Airflow with YuniKorn, unraveling solutions to these multi-tenancy challenges. We will also share our past experiences while scaling Airflow and the steps we have taken to handle real-world production challenges in equitable multi-tenant K8s clusters.
> > > > > >
> > > > > > I would love to hear what you think about it.
> > > > > > I know we are deep into Airflow 3.0 implementation - but this one can be discussed and implemented independently, and maybe it's a good idea to start earlier rather than later if we see that it has good potential.
> > > > > >
> > > > > > J.