Re: Discuss: AIP-67 (multi team) now that AIP-82 (External event driven dags) exists

Jarek Potiuk Mon, 07 Jul 2025 11:17:07 -0700

I realized that I owe Niko an explanation of the configuration changes. Again,
following the philosophy above: the minimal set of changes to "airflow
internals" that will work. The change I propose below makes **no** changes to
the way the current "shared" configuration feature works. It only changes the
way executors retrieve their configuration when they are configured
"per-team", and we can bank 100% on the existing multi-executor support.
I believe that will absolutely minimise the set of changes needed to
implement multi-team, and we will be able to get it "faster" and with "far
lower risk" of impacting airflow code and, say, the 3.1 or 3.2 delivery.


The existing multi-executor configuration will be extended to include a team
prefix. The prefix will be separated with ":", and entries for different teams
will be separated with ";".

[core]

executor = team1:KubernetesExecutor,my.custom.module.ExecutorClass;team2:CeleryExecutor
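
To make the format concrete, here is a minimal Python sketch of how such a value could be split into teams and executors. It is not the actual Airflow implementation: the function name `parse_team_executors` is hypothetical, and treating an entry without a ":" prefix as a non-team (global) entry is an assumption that the proposal does not spell out.

```python
from typing import Dict, List, Optional


def parse_team_executors(value: str) -> Dict[Optional[str], List[str]]:
    """Split a proposed ``[core] executor`` value into {team: [executors]}.

    Hypothetical helper, for illustration only.
    """
    teams: Dict[Optional[str], List[str]] = {}
    # Entries for different teams are separated with ";".
    for entry in value.split(";"):
        entry = entry.strip()
        if not entry:
            continue
        # The team prefix is separated from the executor list with ":".
        prefix, sep, names = entry.partition(":")
        if sep:
            team: Optional[str] = prefix.strip()
        else:
            # Assumption: no prefix means the entry is not team-specific.
            team, names = None, entry
        # Executors within one team keep the existing "," separator.
        executors = [name.strip() for name in names.split(",") if name.strip()]
        teams.setdefault(team, []).extend(executors)
    return teams


parsed = parse_team_executors(
    "team1:KubernetesExecutor,my.custom.module.ExecutorClass;team2:CeleryExecutor"
)
```

With the value from the example above, `parsed` maps `team1` to two executors and `team2` to `CeleryExecutor`.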

The configuration of executors will also be prefixed with the same team:

[team1:kubernetes_executor]

api_client_retry_configuration = { "total": 3, "backoff_factor": 0.5 }

Environment variables holding this configuration will use ___ (three
underscores) in place of ":". For example:

AIRFLOW__TEAM1___KUBERNETES_EXECUTOR__API_CLIENT_RETRY_CONFIGURATION
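
A minimal sketch of the naming rule, assuming the existing `AIRFLOW__{SECTION}__{KEY}` convention stays as it is and only the ":" in the section name is replaced with three underscores. The helper name `team_env_var` is hypothetical and not part of Airflow.

```python
def team_env_var(section: str, key: str) -> str:
    """Build the environment variable name for a config section and key.

    Hypothetical helper: a team-prefixed section such as
    "team1:kubernetes_executor" has its ":" replaced with "___".
    """
    env_section = section.replace(":", "___").upper()
    return f"AIRFLOW__{env_section}__{key.upper()}"


name = team_env_var("team1:kubernetes_executor", "api_client_retry_configuration")
```

Sections without a team prefix are unaffected, so `team_env_var("core", "executor")` still yields the familiar `AIRFLOW__CORE__EXECUTOR`.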


J.

On Thu, Jul 3, 2025 at 8:47 AM Jarek Potiuk <ja...@potiuk.com> wrote:

> > The direction this one is taking is interesting. If you're really just
> trying to make the feature barely possible and mostly targeted towards
> managed providers to implement the rest, then I suppose this hits the mark.
>
> Well, actually, with the direction I took, it's not "mostly for managed
> providers". I see it as equally for managed providers and on-prem
> users. But also, following the open-source spirit, philosophically, I think
> that in Airflow any such change should be done with those things in mind,
> because we are at the stage where we are already "established" and by
> innovating on top of what we have we sometimes have more to lose than to
> gain. So I feel that with "deployment" features we should be very careful to
> distinguish "enabling things" vs. "doing things". My focus with this
> iteration was to remove all the roadblocks that make it impossible (or
> extremely difficult) to implement "real" multi-team and separation without
> modifying airflow core. I thought: "what is the minimal set of features that
> will make it "possible" for someone motivated to deploy a single airflow
> for multiple teams?"
>
> * minimise the increase in maintenance effort
> * do not "spoil" the "simple case" - we do not want to add features that
> make the "simple" implementation more complex; the current `docker run -it
> apache/airflow standalone` should stay simple and straightforward to run
> * if there is anything that involves complex deployment, we should not aim
> to make a "turn-key" solution that we will have to support - similarly to
> what we do with our configuration parameters: we have 100s of knobs to turn,
> and as long as default settings are reasonable and someone "motivated" can
> configure and fine-tune, this configuration and fine-tuning should be left
> to them, regardless of whether they are on-prem or managed. And both should
> be able to do it.
>
> I think it's not only smart technically (we support the low-level basic
> features, and when someone puts them together and makes it more of a
> turn-key solution they are responsible for designing and implementing it,
> so we have less maintenance effort), but it's also good from a simple
> "open-source business model" point of view - i.e. it's a smart product
> decision we should make.
>
> Why is airflow #18 in OSS rank? Of course we have a huge community and
> people contributing in their free time, completely voluntarily. And we
> cherish, support and encourage it. But let's be honest - if all those
> that make business on top of airflow did not invest literally millions of
> dollars (in terms of engineering salaries, sponsoring Airflow Summit, and
> supporting people like me who can foster a good "community spirit", which at
> least some smart stakeholders understand the value of), Airflow would have
> an order of magnitude less activity and reach, and Airflow 3 would simply
> not be possible. And it is a good thing that we have those stakeholders that
> are interested and make money by turning Airflow into a "turn-key" solution.
> This is a fantastic, symbiotic relationship.
>
> So my thinking is - we should NOT build things that make airflow
> more turn-key for those complex cases. We should leave it up to those who
> want to build it and want to charge money for it. It is cool and great
> that they can do that - and we should not do it "for them". On the other
> hand, we should make it possible for those who want to turn airflow into
> something more complex (say a multi-team solution) to make it happen, by
> providing them with the minimal set of features that make it possible.
>
> And that also - in a way - keeps the balance between on-prem and managed
> implementation.
>
> Something that I've learned as a rule of thumb is that making a feature
> "generic" compared to a custom implementation is 3x-10x more expensive (both
> in implementation and maintenance). It means that if an on-prem user
> wants to implement something for themselves (say a turn-key multi-team
> solution for their case) it will cost `x`, but when a managed provider wants
> to implement a generic multi-team it will cost `10x`. But managed
> providers can spread the cost over the premium they charge their
> users, who then don't have to manage Airflow on their own and pay `x`
> to develop this multi-team feature on their own. And this is a "fair"
> choice for on-prem users to make. They might choose what they want to do
> then. It's also fair for managed providers - yes, they need to invest more,
> but they also have a chance to shine by promoting it and making it more
> optimised at scale etc. etc.
>
> That is my line of thinking.
>
>
> J.
>
>
> On Thu, Jul 3, 2025 at 1:41 AM Oliveira, Niko <oniko...@amazon.com.invalid>
> wrote:
>
>> Hey Jarek,
>>
>>
>> The direction this one is taking is interesting. If you're really just
>> trying to make the feature barely possible and mostly targeted towards
>> managed providers to implement the rest, then I suppose this hits the mark.
>>
>> But this is not something we're asking for at Amazon and personally I
>> think we should make the feature reasonably usable for those running
>> self-managed OSS Airflow as well. There are many users running an on-prem
>> Airflow. Getting too hyper-fixated on an implementation that's so
>> simplified that it's obtuse and difficult to use by most users seems like
>> the wrong approach to me. But you and I have already discussed this at
>> length and I haven't convinced you so far, so if I'm the only one with this
>> thinking then I'm happy to disagree and commit as we say at Amazon :)
>>
>>
>> > So I would be rather strong on **not** touching the current
>> > configuration and simply adding configuration for per-team executors in
>> > executor config - even if it is uglier and more "low-level".
>>
>> Can you explain what "adding configuration for per-team executors in
>> executor config" would look like? I don't have a concrete sense of what you
>> mean by this.
>>
>> Thanks for your efforts on trying to get this feature agreed to and voted
>> on. Looking forward to working on the project in the coming weeks!
>>
>> Cheers,
>> Niko
>>
>> ________________________________
>> From: Jarek Potiuk <ja...@potiuk.com>
>> Sent: Tuesday, July 1, 2025 10:26:55 PM
>> To: dev@airflow.apache.org
>> Subject: RE: [EXT] Discuss: AIP-67 (multi team) now that AIP-82 (External
>> event driven dags) exists
>>
>> Any last comments ? There is a long weekend coming up in the US, so I will
>> likely start voting on the updated AIP on Monday 7th.
>>
>> On Fri, Jun 27, 2025 at 12:41 PM Jarek Potiuk <ja...@potiuk.com> wrote:
>>
>> > I'd really love to finalise the discussion and put it up to a vote some
>> > time after the recording from the last dev call is posted - so that more
>> > context, details and the LONG discussion we had on it are available. There
>> > is no *huge* hurry - we have a strong dependency on Task Isolation and it
>> > seems that it will still take a bit of time to complete, so I would love
>> > to start voting in about a week's time - so that maybe at the next dev call
>> > we can "seal" the subject. Happy to see any more comments - especially from
>> > those who have opinions but had no opportunity to express them.
>> >
>> > I am personally very happy with the direction it took - simplification
>> and
>> > "MVP" kind of approach - also I invite the stakeholders of ours to take
>> a
>> > close look at the scope and what we really propose - I have a feeling
>> that
>> > we can balance it out - there is something we can make to make it not
>> > "worse" for the offerings they have. I think we have a really good
>> > symbiotic relationship here, and I would love to leverage that. For one
>> -
>> > my goal here is to have a minimum number of changes that are impacting
>> > maintainability of the open-source airflow - but mostly "opening up some
>> > possibilities" - rather than provide turn-key solutions. And mostly
>> because
>> > this is good for all sides - less maintenance and complexity for OSS
>> > maintainers, but more opportunities to make it into "turn-key"
>> solutions by
>> > the stakeholders, while also allowing the "on-prem" users - if they are
>> > highly motivated - to use those features by adding the "turn-key" layer
>> on
>> > their own. Also adding multi-team should not be at the expense of
>> "simple"
>> > installations - they should be virtually unaffected.
>> >
>> > One example of applying this is cutting on "separate config files". I
>> > think it moves us closer to a "turn-key" solution but it is not really
>> > necessary to achieve the three goals above - that's why in the current
>> > proposal this part is completely removed - Sorry Niko, but I still think
>> > it's one of the things that falls into this bucket. We can easily remove
>> > it, they complicate code, documentation and options the users have, and
>> > even if it is a "little" more complex to manage configuration by
>> motivated
>> > users, it's also an opportunity for "turn-key" option that stakeholders
>> can
>> > build in their products - and we do not have to maintain it in the
>> > open-source. So I would be rather strong on **not** touching the current
>> > configuration and simply adding configuration for per-team executors in
>> > executor config - even if it is uglier and more "low-level".
>> >
>> > So if there are some constructive ideas on what can be done to make it
>> > "simpler" and less "turn-key" in that respect - I would highly value
>> such
>> > ideas and comments. If we can cut down something more that is not
>> > "necessary" for the three primary goals I came up with - I am more than
>> > happy to do it.
>> >
>> > Just to remind - those are the "extracted" goals. I slightly updated
>> them
>> > and added to the preamble of the AIP:
>> >
>> > * less operational overhead for managing multi-team (once AIP-72 is
>> > complete) where separate execution environments are important
>> > * virtual assets sharing between teams
>> > * ability of having "admin" and "team sharing" capability where dags from
>> > multiple teams can be seen in a single Airflow UI (requires custom RBAC and
>> > an AIP-56 implementation of Auth Manager - with KeyCloak Auth Manager being
>> > a reference implementation)
>> >
>> > J.
>> >
>> >
>> > On Thu, Jun 26, 2025 at 10:53 AM Jarek Potiuk <ja...@potiuk.com> wrote:
>> >
>> >>
>> >>> One technical observation: Now that the dag table no longer has a
>> >>> team_id in it, what would the behaviour be when a DAG is attempted to
>> move
>> >>> between bundles? How do we detect this? (I’m not all convinced that we
>> >>> correctly detect duplicate dag ids across bundles today, so I wouldn’t
>> >>> assume or rely on the current behaviour.)
>> >>>
>> >>
>> >> Of course - yes, I realise that - that problem was also not handled in
>> >> the previous iteration to be honest. That is something that dag bundle
>> >> solution allows to solve eventually - but I do not think it's a
>> blocker for
>> >> the proposed implementation. We will have to eventually add some way of
>> >> blocking dags to jump between bundles, we might tackle this
>> separately. I
>> >> already wanted to propose a separate update to that - but I did not
>> want to
>> >> complicate the current proposal. One thing at a time. I can, however -
>> if
>> >> you consider that as a blocker, extend the current AIP with it. Not a
>> big
>> >> problem. This is however a bit independent from the team_id
>> introduction.
>> >>
>> >>> Overall, I am still unconvinced this proposal has enough real user
>> >>> benefit over actually separate deployments, and on balance of the added
>> >>> complexity and maintenance burden I do not think it is worth it.
>> >>>
>> >>
>> >> That makes me sad. I thought that over the course of the discussion I
>> >> addressed all the concerns. In this case the concern was "is it worth
>> >> the cost for little benefit", but when I addressed it and heavily limited
>> >> the impact, the concern became "is it worth it at all, as the changes are
>> >> really minimal". Surely, anyone can change and adapt their concerns over
>> >> time, but this one seems like an ever-moving target. I hoped at least for
>> >> some acknowledgment that some concerns (complexity in this case) are
>> >> addressed, but it seems that you are deeply convinced that we do not need
>> >> multi-team at all. That is in stark contrast with at least a dozen
>> >> bigger and smaller users of Airflow who submitted talks to Airflow Summit
>> >> (including about 5 or 6 submissions for Airflow 2025) on how they spent
>> >> their engineering effort, time and money on trying to achieve something
>> >> similar. They assessed that it's worth it; you assess that it's not.
>> >> Somehow I trust our users that they were not spending the money, time and
>> >> engineering effort to achieve this because they wanted to spend more money.
>> >> I think they assessed it's worth it. So I want to make it a bit easier and
>> >> more "proper" for them to do that.
>> >>
>> >>>
>> >>> Upgrades: it is not easier to upgrade under this multi team proposal,
>> >>> but much much harder. This is based on hard earned experience from
>> helping
>> >>> Astronomer users — having to coordinate upgrades between multiple
>> teams
>> >>> turns in to a months long slog of the hardest kind of work —  people
>> work:
>> >>> getting other teams to agree to do things that they don’t directly
>> care
>> >>> about — “It’s working for me, I don’t care about upgrading, we’ll get
>> to it
>> >>> next quarter” is a refrain I’ve heard many times.
>> >>>
>> >>
>> >> Yes, absolutely - this is why we deferred it until we knew what shape
>> >> task isolation and the other AIPs we depend on take. Because it is clear
>> >> that pretty much all the problems you explain above are going to be solved
>> >> with task isolation. And it's not just my opinion. If you want to argue
>> >> with it, you likely need to argue with yourself:
>> >> https://github.com/apache/airflow/issues/51545#issuecomment-2980038478
>> >> Let me quote what you wrote there last week:
>> >>
>> >> Ash Berlin Taylor wrote:
>> >>
>> >> > A tight coupling between task-sdk and any "server side" component is
>> >> the opposite to one of the goals of AIP-72 (I'm not sure we ever
>> explicitly
>> >> said this, but the first point of motivation for the AIP says
>> "Dependency
>> >> conflicts for administrators supporting data teams using different
>> versions
>> >> of providers, libraries, or python packages")
>> >> > In short, my goal with TaskSDK, and the reason for introducing CalVer
>> >> and Cadwyn with the execution API is to end up in a world where you can
>> >> upgrade the Airflow Scheduler/API server independently of any worker
>> >> nodes (with the exception that the server must be at least as new as
>> the
>> >> clients)
>> >> > This ability to have version-skew is pretty much non-negotiable to me
>> >> and is (other than other languages) one of primary benefits of AIP-72
>> >>
>> >> If you read your own quote, it basically means "it will be easy
>> >> to upgrade airflow independently of workers". So I am a bit confused here.
>> >> Yes, I agree it was difficult, but you yourself explain that when AIP-72
>> >> (which, since AIP-67 was accepted, has always been a prerequisite of it)
>> >> is complete it will be "easy". So I am not sure why you are bringing it up
>> >> now. We assume AIP-72 will be completed and this problem will be gone.
>> >> Let's not mention it any more please.
>> >>
>> >>> The true separation from TaskSDK will likely only land in about 3.2 time
>> >>> frame. We are actively working on it, but it’s a slow process of untangling
>> >>> lots of assumptions made in the code base over the years. Maybe once we
>> >>> have that my view would be different, but right now I think this makes the
>> >>> proposal a non-starter. Especially as you are saying that most teams will
>> >>> have unique connections. If they’ve got those already, then having an asset
>> >>> trigger use those conns to watch/poll for activity is a much easier
>> >>> solution to operate and crucially, to scale and upgrade.
>> >>>
>> >>
>> >> Yes. I perfectly understand that and I am fully aware of potentially
>> 3.2
>> >> time-frame. And that's fine. Actually I heartily invite you to listen
>> to
>> >> the part of my talk from Berlin Buzzwords when I was asked for the
>> timeline
>> >> - https://youtu.be/EyhZOnbwc-4?t=2226 - this link leads to the exact
>> >> timeline in my talk . My answer was basically - "3.1" or "3.2", and I
>> >> sincerely hope "3.1" but we might not be able to complete it because we
>> >> have other things to do (other - is indeed the Task Isolation work
>> that you
>> >> are leading). And that's perfectly fine. And it absolutely does not
>> prevent
>> >> us from voting on the AIP now - similarly as we voted on the previous
>> >> version of the AIP - knowing that it has some prerequisites a few
>> months
>> >> ago. Especially that we know that the feature we need from task
>> isolation
>> >> is "non-negotiable". I.e. it WILL happen. We don't hope for it, we
>> know it
>> >> will be there. Those are your own words.
>> >>
>> >>
>> >>> >  I think we can’t compare AIP-82 to sharing virtual assets due to
>> >>> complexity of it.
>> >>>
>> >>> Virtual Assets was a mistake, and not how users actually want to use
>> >>> them. Mea culpa
>> >>>
>> >>
>> >> This is the first time I hear this - certainly you never raised this
>> >> concern on the devlist. So if you have some concerns about virtual assets,
>> >> I think you should raise them on the devlist, because I think everyone here
>> >> is missing some conversation (or maybe it's just your private opinion that
>> >> you never shared with anyone, but maybe it's worth sharing). I would be
>> >> interested to hear how the absolutely most successful
>> >> feature of airflow 2 was a mistake. According to the 2024 survey
>> >> https://airflow.apache.org/blog/airflow-survey-2024/ - 48% of Airflow
>> >> users have been using it, even if it was added as one of the last
>> >> big features of Airflow 2. It's the MOST used feature out of all the
>> >> features out there. I would be really curious to see how it was a mistake
>> >> (but please start a separate thread explaining why you think it was a
>> >> mistake, what your data points are and what you think should be fixed).
>> >> Just dropping "virtual assets were a mistake" in the middle of a multi-team
>> >> conversation seems completely unjustified without knowing what you are
>> >> talking about. So I think, until we know more, this argument has no basis.
>> >>
>> >>
>> >>>
>> >>> To restate my points:
>> >>>
>> >>> - Sharing a deployment between teams today/in 3.1 is operationally
>> more
>> >>> complex (both scaling, and upgrades) — this is a con, not a plus.
>> >>>
>> >>
>> >> Surely. But it will be easier when AIP-72 is complete (which I am
>> >> definitely looking forward to and which, as clearly explained in AIP-67, is
>> >> a prerequisite of it). Nothing changed here.
>> >>
>> >>
>> >>> - The main user benefit appears to be “allow teams’ DAGs to
>> communicate
>> >>> via Assets”, in which case we can do that today by putting more work
>> in to
>> >>> AIP-82’s Asset triggers
>> >>>
>> >>
>> >> No. Lower operational complexity for multi-teams (providing that we
>> >> deliver AIP-72) is another benefit. Virtual assets is another, and
>> since
>> >> there is no ground in "virtual assets is a mistake" statement (not
>> until
>> >> you explain what you mean by that in a separate discussion) - this is
>> also
>> >> still a very valid point.
>> >>
>> >>
>> >>> Soon, we will then be asked about cross-team governance, policy
>> >>> enforcement, and potentially unbounded edge cases (e.g., team-specific
>> >>> secrets, roles, quotas). Again, you get this for free with truly separate
>> >>> deployments already
>> >>> allow different teams to use different executors (including multiple
>> >>> executors per-team following AIP-61)
>> >>>
>> >>
>> >> Not really. We very explicitly say in the AIP that this is not a goal and
>> >> that we have no plans for it. And yes, using separate executors per team is
>> >> actually back in AIP-67 in case you did not notice (and the code needed
>> >> for it is even implemented and merged already in main by Vincent).
>> >>
>> >>
>> >>> Provably not true right now, and until ~3.2 delivers the full Task
>> >>> SDK/Core dependency separation this would be _more_ work to upgrade,
>> not
>> >>> less, and that work is not shared but still on a central team.
>> >>>
>> >>
>> >> Absolutely - we will wait for AIP-72 completion. I do not want to say
>> 3.1
>> >> or 3.2 directly - because there are - as you said - a lot of moving
>> pieces.
>> >> So my target for multi-team is "After AIP-72 is completed". Full stop.
>> But
>> >> there is nothing wrong with accepting the AIP now and doing preparatory
>> >> work in parallel. Similarly as there is no way to have a baby in 1
>> month by
>> >> 9 women, there is no way adding more effort to task-sdk isolation will
>> >> speed it up - we already have not only 3 people (you leading it, Kaxil
>> and
>> >> Amog) but also all the help from me and even 10s of different
>> contributors
>> >> (for example with the recent db_test cleanup that I took leadership
>> on) -
>> >> and there are people who wish to work on adding multi-team features.
>> Since
>> >> the design heavily limits impact on airflow codebase and interactions
>> with
>> >> task-sdk implementation, there is nothing wrong with starting
>> >> implementation in parallel either- amazon team is keen to move it
>> forward -
>> >> they even already implemented SQS trigger for assets, and we are
>> working
>> >> together on FAB removal, Keycloak authentication manager - and they
>> seem to
>> >> still have capacity and drive to progress multi-team. So I am not sure
>> if
>> >> we are trading off something. There is no "if we work more on task sdk
>> >> and drop multi-team, things will be faster". Generally in open source people
>> >> work in the area where they feel they can provide the best value - such as
>> >> you working on task-sdk and me on CI and dev env - and they will deliver
>> >> more value on multi-team.
>> >>
>> >>
>> >>>
>> >>> So please, as succinctly as possible, please tell me what the direct
>> >>> benefit to users this proposal is over us putting this effort in to
>> writing
>> >>> better Asset triggers instead?
>> >>>
>> >>
>> >>
>> >> * less operational overhead for managing multi-team (once AIP-72 is
>> >> complete) where separate execution environments are important
>> >> * virtual assets sharing
>> >> * ability of having "admin" and "team sharing" capability where dags
>> from
>> >> multiple teams can be seen in a single Airflow UI  (requires custom
>> RBAC)
>> >>
>> >> None of this can be done via better asset triggers.
>> >>
>> >>
>> >>>
>> >>> > On 23 Jun 2025, at 10:57, Jarek Potiuk <ja...@potiuk.com> wrote:
>> >>> >
>> >>> > My counter-points:
>> >>> >
>> >>> >
>> >>> >> 1. Managing a multi team deployment is not materially different
>> from
>> >>> >> managing a deployment per team
>> >>> >>
>> >>> >
>> >>> > It's a bit easier - especially when it comes to upgrades (especially in
>> >>> > the case we are targeting, which is not multi-tenancy but several
>> >>> > relatively closely cooperating teams with different dependency
>> >>> > requirements and isolation needs).
>> >>> >
>> >>> > 2. The database changes were quite wide-reaching
>> >>> >>
>> >>> >
>> >>> > Yes. That is addressed.
>> >>> >
>> >>> >
>> >>> >> 3. I don’t believe the original AIP (again, I haven’t read the
>> updated
>> >>> >> proposal or recent messages on the thread. yet) will meet what many
>> >>> users
>> >>> >> want out of a multiteam solution
>> >>> >>
>> >>> >
>> >>> > I think we will only see when we try. A lot of people think they would,
>> >>> > even if they are warned. I know at least one user (Wealthsimple) who
>> >>> > definitely wants to use it, and they got a very detailed explanation of
>> >>> > the idea and understand it well. So I am sure that **some** users would.
>> >>> > But we do not know how many.
>> >>> >
>> >>> >
>> >>> >> To expand on those points a bit more
>> >>> >>
>> >>> >> On 1. The only components that are shared are, I think, the
>> scheduler
>> >>> and
>> >>> >> the API server, and it’s arguable if that is actually a good idea
>> >>> given
>> >>> >> those are likely to be the most performance sensitive components
>> >>> anyway.
>> >>> >>
>> >>> >> Additionally the fact that the scheduler is a shared component makes
>> >>> >> upgrading it almost a non starter as you would likely need buy-in, changes,
>> >>> >> and testing from ALL teams using it. I’d argue that this is a huge negative
>> >>> >> until we finish off the version independence work of AIP-72.
>> >>> >>
>> >>> >
>> >>> > Quite disagree here - especially that our target is that task-sdk is
>> >>> > supposed to provide all isolation that is needed. There should be 0
>> >>> changes
>> >>> > in the dags needed to upgrade scheduler, api_server, triggerer -
>> >>> precisely
>> >>> > because we introduced backwards-compatible task-sdk.
>> >>> >
>> >>> >> On 3 my complaint is essentially that this doesn’t go nearly far enough. It
>> >>> >> doesn’t allow read only views to other teams dags. I don’t think it allows
>> >>> >> you to be in multiple teams at once. You can’t share a connection between
>> >>> >> teams but only allow certain specified dags to access it; it would have to
>> >>> >> either be globally usable, or duplicated-and-kept-in-sync between teams. In
>> >>> >> short I think it falls short of being useful.
>> >>> >>
>> >>> >
>> >>> > Oh, absolutely all of that is possible (except sharing single connections
>> >>> > between multiple teams - which is a very niche use case, and duplication
>> >>> > here is perfectly ok as a first approximation - and if we need more we can
>> >>> > add it later).
>> >>> >
>> >>> > Auth manager RBAC and access is abstracted away, and the KeyCloak Manager
>> >>> > implemented by Vincent allows managing completely independent and separate
>> >>> > RBAC based on arguments and resources provided by Airflow. There is nothing
>> >>> > to prevent the user who configures KeyCloak RBAC from defining it this
>> >>> > way:
>> >>> >
>> >>> > if group a > allow to read a and write b
>> >>> > if group b > allow to write b but not a
>> >>> >
>> >>> > and any other combinations. The KeyCloak implementation - pretty advanced
>> >>> > already - (and the design of auth manager) completely abstracts away both
>> >>> > authentication and authorization to KeyCloak, and KeyCloak has RBAC
>> >>> > management built in. Also any of the users can write their own - even
>> >>> > hard-coded - authentication manager to do the same if they do not want to
>> >>> > have configurable KeyCloak. Even SimpleAuthManager could be hard-coded to
>> >>> > provide those features.
>> >>> >
>> >>> >
>> >>> >>
>> >>> >> So on the surface, I’m no more in favour of using dag bundle as a
>> >>> >> replacement for team id as I think most of the above points still
>> >>> stand.
>> >>> >>
>> >>> >
>> >>> > We disagree here.
>> >>> >
>> >>> >>
>> >>> >> My counter proposal: We do _nothing_ to core airflow. We work on improving
>> >>> >> the event-based triggering of dags (write more triggers to read/check remote
>> >>> >> Assets etc) so that teams can have 100% isolated deployments but still
>> >>> >> trigger dags based on asset events from other teams.
>> >>> >>
>> >>> >
>> >>> > That does not solve any of the other design goals - only allows to
>> >>> trigger
>> >>> > assets a bit more easily (but also it's not entirely solved by
>> AIP-82
>> >>> > because it does not solve virtual assets - only ones that have
>> defined
>> >>> > triggerer and "something" to listen on - which is way more complex
>> than
>> >>> > just defining asset in a Dag and using it in another). I think we
>> can't
>> >>> > compare AIP-82 to sharing virtual assets due to complexity of it. I
>> >>> > explained it in the doc.
>> >>> >
>> >>> >
>> >>> > I will now go and catch up with the long thread and updated proposal
>> >>> and
>> >>> >> come back.
>> >>> >>
>> >>> >
>> >>> > Please. I hope the above explanation will help in better understanding of
>> >>> > the proposal, because I think you had some assumptions that do not hold any
>> >>> > more with the new proposal.
>> >>> >
>> >>> > J.
>> >>> >
>> >>> >
>> >>> >>
>> >>> >>> On 23 Jun 2025, at 05:54, Jarek Potiuk <ja...@potiuk.com> wrote:
>> >>> >>>
>> >>> >>> Just to clarify the relation - I updated the AIP now to refer to
>> >>> AIP-82
>> >>> >> and
>> >>> >>> to explain relation between the "cross-team" and "cross-airflow"
>> >>> asset
>> >>> >>> triggering - this is what I added:
>> >>> >>>
>> >>> >>> Note that there is a relation between AIP-82 ("External Driven
>> >>> >> Scheduling")
>> >>> >>> and this part of the functionality. When you have multiple
>> instances
>> >>> of
>> >>> >>> Airflow, you can use shared datasets - "Physical datasets" - that
>> >>> several
>> >>> >>> Airflow Instances can use - for example there could be an S3
>> object
>> >>> that
>> >>> >> is
>> >>> >>> produced by one airflow instance, and consumed by another. That
>> >>> requires
>> >>> >>> deferred trigger to monitor for such datasets, and appropriate
>> >>> >> permissions
>> >>> >>> to the external dataset, and you could achieve a similar result to
>> >>> >> cross-team
>> >>> >>> dataset triggering (but cross airflow). However the feature of
>> >>> sharing
>> >>> >>> datasets between the teams also works for virtual assets, that do
>> not
>> >>> >> have
>> >>> >>> physically shared "objects" and trigger that is monitoring for
>> >>> changes in
>> >>> >>> such asset.
>> >>> >>>
>> >>> >>> J.
>> >>> >>>
>> >>> >>>
>> >>> >>> On Mon, Jun 23, 2025 at 6:38 AM Jarek Potiuk <ja...@potiuk.com>
>> >>> wrote:
>> >>> >>>
>> >>> >>>>> From a quick glance, the updated AIP didn't seem to have any
>> >>> reference
>> >>> >> to
>> >>> >>>>> AIP-82, which surprised me, but will take a more detailed read
>> >>> through.
>> >>> >>>>
>> >>> >>>> Yep. It did not - because I did not think it was needed or even
>> very
>> >>> >>>> important after the simplifications. AIP-82 has a different
>> scope,
>> >>> >> really.
>> >>> >>>> It only helps when the Assets are "real" data files which we have
>> >>> >> physical
>> >>> >>>> triggers for, it's slightly related - sharing datasets between
>> teams
>> >>> >>>> (including those that do not require physical files and
>> triggers) is
>> >>> >> still
>> >>> >>>> possible in the design we have now, but it's not (and never was)
>> the
>> >>> >>>> **only** reason for having multi-team. There always was (and
>> still
>> >>> is)
>> >>> >> the
>> >>> >>>> possibility of having common, distinct environments (i.e.
>> >>> dependencies
>> >>> >>>> and providers) per team, the possibility of having connections
>> and
>> >>> >>>> variables that are only accessible to one team and not the other,
>> >>> and
>> >>> >>>> isolating workload execution (all that while allowing to manage
>> >>> multiple
>> >>> >>>> teams and schedule things with a single deployment). That did not
>> >>> change.
>> >>> >> What
>> >>> >>>> changed a lot is that it is now way simpler, something that we
>> can
>> >>> >>>> implement without heavy changes to the codebase - and give it to
>> our
>> >>> >> users,
>> >>> >>>> so that they can assess if this is something they need without
>> too
>> >>> much
>> >>> >>>> risk and effort.
>> >>> >>>>
>> >>> >>>> This was - I believe - the main concern, that the value we get from
>> >>> it is
>> >>> >>>> not dramatic, but the required changes are huge. This "redesign"
>> >>> changes
>> >>> >>>> the equation - the value is still unchanged, but the cost of
>> >>> >> implementing
>> >>> >>>> it and impact on the Airflow codebase is much smaller. I still
>> have
>> >>> not
>> >>> >>>> heard back from Ash if my proposal responds to his original
>> concern
>> >>> >> though,
>> >>> >>>> so I am mostly guessing (also based on the positive impact of
>> >>> others)
>> >>> >> that
>> >>> >>>> yes it does. But to be honest I am not sure and I would love to
>> hear
>> >>> >> back,
>> >>> >>>> I decided to update the AIP to reflect it - regardless, because I
>> >>> think
>> >>> >> the
>> >>> >>>> simplification I proposed keeps the original goals, but is indeed
>> >>> way
>> >>> >>>> simpler.
>> >>> >>>>
>> >>> >>>>> This is a very difficult thread to catch up on.
>> >>> >>>>
>> >>> >>>> Valid point. Let me summarize what is the result:
>> >>> >>>>
>> >>> >>>> * I significantly simplified the implementation proposal
>> comparing
>> >>> to
>> >>> >> the
>> >>> >>>> original version
>> >>> >>>> * main simplification is very limited impact on existing
>> database -
>> >>> >>>> without "ripple effect" that would require us to change a lot of
>> >>> tables,
>> >>> >>>> including their primary keys, and heavily impact the UI
>> >>> >>>> * this is now more of an incremental change that can be
>> implemented
>> >>> way
>> >>> >>>> faster and with far less risk
>> >>> >>>> * updated idea is based on leveraging bundles (already part of
>> our
>> >>> data
>> >>> >>>> model) to map them (many-to-one) to a team - which requires to
>> just
>> >>> >> extend
>> >>> >>>> the data model with bundle mapping and add team_id to connections
>> >>> and
>> >>> >>>> variables. Those are all needed DB changes.
>> >>> >>>>
>> >>> >>>> The AIP is updated - in one single big change, so it should be
>> >>> easy to
>> >>> >>>> compare the changes:
>> >>> >>>>
>> >>> >>
>> >>>
>> https://cwiki.apache.org/confluence/pages/viewpreviousversions.action?pageId=294816378
>> >>> >>>> -> I even named the version appropriately "Simplified multi-team
>> >>> AIP" -
>> >>> >> you
>> >>> >>>> can select and compare v.65 with v.66 to see the exact
>> differences I
>> >>> >>>> proposed.
>> >>> >>>>
>> >>> >>>> I hope it will be helpful to catch up and for those who did not
>> >>> follow,
>> >>> >> to
>> >>> >>>> be able to make up their minds about it.
>> >>> >>>>
>> >>> >>>> J.
>> >>> >>>>
>> >>> >>>>
>> >>> >>>>
>> >>> >>>> On Mon, Jun 23, 2025 at 4:35 AM Vikram Koka
>> >>> >> <vik...@astronomer.io.invalid>
>> >>> >>>> wrote:
>> >>> >>>>
>> >>> >>>>> This is a very difficult thread to catch up on.
>> >>> >>>>> I will take a detailed look at the AIP update to try to figure
>> out
>> >>> the
>> >>> >>>>> changes in the proposal.
>> >>> >>>>>
>> >>> >>>>> From a quick glance, the updated AIP didn't seem to have any
>> >>> reference
>> >>> >> to
>> >>> >>>>> AIP-82, which surprised me, but will take a more detailed read
>> >>> through.
>> >>> >>>>>
>> >>> >>>>> Vikram
>> >>> >>>>>
>> >>> >>>>>
>> >>> >>>>>
>> >>> >>>>> On Sun, Jun 22, 2025 at 1:44 AM Pavankumar Gopidesu <
>> >>> >>>>> gopidesupa...@gmail.com>
>> >>> >>>>> wrote:
>> >>> >>>>>
>> >>> >>>>>> Thanks Jarek, that's a great update on this AIP, now it's much
>> >>> more
>> >>> >> slimmed
>> >>> >>>>>> down.
>> >>> >>>>>>
>> >>> >>>>>> left a minor comment. :) Overall looking great.
>> >>> >>>>>>
>> >>> >>>>>> Pavan
>> >>> >>>>>>
>> >>> >>>>>> On Sat, Jun 21, 2025 at 3:10 PM Jens Scheffler
>> >>> >>>>> <j_scheff...@gmx.de.invalid
>> >>> >>>>>>>
>> >>> >>>>>> wrote:
>> >>> >>>>>>
>> >>> >>>>>>> Thanks for the rework/update of the AIP-67!
>> >>> >>>>>>>
>> >>> >>>>>>> Just a few small comments but overall I like it as it is much
>> >>> leaner
>> >>> >>>>>>> than originally planned and is at a level of complexity that
>> it
>> >>> >> really
>> >>> >>>>>>> seems to be a benefit to close the gap as described.
>> >>> >>>>>>>
>> >>> >>>>>>> On 21.06.25 14:52, Jarek Potiuk wrote:
>> >>> >>>>>>>> I updated the AIP - including architecture images and
>> reviewed
>> >>> it
>> >>> >>>>>> (again)
>> >>> >>>>>>>> and corrected any ambiguities and places where it needed to
>> be
>> >>> >>>>> changed.
>> >>> >>>>>>>>
>> >>> >>>>>>>> I think the current state
>> >>> >>>>>>>>
>> >>> >>>>>>>
>> >>> >>>>>>
>> >>> >>>>>
>> >>> >>
>> >>>
>> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-67+Multi-team+deployment+of+Airflow+components
>> >>> >>>>>>>> - nicely describes the proposal.
>> >>> >>>>>>>>
>> >>> >>>>>>>> Comparing to the previous one:
>> >>> >>>>>>>>
>> >>> >>>>>>>> 1. The DB changes are far less intrusive - no ripple effect
>> on
>> >>> >>>>> Airflow
>> >>> >>>>>>>> 2. There is no need to merge configurations and provide
>> >>> different
>> >>> >>>>> set
>> >>> >>>>>> of
>> >>> >>>>>>>> configs per team - we can add it later but I do not see why
>> we
>> >>> need
>> >>> >>>>> it
>> >>> >>>>>> in
>> >>> >>>>>>>> this simplified version
>> >>> >>>>>>>> 3. We can still configure a different set of executors per
>> team
>> >>> -
>> >>> >>>>> that
>> >>> >>>>>> is
>> >>> >>>>>>>> already implemented (we just need to wire it to the bundle ->
>> >>> team
>> >>> >>>>>>> mapping).
>> >>> >>>>>>>>
>> >>> >>>>>>>> I think it will be way simpler and faster to implement this
>> way
>> >>> and
>> >>> >>>>> it
>> >>> >>>>>>>> should serve as MVMT -> Minimum Viable Multi Team that we can
>> >>> give
>> >>> >>>>> our
>> >>> >>>>>>>> users so that they can provide feedback.
>> >>> >>>>>>>>
>> >>> >>>>>>>> J.
>> >>> >>>>>>>>
>> >>> >>>>>>>>
>> >>> >>>>>>>>
>> >>> >>>>>>>>
>> >>> >>>>>>>> On Fri, Jun 20, 2025 at 8:33 AM Jarek Potiuk <
>> ja...@potiuk.com>
>> >>> >>>>> wrote:
>> >>> >>>>>>>>
>> >>> >>>>>>>>>
>> >>> >>>>>>>>>
>> >>> >>>>>>>>>> I like this iteration a bit more now for sure, thanks for
>> >>> being
>> >>> >>>>>>> receptive
>> >>> >>>>>>>>>> to feedback! :)
>> >>> >>>>>>>>>>
>> >>> >>>>>>>>>
>> >>> >>>>>>>>>> This now becomes quite close to what I was proposing before,
>> we
>> >>> now
>> >>> >>>>>> again
>> >>> >>>>>>>>>> have a team ID (which I think is really needed here, glad
>> to
>> >>> see
>> >>> >>>>> it
>> >>> >>>>>>> back)
>> >>> >>>>>>>>>> and it will be used for auth management, configuration
>> >>> >>>>> specification,
>> >>> >>>>>>> etc
>> >>> >>>>>>>>>> but will be carried by Bundle instead of the dag model.
>> Which
>> >>> as
>> >>> >>>>> you
>> >>> >>>>>>> say
>> >>> >>>>>>>>>> “For that we will need to make sure that both api-server,
>> >>> >>>>> scheduler
>> >>> >>>>>> and
>> >>> >>>>>>>>>> triggerer have access to the "bundle definition" (to
>> perform
>> >>> the
>> >>> >>>>>>> mapping)"
>> >>> >>>>>>>>>> which honestly doesn’t feel too much different from the
>> >>> original
>> >>> >>>>>>> proposal
>> >>> >>>>>>>>>> we had last week of adding it to Dag table and ensuring
>> it’s
>> >>> >>>>>> available
>> >>> >>>>>>>>>> everywhere. but either way I’m happy to meet in the middle
>> and
>> >>> >>>>> keep
>> >>> >>>>>> it
>> >>> >>>>>>> on
>> >>> >>>>>>>>>> Bundle if everyone else feels that’s a more suitable
>> location.
>> >>> >>>>>>>>>>
>> >>> >>>>>>>>> I think the big difference is the "ripple effect" that was
>> >>> >>>>> discussed
>> >>> >>>>>> in
>> >>> >>>>>>>>>
>> >>> https://lists.apache.org/thread/78vndnybgpp705j6sm77l1t6xbrtnt5c
>> >>> >>>>>> (and I
>> >>> >>>>>>>>> believe - correct me if I am wrong Ash - important trigger
>> for
>> >>> the
>> >>> >>>>>>>>> discussion) so far what we wanted is to extend the primary
>> key
>> >>> and
>> >>> >>>>> it
>> >>> >>>>>>> would
>> >>> >>>>>>>>> ripple through all the pieces of Airflow -> models, API, UI
>> >>> etc.
>> >>> >>>>> ...
>> >>> >>>>>>>>> However - we already have "bundle_name" and "bundle_version"
>> >>> in the
>> >>> >>>>>> Dag
>> >>> >>>>>>>>> model. So I think when we add a separate table where we map
>> the
>> >>> >>>>> bundle
>> >>> >>>>>>> to
>> >>> >>>>>>>>> the team, the "ripple effect" will be almost 0. We do not
>> want
>> >>> to
>> >>> >>>>>> change
>> >>> >>>>>>>>> primary key, we do not want to change UI in any way (except
>> >>> >>>>> filtering
>> >>> >>>>>> of
>> >>> >>>>>>>>> DAGs available based on your team - but that will be
>> handled in
>> >>> >>>>> Auth
>> >>> >>>>>>>>> Manager and will not impact UI in any way). I think that's a
>> >>> huge
>> >>> >>>>>>>>> simplification of the implementation, and if we agree to it
>> - i
>> >>> >>>>> think
>> >>> >>>>>> it
>> >>> >>>>>>>>> should speed up the implementation significantly. There are
>> >>> only a
>> >>> >>>>>>> limited
>> >>> >>>>>>>>> number of times where you need to look up the team_id - so
>> >>> having
>> >>> >>>>> the
>> >>> >>>>>>>>> bundle -> team mapping in a separate table and having to
>> look
>> >>> them
>> >>> >>>>> up
>> >>> >>>>>>>>> should not be a problem. And it has much less complexity and
>> >>> >>>>>>>>> "ripple-effect" through the codebase (for example I could
>> >>> imagine
>> >>> >>>>> 100s
>> >>> >>>>>>> or
>> >>> >>>>>>>>> thousands of already written tests that would have to be
>> adapted
>> >>> if we
>> >>> >>>>>>> changed
>> >>> >>>>>>>>> the primary key - whereas there will be pretty much zero
>> impact
>> >>> on
>> >>> >>>>>>> existing
>> >>> >>>>>>>>> tests if we just add bundle -> team lookup table).
>> >>> >>>>>>>>>
>> >>> >>>>>>>>>
>> >>> >>>>>>>>>> One other thing I’d point out is that I think including
>> >>> executors
>> >>> >>>>> per
>> >>> >>>>>>>>>> team is a very easy win and quite possible without much
>> work.
>> >>> I
>> >>> >>>>>> already
>> >>> >>>>>>>>>> have much of the code written. Executors are already aware
>> of
>> >>> >>>>> Teams
>> >>> >>>>>>> that
>> >>> >>>>>>>>>> own them (merged), I have a PR open to have configuration
>> per
>> >>> team
>> >>> >>>>>>> (with a
>> >>> >>>>>>>>>> quite simple and isolated approach, I believe you approved
>> >>> Jarek).
>> >>> >>>>>> The
>> >>> >>>>>>> last
>> >>> >>>>>>>>>> piece is updating the scheduling logic to route tasks from
>> a
>> >>> >>>>>> particular
>> >>> >>>>>>>>>> Bundle to the correct executor, which shouldn’t be much
>> work
>> >>> >>>>> (though
>> >>> >>>>>> it
>> >>> >>>>>>>>>> would be easier if the Task models had a column for the
>> team
>> >>> they
>> >>> >>>>>>> belong
>> >>> >>>>>>>>>> to, rather than having to look up the Dag and Bundle to get
>> >>> the
>> >>> >>>>>> team) I
>> >>> >>>>>>>>>> have a branch where I was experimenting with this logic
>> >>> already.
>> >>> >>>>>>>>>> Anyway, long story short, I don’t think we necessarily
>> need
>> >>> to
>> >>> >>>>>> remove
>> >>> >>>>>>>>>> this piece from the project's scope if it is already partly
>> >>> done
>> >>> >>>>> and
>> >>> >>>>>>> not
>> >>> >>>>>>>>>> too difficult.
>> >>> >>>>>>>>>>
>> >>> >>>>>>>>> Yeah. I hear you here again. Certainly I would not want to
>> just
>> >>> >>>>>>>>> **remove** it from the code. And, yep I totally forgot we
>> have
>> >>> it
>> >>> >>>>> in.
>> >>> >>>>>>> And
>> >>> >>>>>>>>> if we can make it in, easily (which it seems we can) - we
>> can
>> >>> also
>> >>> >>>>>>> include
>> >>> >>>>>>>>> it in the first iteration. What I wanted to avoid really
>> (from
>> >>> the
>> >>> >>>>>>> original
>> >>> >>>>>>>>> design) - again trying to simplify it, limit the changes,
>> and
>> >>> >>>>> speed up
>> >>> >>>>>>>>> implementation. And there is one "complexity" that I wanted
>> to
>> >>> >>>>> avoid
>> >>> >>>>>>>>> specifically - having to have separate, additional
>> >>> configuration
>> >>> >>>>> per
>> >>> >>>>>>> team.
>> >>> >>>>>>>>> Not only because it complicates already complex
>> configuration
>> >>> >>>>> handling
>> >>> >>>>>>> (I
>> >>> >>>>>>>>> know we have PR for that) but mostly because if it is not
>> >>> needed,
>> >>> >>>>> we
>> >>> >>>>>> can
>> >>> >>>>>>>>> simplify documentation and explain to our users easier what
>> >>> they
>> >>> >>>>> need
>> >>> >>>>>>> to do
>> >>> >>>>>>>>> to have their own multi-team setup. And I am quite open to
>> >>> keeping
>> >>> >>>>>>>>> multiple-executors if we can avoid complicating
>> configuration.
>> >>> >>>>>>>>>
>> >>> >>>>>>>>> But I think some details of that and whether we really need
>> >>> >>>>> separate
>> >>> >>>>>>>>> configuration might also come as a result of updating the
>> AIP
>> >>> - I
>> >>> >>>>> am
>> >>> >>>>>> not
>> >>> >>>>>>>>> quite sure now if we need it, but we can discuss it when we
>> >>> >>>>> iterate on
>> >>> >>>>>>> the
>> >>> >>>>>>>>> AIP.
>> >>> >>>>>>>>>
>> >>> >>>>>>>>> J.
>> >>> >>>>>>>>>
>> >>> >>>>>>>>>
>> >>> >>>>>>>
>> >>> >>>>>>>
>> >>> ---------------------------------------------------------------------
>> >>> >>>>>>> To unsubscribe, e-mail: dev-unsubscr...@airflow.apache.org
>> >>> >>>>>>> For additional commands, e-mail: dev-h...@airflow.apache.org
>> >>> >>>>>>>
>> >>> >>>>>>>
>> >>> >>>>>>
>> >>> >>>>>
>> >>> >>>>
>> >>> >>
>> >>> >>
>> >>> >>
>> >>>
>> >>>
>>
>
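
For readers catching up on the thread: the bundle -> team lookup discussed
above (a separate mapping table, with no change to the Dag primary key) could
look roughly like the sketch below. This is only an illustration, not Airflow
code. The table names `dag` and `bundle_team`, their columns, the sample rows
and the helper `team_for_dag` are all hypothetical, and SQLite is used purely
so that the example runs on its own.

```python
import sqlite3

# Hypothetical schema for illustration only; this is NOT Airflow's real data model.
SCHEMA = """
CREATE TABLE dag (
    dag_id      TEXT PRIMARY KEY,   -- the Dag primary key is left untouched
    bundle_name TEXT NOT NULL
);
CREATE TABLE bundle_team (
    bundle_name TEXT PRIMARY KEY,   -- each bundle maps to exactly one team
    team_id     TEXT NOT NULL       -- several bundles may share one team
);
"""


def build_demo_db():
    """Create an in-memory database with made-up Dags, bundles and teams."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO dag VALUES (?, ?)",
        [
            ("etl_daily", "bundle_a"),
            ("ml_train", "bundle_b"),
            ("report", "bundle_c"),
            ("legacy", "bundle_x"),  # this bundle is not mapped to any team
        ],
    )
    conn.executemany(
        "INSERT INTO bundle_team VALUES (?, ?)",
        [("bundle_a", "team1"), ("bundle_b", "team1"), ("bundle_c", "team2")],
    )
    return conn


def team_for_dag(conn, dag_id):
    """Resolve a Dag's team through its bundle; None if there is no mapping."""
    row = conn.execute(
        "SELECT bt.team_id FROM dag AS d "
        "JOIN bundle_team AS bt ON bt.bundle_name = d.bundle_name "
        "WHERE d.dag_id = ?",
        (dag_id,),
    ).fetchone()
    return row[0] if row else None
```

Because the team is reached through a join on the bundle name, nothing that
keys on `dag_id` has to change, which is the "almost zero ripple effect"
argument made in the thread.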

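The remaining scheduling piece mentioned in the thread (routing tasks from a
particular bundle to the executor of the owning team) might be sketched as
below. Again this is a hedged illustration under stated assumptions, not the
actual scheduler logic: the `Task` class, the three lookup dictionaries and
`executor_for_task` are hypothetical, and falling back to a default executor
for bundles without a team is an assumption that the thread does not confirm.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """Hypothetical stand-in for a task; it only knows which Dag it belongs to."""
    task_id: str
    dag_id: str


# Hypothetical lookups; a real deployment would read these from the database
# and from the executor configuration.
DAG_TO_BUNDLE = {"etl_daily": "bundle_a", "report": "bundle_c", "legacy": "bundle_x"}
BUNDLE_TO_TEAM = {"bundle_a": "team1", "bundle_c": "team2"}
TEAM_TO_EXECUTOR = {"team1": "KubernetesExecutor", "team2": "CeleryExecutor"}

# Assumption: tasks whose bundle has no team use a shared default executor.
DEFAULT_EXECUTOR = "LocalExecutor"


def executor_for_task(task: Task) -> str:
    """Follow task -> Dag -> bundle -> team -> executor, with a default fallback."""
    bundle = DAG_TO_BUNDLE.get(task.dag_id)
    team = BUNDLE_TO_TEAM.get(bundle) if bundle is not None else None
    return TEAM_TO_EXECUTOR.get(team, DEFAULT_EXECUTOR)
```

The two-step lookup (Dag to bundle, then bundle to team) is the extra hop
pointed out in the thread; a team column on the task model would remove it,
at the cost of a wider schema change.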