On Fri, Jun 13, 2025 at 7:09 PM Vincent Beck <vincb...@apache.org> wrote:
> Thanks, Jarek, for this proposal. Overall, I really like it - it significantly simplifies multi-team support and removes the need to deploy additional components per team, without compromising on the core needs of users (unless I'm missing something).

Yep. I think with this "design" iteration I put "simplicity" and "maintainability" as the primary goals. Separate configuration per team goes out the window, the ripple effect on the DB goes out the window, and what's left is basically the same Airflow we already have, with a few modifications.

> > And if we do it and implement packaging and execution environments (say, the ability to choose a predefined venv to parse and execute DAGs coming from a specific bundle_id) - expectation 2) above can be handled well.
>
> Could you elaborate on this part? I'm not entirely clear on how it would work in practice. For instance, how would it behave with two teams or bundles? Real-world examples would help clarify this, unless it's more implementation details that we can flesh out once there's agreement on the general approach.

Currently, with the bundle definition we **just** define where the DAGs are coming from. But we could (and that was even part of the original design) add an extra "execution environment" configuration. For example, when we have bundle_a and bundle_b, each of them could have a separate "environment" specified (say env_a, env_b), and we could map each environment to a specific image (image_a, image_b) or to a virtualenv in the same image (/venv/a, /venv/b) that would be predefined in the processor/worker images (or in VMs, if images are not used). The environments might have different sets of dependencies (providers and others) installed, and both DAG processor parsing and the "worker" (a Celery worker or a K8s Pod) would run using that environment. See PS1 at the end of this mail for a rough sketch of what such a configuration could look like.

Initially, AIP-67 also discussed defining dependencies in the bundle and installing them dynamically (the way PythonVirtualenvOperator does) - but personally I think that having a predefined set of environments (more like ExternalPythonOperator) rather than creating them dynamically has much better maintainability, stability and security properties.

> Also, what about the triggerer? Since the triggerer runs user code, the original AIP-67 proposal required at least one triggerer per team. How would that be handled under this new architecture?

That is an excellent question :). There are a few options - depending on how much of point 4) "isolating workload" we want to implement. Paradoxically - to be honest - for me the triggerer always had the potential of being less of a problem when it comes to isolation. Yes, all triggers currently run not only in the same interpreter but also in the same event loop (which means that isolation goes out of the window), but it is relatively easy to introduce isolation, and we have discussed options for it in the past as well. I see quite a few.

Option 1) Simplest operationally - we could add a mode in Airflow that would resemble timetables. All triggers would have to be exposed via the plugin interface (we could easily expose all triggers from all our providers this way, in bulk). This means that the deployment manager would control what runs in the triggerer - effectively limiting it similarly to scheduler code today. That would prevent some of the other cases we discussed recently (such as sending serialized "notification" methods to the triggerer to execute), but those are mostly optimizations, and they could be sent as worker tasks instead in this case.
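To make that a bit more concrete, a purely hypothetical sketch. The "triggers" plugin attribute below does not exist today - plugins expose "timetables" exactly like this, and adding an equivalent registration point for triggers is precisely the new bit this option would introduce:

    # HYPOTHETICAL: "triggers" is not a real plugin attribute today -- it is
    # the new registration mechanism Option 1) would add, mirroring how
    # timetables are exposed via plugins.
    from airflow.plugins_manager import AirflowPlugin
    from airflow.providers.standard.triggers.temporal import TimeDeltaTrigger

    class AllowedTriggersPlugin(AirflowPlugin):
        name = "allowed_triggers"
        # Only trigger classes registered here could be deserialized and
        # run by the triggerer -- the deployment manager stays in control,
        # similarly to scheduler code today.
        triggers = [TimeDeltaTrigger]

Providers could register all their triggers in bulk the same way, so for most deployments this would be invisible.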
Option 2) Semi-isolation - for a number of our users, just separating processes might be "enough" (especially if we add cgroups to isolate the processes - we had that in the past). Not "perfect", and it does not have all the security properties, but for a number of our users it might be "good enough", because they trust their teams enough not to worry about potential "malicious actions". In this case a single triggerer could run several event loops - one per bundle, each of them in a separate, isolated process - and the only change we would have to make is to route the triggers to the right loop based on bundle id (see PS2 at the end for a rough sketch). Almost no increase in operational complexity, but greatly improved isolation. Again, following the bundle -> environment mapping, each of those processes could run using a specific "per-bundle" environment where all the necessary dependencies are installed. And here the limit on arbitrary code execution coming from DAGs can be lifted.

Option 3) Full isolation - simply run one triggerer per bundle. That is a bit more like the original proposal, because we will then have an extra triggerer for each bundle/team (or group of bundles - it does not have to be a 1-to-1 mapping, it could be many-to-1). But it should provide the full "security" properties, with isolation and separation of workload; each triggerer could run entirely in the environment defined for its bundle. It increases operational complexity - but just a bit. Rainbows and unicorns - we have it all.

Also, one more thing. We usually discuss technical aspects here on the devlist and rarely talk about "business". But I think in some cases this is wrong - including for multi-team, which has the potential of either supporting or undermining some of the business our stakeholders do with Airflow. I would like to - really - make a collaborative effort to come up with a multi-team approach together with all the stakeholders here - Amazon, Google and Astronomer especially should all be on board with it. We know our users need it (the survey and the number of talks about multi-team/tenancy submitted for the Summit this year speak for themselves - we had ~10 sessions submitted about it, and 30% of survey respondents want it - though of course, as Ash correctly pointed out, many of those people have different expectations). Again, multi-team has the potential of either killing or supporting some of the business models our stakeholders might implement in their offerings. And while here we do not "care" too much about those models, we should care about our stakeholders' sustainability - they are the ones who are fueling Airflow in many ways - so it would be stupid not to consider their expectations, their needs and - yes - the sustainability of their businesses. Here in the community we mostly add features that can be used by everyone - whether in an "as a service" or "on-prem" environment. And we cannot "know" what business is being planned, or is possible, or is good for our stakeholders. But we can collaboratively design a feature that is usable on-prem - and one that we know is good for everyone, so that they can continue doing business (or, even better, provide better offerings to their users by building on top of it).

Let's do it. If there are things we can improve/make better here, I want to hear it - from everyone - Ash, Vikram, Raj, Vincent, Rafał, Michał - if there is any idea how to improve it and make it better also for you, I think it's a good time to discuss it.

J.
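PS1. A rough sketch of what the bundle -> environment mapping could look like - purely illustrative. Today's bundle configuration has no "environment" key; that key (and everything in it) is made up for this discussion, and the classpaths/kwargs are only indicative:

    # Illustrative only: the "environment" key does NOT exist in today's
    # bundle configuration -- it is the new piece discussed in this thread.
    BUNDLE_CONFIG = [
        {
            "name": "bundle_a",
            "classpath": "airflow.providers.git.bundles.git.GitDagBundle",
            "kwargs": {"git_conn_id": "team_a_repo", "tracking_ref": "main"},
            # parse + run DAGs from this bundle in a predefined venv
            # baked into the processor/worker image
            "environment": {"python": "/venv/a/bin/python"},
        },
        {
            "name": "bundle_b",
            "classpath": "airflow.providers.git.bundles.git.GitDagBundle",
            "kwargs": {"git_conn_id": "team_b_repo", "tracking_ref": "main"},
            # ... or map the bundle to a dedicated, predefined image
            "environment": {"image": "registry.example.com/airflow-team-b:latest"},
        },
    ]

The DAG processor would pick the interpreter (or the pod/worker image) based on the bundle it is parsing, and tasks coming from that bundle would be executed with the same environment.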
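PS2. And a rough, self-contained Python sketch of the Option 2) routing - not the actual triggerer internals, just the shape of "one process and one event loop per bundle, route triggers by bundle id" (payloads and names are made up):

    # Illustrative only: one subprocess (with one event loop) per bundle;
    # the parent routes trigger payloads by bundle id.
    import asyncio
    import multiprocessing as mp

    def bundle_loop(bundle_id, queue):
        # Runs in its own (cgroup-isolatable) process, potentially started
        # with the per-bundle interpreter from the environment mapping.
        async def run_trigger(payload):
            # the real thing would deserialize and run the trigger class here
            print(f"[{bundle_id}] running trigger {payload['classpath']}")
            await asyncio.sleep(0)

        async def main():
            loop = asyncio.get_running_loop()
            while True:
                payload = await loop.run_in_executor(None, queue.get)
                if payload is None:  # shutdown sentinel
                    return
                # the real loop would schedule many of these concurrently
                await run_trigger(payload)

        asyncio.run(main())

    if __name__ == "__main__":
        queues = {b: mp.Queue() for b in ("bundle_a", "bundle_b")}
        procs = [mp.Process(target=bundle_loop, args=(b, q))
                 for b, q in queues.items()]
        for p in procs:
            p.start()
        # the only new logic vs. today's single loop: pick queue by bundle id
        queues["bundle_a"].put({"classpath": "team_a.triggers.MyTrigger"})
        queues["bundle_b"].put({"classpath": "team_b.triggers.OtherTrigger"})
        for q in queues.values():
            q.put(None)
        for p in procs:
            p.join()

Cgroup limits could then be applied per process, and each loop could run with the per-bundle environment described above.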