Hi all:

As part of the work on AIP-72 (Task Execution interface and SDK) and I’d like 
to propose a slight modification to the code layout inside our main repo. More 
details can be found in the slack thread 
<https://apache-airflow.slack.com/archives/C06K9Q5G2UA/p1724070687438019>. 

I will discuss more details about this on the dev call later today, and will 
follow up with more details on this thread too, but I wanted to get something 
out before the call to at least give people a chance to have informed discussion

The main thinking here is that “code that is released together lives together”. 
As a reminder, one of the key motivations or goals of AIP-72 was: 

> Ensure that this interface reduces the interaction between the code running 
> within the Task and the rest of Airflow. This is to address unintended ripple 
> effects from core Airflow changes which has caused numerous Airflow upgrade 
> issues, because Task (i.e. DAG) code relied on Core Airflow abstractions. 
> This has been a common problem pointed out by numerous Airflow users 
> including early adopters. This proposal would enable Airflow users to upgrade 
> Airflow system components (Scheduler, et al), without impacting DAG user code.

And thinking about this was the reason for starting this discussion. In short, 
user DAG code won't import anything from the apache-airflow-core (distribution 
name merely a place holder here!), and it should be possible to run a worker 
without this python distribution even installed. Instead something like 
`apache-airflow-task-sdk`/`airflow-public-sdk` would be installed. (And the 
dist name can be totally independent from what modules DAG files imports things 
from. That is for a separate discussion)

This is not as simple as just moving the code and calling it a day, a lot of 
the CI and build pipelines will need updating to cope. I personally still think 
this new layout is worth the effort. 

## The providers get “extracted” from the current location to a top-level 
providers/ folder:

providers/airflow/providers/celery/provider.yaml
providers/airflow/providers/celery/__init__.py
providers/airflow/providers/cncf/kubernetes/provuder.yaml
providers/airflow/providers/cncf/kubernetes/__init__.py

(At a later step we will look at changing the providers structure a bit more, 
but that is not part of this lazy consensus as it requires some larger changes 
to the CI pipelines etc)

providers/celery/pyproject.toml
providers/celery/src/airflow/providers/celery/__init__.py
providers/cncf-kubernetetes/pyproject.toml
providers/cncf-kubernetetes/src/airflow/providers/cncf/kubernetes/__init__.py

## The current Airflow directory (which contains the scheduler, web server, 
worker etc components and CLI, excluding providers) gets moved to a “core” 
folder:

core/pyproject.toml/
core/src/airflow/__init__.py
core/src/airflow/models/…
core/src/airflow/api_connexion/…
core/src/airflow/www/package.json # React UI etc.

## We create a new top level folder for the new Task Execution Interface (this 
is the new dist for AIP-72, and it contains all the DAG parse/execution time 
things outside of providers:

task-sdk/pyproject.toml
task-sdk/src/airflow_task_sdk/__init__.py
task-sdk/src/airflow_task_sdk/defintions/dag.py
task-sdk/src/airflow_task_sdk/defintions/task_group.py


The split between core/ and task-sdk/ is that anything to do with parse or 
execution time lives there. To give some examples, this would include: DAG, 
TaskGroup, the TaskInstanceState enums/constants, all of the functions and code 
to do with setting relationships between tasks.

It would not include: anything to do with SQLAlchemy or Database models, and 
likely won’t include anything to do with dependency/“runablility” checking 
(that would live in core still). It is expected that core will depend upon this 
dist.

Reply via email to