That looks good as a first step of moving things. Few comments and questions:
1) I guess it should be "providers/src/airflow/providers/*" to have a similar structure (also see point 2) 2) are we (currently) ignoring the fact that we have docs, tests, and some other things also in shared directories? Or do we plan to have `core/tests", "providers/tests", "core/docs", "provider/docs" as well as part of this stage of the move? I think it would make sense to move all of that together. Could be also likely staged - sources first, then tests, then docs, but generally I think we should treat all those "together". 3) Important aspect of this move is the development workflow for changes in providers. I assume "airlfow_task_sdk" will be a top-level package, not "airflow" which will make it easier (that was a major problem so far why we could not do it currently easily and why we shared a single source tree). 4) I think in order to get 3) working - we will need some kind of workspace solution (I mentioned "uv workspace" as one of the options) - to make it easy to run tests and development together (i.e. both "providers" project and "core" project should depend on "task-sdk" project and we need to bind them together (so that when new features are added in task-sdk - both core and providers can depend on them). That should be part of the POC to try it out. We will need a POC to be able to make informed decisions there, but I do not see major issues with this proposal it as long as we replace all "from airflow import *" in all providers and change to "from airflow-task-sdk import *" and have a compatibility package for Airflow 2 so that Providers will import the right classes from "airflow" in Airflow 2 (all the definitions, BaseOperator and the like). POC before we do it where people could test it is necessary and users who are mostly contributing to providers, should also test the development workflows for them (local venv and breeze). Other than that - it looks good. J. On Thu, Aug 22, 2024 at 1:39 PM Ash Berlin-Taylor <a...@apache.org> wrote: > Hi all: > > As part of the work on AIP-72 (Task Execution interface and SDK) and I’d > like to propose a slight modification to the code layout inside our main > repo. More details can be found in the slack thread < > https://apache-airflow.slack.com/archives/C06K9Q5G2UA/p1724070687438019>. > > I will discuss more details about this on the dev call later today, and > will follow up with more details on this thread too, but I wanted to get > something out before the call to at least give people a chance to have > informed discussion > > The main thinking here is that “code that is released together lives > together”. As a reminder, one of the key motivations or goals of AIP-72 > was: > > > Ensure that this interface reduces the interaction between the code > running within the Task and the rest of Airflow. This is to address > unintended ripple effects from core Airflow changes which has caused > numerous Airflow upgrade issues, because Task (i.e. DAG) code relied on > Core Airflow abstractions. This has been a common problem pointed out by > numerous Airflow users including early adopters. This proposal would enable > Airflow users to upgrade Airflow system components (Scheduler, et al), > without impacting DAG user code. > > And thinking about this was the reason for starting this discussion. In > short, user DAG code won't import anything from the apache-airflow-core > (distribution name merely a place holder here!), and it should be possible > to run a worker without this python distribution even installed. Instead > something like `apache-airflow-task-sdk`/`airflow-public-sdk` would be > installed. (And the dist name can be totally independent from what modules > DAG files imports things from. That is for a separate discussion) > > This is not as simple as just moving the code and calling it a day, a lot > of the CI and build pipelines will need updating to cope. I personally > still think this new layout is worth the effort. > > ## The providers get “extracted” from the current location to a top-level > providers/ folder: > > providers/airflow/providers/celery/provider.yaml > providers/airflow/providers/celery/__init__.py > providers/airflow/providers/cncf/kubernetes/provuder.yaml > providers/airflow/providers/cncf/kubernetes/__init__.py > > (At a later step we will look at changing the providers structure a bit > more, but that is not part of this lazy consensus as it requires some > larger changes to the CI pipelines etc) > > providers/celery/pyproject.toml > providers/celery/src/airflow/providers/celery/__init__.py > providers/cncf-kubernetetes/pyproject.toml > > providers/cncf-kubernetetes/src/airflow/providers/cncf/kubernetes/__init__.py > > ## The current Airflow directory (which contains the scheduler, web > server, worker etc components and CLI, excluding providers) gets moved to a > “core” folder: > > core/pyproject.toml/ > core/src/airflow/__init__.py > core/src/airflow/models/… > core/src/airflow/api_connexion/… > core/src/airflow/www/package.json # React UI etc. > > ## We create a new top level folder for the new Task Execution Interface > (this is the new dist for AIP-72, and it contains all the DAG > parse/execution time things outside of providers: > > task-sdk/pyproject.toml > task-sdk/src/airflow_task_sdk/__init__.py > task-sdk/src/airflow_task_sdk/defintions/dag.py > task-sdk/src/airflow_task_sdk/defintions/task_group.py > > > The split between core/ and task-sdk/ is that anything to do with parse or > execution time lives there. To give some examples, this would include: DAG, > TaskGroup, the TaskInstanceState enums/constants, all of the functions and > code to do with setting relationships between tasks. > > It would not include: anything to do with SQLAlchemy or Database models, > and likely won’t include anything to do with dependency/“runablility” > checking (that would live in core still). It is expected that core will > depend upon this dist.