Okay, it seems we have agreement on the approach here, so I’ll continue with it. On the dev call it was mentioned that “airflow-common” wasn’t a great name, so here is my proposal for the file structure:
```
/
  task-sdk/...
  airflow-core/...
  shared/
    kubernetes/
      pyproject.toml
      src/
        airflow_kube/__init__.py
    core-and-tasksdk/
      pyproject.toml
      src/
        airflow_shared/__init__.py
```

Things to note here: the “shared” folder can (potentially) hold multiple different shared “libraries” in it. In this example I am supposing a hypothetical shared kubernetes folder, in a world where we split the KubePodOperator and the KubeExecutor into two separate distributions (example only, I am not proposing we do that right now, and that would be a separate discussion).

The other things to note here:
- the folder name in shared aims to be “self-documenting”, hence the verbose “core-and-tasksdk” to say where the shared library is intended to be used.
- the python module itself should almost always have an `airflow_` (or maybe `_airflow_`?) prefix so that it does not conflict with anything else we might use. It won’t matter “in production” as those will be vendored in and imported as `airflow/_vendor/airflow_shared` etc, but avoiding conflicts at dev time with the Finder approach is a good safety measure (a rough sketch of that finder follows below).
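To make the Finder idea concrete, here is a minimal sketch of what such a dev-time fallback finder could look like. This is illustrative only — it uses the `airflow.sdk._vendor` prefix from the existing branch (whatever prefix we settle on, the shape is the same), it assumes `shared/core-and-tasksdk/src` is importable in a dev checkout (e.g. via the workspace/editable install), and the real implementation in my branch (linked in the quoted mail below) differs in detail, e.g. it ends up with the vendored and un-vendored names sharing entries in `sys.modules`:

```python
# Sketch only: fall back to the in-repo shared sources when there is no
# vendored copy (i.e. during development from a repo checkout).
import importlib.util
import sys
from importlib.abc import MetaPathFinder

_PREFIX = "airflow.sdk._vendor."


class _DevVendorFinder(MetaPathFinder):
    def find_spec(self, fullname, path=None, target=None):
        if not fullname.startswith(_PREFIX):
            return None
        real_name = fullname[len(_PREFIX):]  # e.g. "airflow_shared.timezone"
        real_spec = importlib.util.find_spec(real_name)
        if real_spec is None or real_spec.origin is None:
            return None
        # Load the real source file, but register it under the vendored name.
        return importlib.util.spec_from_file_location(
            fullname,
            real_spec.origin,
            submodule_search_locations=real_spec.submodule_search_locations,
        )


# Appended, so it is only consulted when the normal import machinery fails.
sys.meta_path.append(_DevVendorFinder())
```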
I will start making a real PR for this proposal now, but I’m open to feedback (either here, or in the PR when I open it).

-ash

> On 4 Jul 2025, at 16:55, Jarek Potiuk <ja...@potiuk.com> wrote:
>
> Yeah, we have to try it and test - also, building packages happens semi-frequently when you run `uv sync` (they use some kind of heuristics to decide when), and you can force it with `--reinstall` or `--refresh`. Package build also happens every time you run `ci-image build` now in breeze, so it seems like it will nicely integrate into our workflows.
>
> Looks really cool Ash.
>
> On Fri, Jul 4, 2025 at 5:14 PM Ash Berlin-Taylor <a...@apache.org> wrote:
>
>> It’s not just release time, but any time we build a package, which happens on “every” CI run. The normal unit tests will use code from airflow-common/src/airflow_common; the kube tests which build an image will build the dists and vendor in the code from that commit.
>>
>> There is only a single copy of the shared code committed to the repo, so there is never anything to synchronise.
>>
>>> On 4 Jul 2025, at 15:53, Amogh Desai <amoghdesai....@gmail.com> wrote:
>>>
>>> Thanks Ash.
>>>
>>> This is really cool and helpful that you were able to test both scenarios -- repo checkout and also installing from the vendored package - and the resolution worked fine too.
>>>
>>> I like this idea compared to the relative import one for a few reasons:
>>> - It feels like it will take some time to adjust to the new coding standard that we would lay down if we impose relative imports in the shared dist
>>> - We can continue using repo-wise absolute import standards; it is also much easier for situations where we do a global search in the IDE to find + replace, and relative imports could mean a change there
>>> - The vendoring approach is a proven and established paradigm across projects and would also give us the build tooling we need out of the box
>>>
>>> Nothing too much against the relative imports, but with the evidence provided above, the vendored approach seems to only do us good.
>>>
>>> Regarding synchronizing it, release time should be fine as long as we have a good CI workflow to catch such issues per PR if changes are made in the shared dist? (pre-commit would make it really slow, I guess)
>>>
>>> If we can run our tests with vendored code we should be mostly covered.
>>>
>>> Good effort all!
>>>
>>> Thanks & Regards,
>>> Amogh Desai
>>>
>>>
>>>> On Fri, Jul 4, 2025 at 7:23 PM Ash Berlin-Taylor <a...@apache.org> wrote:
>>>>
>>>> Okay, I think I’ve got something that works and that I’m happy with.
>>>>
>>>> https://github.com/astronomer/airflow/tree/shared-vendored-lib-tasksdk-and-core
>>>>
>>>> This produces the following from `uv build task-sdk`:
>>>> - https://github.com/user-attachments/files/21058976/apache_airflow_task_sdk-1.1.0.tar.gz
>>>> - https://github.com/user-attachments/files/21058996/apache_airflow_task_sdk-1.1.0-py3-none-any.whl.zip
>>>>
>>>> (`.whl.zip` as GH won't allow a .whl upload, but will allow .zip)
>>>>
>>>> ```
>>>> ❯ unzip -l dist/apache_airflow_task_sdk-1.1.0-py3-none-any.whl.zip | grep _vendor
>>>>        50  02-02-2020 00:00   airflow/sdk/_vendor/.gitignore
>>>>      2082  02-02-2020 00:00   airflow/sdk/_vendor/__init__.py
>>>>        28  02-02-2020 00:00   airflow/sdk/_vendor/airflow_common.pyi
>>>>        18  02-02-2020 00:00   airflow/sdk/_vendor/vendor.txt
>>>>       785  02-02-2020 00:00   airflow/sdk/_vendor/airflow_common/__init__.py
>>>>     10628  02-02-2020 00:00   airflow/sdk/_vendor/airflow_common/timezone.py
>>>> ```
>>>>
>>>> And similarly in the .tar.gz, so our “sdist” is complete too:
>>>> ```
>>>> ❯ tar -tzf dist/apache_airflow_task_sdk-1.1.0.tar.gz | grep _vendor
>>>> apache_airflow_task_sdk-1.1.0/src/airflow/sdk/_vendor/.gitignore
>>>> apache_airflow_task_sdk-1.1.0/src/airflow/sdk/_vendor/__init__.py
>>>> apache_airflow_task_sdk-1.1.0/src/airflow/sdk/_vendor/airflow_common.pyi
>>>> apache_airflow_task_sdk-1.1.0/src/airflow/sdk/_vendor/vendor.txt
>>>> apache_airflow_task_sdk-1.1.0/src/airflow/sdk/_vendor/airflow_common/__init__.py
>>>> apache_airflow_task_sdk-1.1.0/src/airflow/sdk/_vendor/airflow_common/timezone.py
>>>> ```
>>>>
>>>> The plugin works at build time by including/copying the libs specified in vendor.txt into place (and letting `vendoring` take care of import rewrites).
>>>>
>>>> For the imports to continue to work at “dev” time/from a repo checkout, I have added an import finder to `sys.meta_path`, and since it’s at the end of the list it will only be used if the normal import can’t find things.
>>>>
>>>> https://github.com/astronomer/airflow/blob/996817782be6071b306a87af9f36fe1cf2d3aaa3/task-sdk/src/airflow/sdk/_vendor/__init__.py
>>>>
>>>> This doesn’t quite give us the same runtime “import rewriting” effect, as in this approach `airflow_common` is directly loaded (i.e. airflow.sdk._vendor.airflow_common and airflow_common both exist in sys.modules), but it does work for everything that I was able to test.
>>>>
>>>> I tested it with the diff at the end of this message.
>>>> My test ipython shell:
>>>>
>>>> ```
>>>> In [1]: from airflow.sdk._vendor.airflow_common.timezone import foo
>>>>
>>>> In [2]: foo
>>>> Out[2]: 1
>>>>
>>>> In [3]: import airflow.sdk._vendor.airflow_common
>>>>
>>>> In [4]: import airflow.sdk._vendor.airflow_common.timezone
>>>>
>>>> In [5]: airflow.sdk._vendor.airflow_common.__file__
>>>> Out[5]: '/Users/ash/code/airflow/airflow/airflow-common/src/airflow_common/__init__.py'
>>>>
>>>> In [6]: airflow.sdk._vendor.airflow_common.timezone.__file__
>>>> Out[6]: '/Users/ash/code/airflow/airflow/airflow-common/src/airflow_common/timezone.py'
>>>> ```
>>>>
>>>> And in a standalone environment with the SDK dist I built (it needed the matching airflow-core right now, but that is nothing to do with this discussion):
>>>>
>>>> ```
>>>> ❯ _AIRFLOW__AS_LIBRARY=1 uvx --python 3.12 --with dist/apache_airflow_core-3.1.0-py3-none-any.whl --with dist/apache_airflow_task_sdk-1.1.0-py3-none-any.whl ipython
>>>> Python 3.12.7 (main, Oct 16 2024, 07:12:08) [Clang 18.1.8 ]
>>>> Type 'copyright', 'credits' or 'license' for more information
>>>> IPython 9.4.0 -- An enhanced Interactive Python. Type '?' for help.
>>>> Tip: You can use `%hist` to view history, see the options with `%history?`
>>>>
>>>> In [1]: import airflow.sdk._vendor.airflow_common.timezone
>>>>
>>>> In [2]: airflow.sdk._vendor.airflow_common.timezone.__file__
>>>> Out[2]: '/Users/ash/.cache/uv/archive-v0/WWq6r65aPto2eJOyPObEH/lib/python3.12/site-packages/airflow/sdk/_vendor/airflow_common/timezone.py'
>>>> ```
>>>>
>>>> ```diff
>>>> diff --git a/airflow-common/src/airflow_common/__init__.py b/airflow-common/src/airflow_common/__init__.py
>>>> index 13a83393a9..927b7c6b61 100644
>>>> --- a/airflow-common/src/airflow_common/__init__.py
>>>> +++ b/airflow-common/src/airflow_common/__init__.py
>>>> @@ -14,3 +14,5 @@
>>>>  # KIND, either express or implied. See the License for the
>>>>  # specific language governing permissions and limitations
>>>>  # under the License.
>>>> +
>>>> +foo = 1
>>>> diff --git a/airflow-common/src/airflow_common/timezone.py b/airflow-common/src/airflow_common/timezone.py
>>>> index 340b924c66..58384ef20f 100644
>>>> --- a/airflow-common/src/airflow_common/timezone.py
>>>> +++ b/airflow-common/src/airflow_common/timezone.py
>>>> @@ -36,6 +36,9 @@ _PENDULUM3 = version.parse(metadata.version("pendulum")).major == 3
>>>>  # - FixedTimezone(0, "UTC") in pendulum 2
>>>>  utc = pendulum.UTC
>>>>
>>>> +
>>>> +from airflow_common import foo
>>>> +
>>>>  TIMEZONE: Timezone
>>>> ```
>>>>
>>>>> On 3 Jul 2025, at 12:43, Jarek Potiuk <ja...@potiuk.com> wrote:
>>>>>
>>>>> I think both approaches are doable:
>>>>>
>>>>> 1) -> We can very easily prevent bad imports by pre-commit when importing from different distributions and make sure we are only doing relative imports in the shared modules. We are doing plenty of this already. And yes, it would require relative links, which we currently do not allow.
>>>>>
>>>>> 2) -> has one disadvantage: someone at some point in time will have to decide to synchronize this, and if it happens just before a release (I bet this is going to happen) it will lead to solving problems that would normally be solved during the PR when you make a change (i.e. the symbolic link has the advantage that whoever modifies shared code will be immediately notified in their PR that they broke something, because either static checks or mypy or tests fail).
>>>>>
>>>>> Ash, do you have an idea of a process (who and when) for doing the synchronisation in the case of vendoring? Maybe we could solve it if it is done more frequently and with some regularity? We could potentially force re-vendoring at PR time as well, any time shared code changes (and enforce it by pre-commit). I can't think of another place (other than releases) in our development workflow, and that seems to be a bit too late, as it puts the extra effort of fixing potential incompatibilities on the release manager and delays the release. WDYT?
>>>>>
>>>>> Re: relative links. I think for a shared library we could potentially relax this and allow them (and actually disallow absolute links in the pieces of code that are shared - again, by pre-commit). As I recall, the only reason we forbade the relative links is because of how we are (or maybe were) doing DAG parsing and the failures resulting from it. So we decided to just not allow them, to keep consistency. The way Dag parsing works is that when you are using importlib to read the Dag from a file, relative imports do not work, as it does not know what they should be relative to. But if the relative import is done from an imported package, it should be no problem, I think - otherwise our Dags would not be able to import any library that uses relative imports.
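(To make the Dag-parsing point above concrete, a tiny, purely illustrative snippet — the file and helper names are made up — showing why a relative import fails in a file loaded directly via importlib, while packages that use relative imports internally import just fine:)

```python
# Illustration only: a file loaded directly from disk has no parent package,
# so a relative import inside it cannot be resolved.
import importlib.util
import pathlib
import tempfile

dag_file = pathlib.Path(tempfile.mkdtemp()) / "example_dag.py"
dag_file.write_text("from . import helpers\n")  # a relative import in the "DAG file"

spec = importlib.util.spec_from_file_location("example_dag", dag_file)
module = importlib.util.module_from_spec(spec)
try:
    spec.loader.exec_module(module)
except ImportError as err:
    print(err)  # attempted relative import with no known parent package

# By contrast, importing a shared package (e.g. `import airflow_shared.timezone`)
# from the same DAG file works even if that package uses relative imports
# internally, because those are resolved relative to the imported package,
# not to the DAG file.
```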
>>>>>
>>>>> Of course consistency might be the reason why we do not want to introduce relative imports. I don't see it as an issue if it is guarded by pre-commit though.
>>>>>
>>>>> J.
>>>>>
>>>>> On Thu, 3 Jul 2025 at 12:11, Ash Berlin-Taylor <a...@apache.org> wrote:
>>>>>
>>>>>> Oh yes, symlinks will work, with one big caveat: it does mean you can’t use absolute imports from one common module to another.
>>>>>>
>>>>>> For example
>>>>>> https://github.com/apache/airflow/blob/4c66ebd06/airflow-core/src/airflow/utils/serve_logs.py#L41
>>>>>> where we have
>>>>>>
>>>>>> ```
>>>>>> from airflow.utils.module_loading import import_string
>>>>>> ```
>>>>>>
>>>>>> if we want to move serve_logs into this common lib that is then symlinked, then we wouldn’t be able to have `from airflow_common.module_loading import import_string`.
>>>>>>
>>>>>> I can think of two possible solutions here.
>>>>>>
>>>>>> 1) is to allow/require relative imports in this shared lib, so `from .module_loading import import_string`
>>>>>> 2) is to use `vendoring`[1] (from the pip maintainers) which will handle import-rewriting for us.
>>>>>>
>>>>>> I’d entirely forgotten that symlinks in repos were a thing, so I prepared a minimal POC/demo of what the vendoring approach could look like here:
>>>>>> https://github.com/apache/airflow/commit/996817782be6071b306a87af9f36fe1cf2d3aaa3
>>>>>>
>>>>>> Now personally I am more than happy with relative imports, but generally as a project we have avoided them, so I think that limits what we could do with a symlink based approach.
>>>>>>
>>>>>> -ash
>>>>>>
>>>>>> [1] https://github.com/pradyunsg/vendoring
>>>>>>
>>>>>>> On 3 Jul 2025, at 10:30, Pavankumar Gopidesu <gopidesupa...@gmail.com> wrote:
>>>>>>>
>>>>>>> Thanks Ash
>>>>>>>
>>>>>>> Yes, agree, option 2 would be preferred for me. Making sure we have all the guardrails to protect against any unwanted behaviour in code sharing and to execute the right tests between the packages.
>>>>>>>
>>>>>>> Agree with others, option 2 would be
>>>>>>>
>>>>>>> On Thu, Jul 3, 2025 at 10:02 AM Amogh Desai <amoghdesai....@gmail.com> wrote:
>>>>>>>
>>>>>>>> Thanks for starting this discussion, Ash.
>>>>>>>>
>>>>>>>> I would prefer option 2 here with proper tooling to handle the code duplication at *release* time.
>>>>>>>> It is best to have a dist that has all it needs in itself.
>>>>>>>>
>>>>>>>> Option 1 could very quickly get out of hand, and if we decide to separate triggerer / dag processor / config etc. as separate packages, back compat is going to be a nightmare and will bite us harder than we anticipate.
>>>>>>>>
>>>>>>>> Thanks & Regards,
>>>>>>>> Amogh Desai
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Jul 3, 2025 at 1:12 AM Kaxil Naik <kaxiln...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> I prefer Option 2 as well, to avoid a matrix of dependencies
>>>>>>>>>
>>>>>>>>> On Thu, 3 Jul 2025 at 01:03, Jens Scheffler <j_scheff...@gmx.de.invalid> wrote:
>>>>>>>>>
>>>>>>>>>> I'd also rather prefer option 2 - the reason here is that it is rather pragmatic and we do not need to cut another package, so we have fewer packages and dependencies.
>>>>>>>>>>
>>>>>>>>>> I remember some time ago I was checking (together with Jarek, I am not sure anymore...) if the usage of symlinks would be possible. To keep the source in one package but "symlink" it into another. If then at the point of packaging/release the files are materialized, we have one set of code.
>>>>>>>>>>
>>>>>>>>>> Otherwise, if that is not possible, the redundancy could still be solved by a pre-commit hook - and in Git the files are de-duplicated anyway based on content hash, so this does not hurt.
>>>>>>>>>>
>>>>>>>>>> On 02.07.25 18:49, Shahar Epstein wrote:
>>>>>>>>>>> I support option 2 with proper automation & CI - the reasonings you've shown for that make sense to me.
>>>>>>>>>>>
>>>>>>>>>>> Shahar
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Jul 2, 2025 at 3:36 PM Ash Berlin-Taylor <a...@apache.org> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hello everyone,
>>>>>>>>>>>>
>>>>>>>>>>>> As we work on finishing off the code-level separation of Task SDK and Core (scheduler etc) we have come across some situations where we would like to share code between these.
>>>>>>>>>>>>
>>>>>>>>>>>> However it’s not as straightforward as “just put it in a common dist they both depend upon”, because one of the goals of the Task SDK separation was to have 100% complete version independence between the two, ideally even if they are built into the same image and venv.
>>>>>>>>>>>> Most of the reason why this isn’t straightforward comes down to backwards compatibility - if we make a change to the common/shared distribution
>>>>>>>>>>>>
>>>>>>>>>>>> We’ve listed the options we have thought about in https://github.com/apache/airflow/issues/51545 (but that covers some more things that I don’t want to get into in this discussion, such as possibly separating operators and executors out of a single provider dist.)
>>>>>>>>>>>>
>>>>>>>>>>>> To give a concrete example of some code I would like to share:
>>>>>>>>>>>> https://github.com/apache/airflow/blob/84897570bf7e438afb157ba4700768ea74824295/airflow-core/src/airflow/_logging/structlog.py
>>>>>>>>>>>> — logging config. Another thing we will want to share will be the AirflowConfigParser class from airflow.configuration (but notably: only the parser class, _not_ the default config values - again, let’s not dwell on the specifics of that).
>>>>>>>>>>>>
>>>>>>>>>>>> So to bring the options listed in the issue here for discussion, broadly speaking there are two high-level approaches:
>>>>>>>>>>>>
>>>>>>>>>>>> 1. A single shared distribution
>>>>>>>>>>>> 2. No shared package and copy/duplicate code
>>>>>>>>>>>>
>>>>>>>>>>>> The advantage of Approach 1 is that we only have the code in one place. However, for me, at least in this specific case of the logging config or the AirflowConfigParser class, backwards compatibility becomes much, much harder.
>>>>>>>>>>>>
>>>>>>>>>>>> The main advantage of Approach 2 is that the code is released with/embedded in the dist (i.e. apache-airflow-task-sdk would contain the right version of the logging config and ConfigParser etc). The downside is that either the code will need to be duplicated in the repo, or better yet it would live in a single place in the repo, but some tooling (TBD) will automatically handle the duplication, either at commit time or, my preference, at release time.
>>>>>>>>>>>>
>>>>>>>>>>>> For this kind of shared “utility” code I am very strongly leaning towards option 2 with automation, as otherwise I think the backwards compatibility requirements would make it unworkable (very quickly over time the combinations we would have to test would just become unreasonable), and I don’t feel confident we can keep things as stable as we would need to really deliver the version separation/independence I want to deliver with AIP-72.
>>>>>>>>>>>>
>>>>>>>>>>>> So unless someone feels very strongly about this, I will come up with a draft PR for further discussion that will implement code sharing via “vendoring” it at build time. I have an idea of how I can achieve this so that we have a single version in the repo and it’ll work there, but at runtime we vendor it into the shipped dist so it lives at something like `airflow.sdk._vendor` etc.
>>>>>>>>>>>>
>>>>>>>>>>>> In terms of repo layout, this likely means we would end up with:
>>>>>>>>>>>>
>>>>>>>>>>>> airflow-core/pyproject.toml
>>>>>>>>>>>> airflow-core/src/
>>>>>>>>>>>> airflow-core/tests/
>>>>>>>>>>>> task-sdk/pyproject.toml
>>>>>>>>>>>> task-sdk/src/
>>>>>>>>>>>> task-sdk/tests/
>>>>>>>>>>>> airflow-common/src
>>>>>>>>>>>> airflow-common/tests/
>>>>>>>>>>>> # Possibly no airflow-common/pyproject.toml, as deps would be included in the downstream projects. TBD.
>>>>>>>>>>>>
>>>>>>>>>>>> Thoughts and feedback welcomed.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@airflow.apache.org
For additional commands, e-mail: dev-h...@airflow.apache.org