On Mon, 7 Jul 2025, 21:51 Jarek Potiuk, <ja...@potiuk.com> wrote:
Looks good but I think we should find some better logical name for core_and_sdk :)
On Mon, 7 Jul 2025 at 21:44, Jens Scheffler <j_scheff...@gmx.de.invalid> wrote:
Cool! Especially the "shared" folder with the ability to have
N-combinations w/o exploding project repo root!
On 07.07.25 14:43, Ash Berlin-Taylor wrote:
Oh, and all of this will be explained in shared/README.md
On 7 Jul 2025, at 13:41, Ash Berlin-Taylor <a...@apache.org> wrote:
Okay, so it seems we have agreement on the approach here, so I’ll continue with this. On the dev call it was mentioned that “airflow-common” wasn’t a great name, so here is my proposal for the file structure:
```
/
  task-sdk/...
  airflow-core/...
  shared/
    kubernetes/
      pyproject.toml
      src/
        airflow_kube/__init__.py
    core-and-tasksdk/
      pyproject.toml
      src/
        airflow_shared/__init__.py
```
Things to note here: the “shared” folder has the possibility of having multiple different shared “libraries” in it. In this example I am supposing a hypothetical shared kubernetes folder, for a world in which we split the KubePodOperator and the KubeExecutor into two separate distributions (example only, not proposing we do that right now; that will be a separate discussion).
The other things to note here:
- the folder name in shared aims to be “self-documenting”, hence the verbose “core-and-tasksdk” to say where the shared library is intended to be used.
- the python module itself should almost always have an `airflow_` (or maybe `_airflow_`?) prefix so that it does not conflict with anything else we might use. It won’t matter “in production”, as those will be vendored in to be imported as `airflow/_vendor/airflow_shared` etc, but avoiding conflicts at dev time with the Finder approach is a good safety measure.
I will start making a real PR for this proposal now, but I’m open to feedback (either here, or in the PR when I open it).
-ash
On 4 Jul 2025, at 16:55, Jarek Potiuk <ja...@potiuk.com> wrote:
Yeah, we have to try it and test. Also, building packages happens semi-frequently when you run `uv sync` (they use some kind of heuristic to decide when), and you can force it with `--reinstall` or `--refresh`. A package build also happens every time you run `ci-image build` in breeze now, so it seems like it will nicely integrate into our workflows.
Looks really cool Ash.
On Fri, Jul 4, 2025 at 5:14 PM Ash Berlin-Taylor <a...@apache.org> wrote:
It’s not just release time, but any time we build a package, which happens on “every” CI run. The normal unit tests will use code from airflow-common/src/airflow_common; the kube tests, which build an image, will build the dists and vendor in the code from that commit.
There is only a single copy of the shared code committed to the repo, so there is never anything to synchronise.
On 4 Jul 2025, at 15:53, Amogh Desai <amoghdesai....@gmail.com> wrote:
Thanks Ash.
This is really cool, and it is helpful that you were able to test both scenarios -- a repo checkout, and also installing from the vendored package -- with the resolution working fine in both.
I like this idea compared to the relative-import one, for a few reasons:
- It feels like it will take some time to adjust to the new coding standard that we would have to lay down if we imposed relative imports in the shared dist
- We can continue using repo-wide absolute import standards; it is also much easier for situations where we do a global search in the IDE to find + replace, since this could mean a change there
- The vendoring approach is a proven and established paradigm across projects and would give us the build tooling we need out of the box
Nothing too strongly against the relative imports, but with the evidence provided above, the vendored approach seems to only do us good.
Regarding synchronizing it: release time should be fine as long as we have a good CI workflow to catch such issues per PR when changes are made in the shared dist (pre-commit would make it really slow, I guess). If we can run our tests with vendored code we should be mostly covered.
Good effort all!
Thanks & Regards,
Amogh Desai
On Fri, Jul 4, 2025 at 7:23 PM Ash Berlin-Taylor <a...@apache.org> wrote:
Okay, I think I’ve got something that works and I’m happy with.
https://github.com/astronomer/airflow/tree/shared-vendored-lib-tasksdk-and-core
This produces the following from `uv build task-sdk`
- https://github.com/user-attachments/files/21058976/apache_airflow_task_sdk-1.1.0.tar.gz
- https://github.com/user-attachments/files/21058996/apache_airflow_task_sdk-1.1.0-py3-none-any.whl.zip
(`.whl.zip` as GH won't allow .whl upload, but will .zip)
```
❯ unzip -l dist/apache_airflow_task_sdk-1.1.0-py3-none-any.whl.zip | grep _vendor
       50  02-02-2020 00:00   airflow/sdk/_vendor/.gitignore
     2082  02-02-2020 00:00   airflow/sdk/_vendor/__init__.py
       28  02-02-2020 00:00   airflow/sdk/_vendor/airflow_common.pyi
       18  02-02-2020 00:00   airflow/sdk/_vendor/vendor.txt
      785  02-02-2020 00:00   airflow/sdk/_vendor/airflow_common/__init__.py
    10628  02-02-2020 00:00   airflow/sdk/_vendor/airflow_common/timezone.py
```
And similarly in the .tar.gz, so our “sdist” is complete too:
```
❯ tar -tzf dist/apache_airflow_task_sdk-1.1.0.tar.gz | grep _vendor
apache_airflow_task_sdk-1.1.0/src/airflow/sdk/_vendor/.gitignore
apache_airflow_task_sdk-1.1.0/src/airflow/sdk/_vendor/__init__.py
apache_airflow_task_sdk-1.1.0/src/airflow/sdk/_vendor/airflow_common.pyi
apache_airflow_task_sdk-1.1.0/src/airflow/sdk/_vendor/vendor.txt
apache_airflow_task_sdk-1.1.0/src/airflow/sdk/_vendor/airflow_common/__init__.py
apache_airflow_task_sdk-1.1.0/src/airflow/sdk/_vendor/airflow_common/timezone.py
```
The plugin works at build time by including/copying the libs specified in vendor.txt into place (and letting `vendoring` take care of import rewrites).
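Roughly, the copying step amounts to something like this. This is an illustrative sketch only, not the actual plugin code: the vendor.txt format (one package name per line, `#` comments) and the `airflow-common/src` path are assumptions.

```python
import shutil
from pathlib import Path


def vendor_shared_libs(repo_root: Path, vendor_dir: Path) -> list[str]:
    """Copy each shared library named in vendor.txt into the _vendor directory.

    Assumes vendor.txt lists one importable package name per line;
    blank lines and '#' comments are skipped.
    """
    copied = []
    for line in (vendor_dir / "vendor.txt").read_text().splitlines():
        name = line.strip()
        if not name or name.startswith("#"):
            continue
        # Copy shared/<lib> sources into the dist's _vendor directory.
        src = repo_root / "airflow-common" / "src" / name
        dest = vendor_dir / name
        shutil.copytree(src, dest, dirs_exist_ok=True)
        copied.append(name)
    return copied
```

The real work of rewriting imports inside the copied files is then left to `vendoring`.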
For the imports to continue to work at “dev” time/from a repo checkout, I have added an import finder to `sys.meta_path`; since it’s at the end of the list, it will only be used if the normal import machinery can’t find things.
https://github.com/astronomer/airflow/blob/996817782be6071b306a87af9f36fe1cf2d3aaa3/task-sdk/src/airflow/sdk/_vendor/__init__.py
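As an illustrative sketch of the idea only (not the actual finder in the linked `__init__.py`; the `demo_vendor` name and the alias-loader shape here are made up): a `MetaPathFinder` appended to `sys.meta_path` can catch imports under the vendored prefix and load the real top-level package instead, which is why both module names end up in `sys.modules` pointing at the same object.

```python
import importlib
import importlib.abc
import importlib.util
import sys
import types

PREFIX = "demo_vendor."


class AliasLoader(importlib.abc.Loader):
    """Load `demo_vendor.X` by importing the real top-level `X` and reusing it."""

    def __init__(self, real_name: str) -> None:
        self.real_name = real_name

    def create_module(self, spec):
        # Reuse the already-importable real module object under the vendored
        # name, so both names share one entry in sys.modules. (Side effect of
        # this sketch: the real module's __spec__ gets rebound.)
        return importlib.import_module(self.real_name)

    def exec_module(self, module) -> None:
        pass  # already executed by the real import in create_module


class VendorFallbackFinder(importlib.abc.MetaPathFinder):
    def find_spec(self, fullname, path=None, target=None):
        if not fullname.startswith(PREFIX):
            return None
        real_name = fullname[len(PREFIX):]
        return importlib.util.spec_from_loader(fullname, AliasLoader(real_name))


# The "demo_vendor" parent package would normally exist on disk; fake it here.
pkg = types.ModuleType("demo_vendor")
pkg.__path__ = []  # a package with no on-disk contents
sys.modules["demo_vendor"] = pkg

# Appended, so it only fires when the normal import machinery finds nothing.
sys.meta_path.append(VendorFallbackFinder())
```

After this, `import demo_vendor.json` succeeds and `sys.modules["demo_vendor.json"]` is the same object as `sys.modules["json"]`, mirroring the dual-name behaviour described below.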
This doesn’t quite give us the same runtime “import rewriting” effect, as in this approach `airflow_common` is directly loaded (i.e. both airflow.sdk._vendor.airflow_common and airflow_common exist in sys.modules), but it does work for everything that I was able to test.
I tested it with the diff at the end of this message. My test ipython shell:
```
In [1]: from airflow.sdk._vendor.airflow_common.timezone import foo
In [2]: foo
Out[2]: 1
In [3]: import airflow.sdk._vendor.airflow_common
In [4]: import airflow.sdk._vendor.airflow_common.timezone
In [5]: airflow.sdk._vendor.airflow_common.__file__
Out[5]:
'/Users/ash/code/airflow/airflow/airflow-common/src/airflow_common/__init__.py'
In [6]: airflow.sdk._vendor.airflow_common.timezone.__file__
Out[6]:
'/Users/ash/code/airflow/airflow/airflow-common/src/airflow_common/timezone.py'
```
And in a standalone environment with the SDK dist I built (it needed the matching airflow-core right now, but that has nothing to do with this discussion):
```
❯ _AIRFLOW__AS_LIBRARY=1 uvx --python 3.12 \
    --with dist/apache_airflow_core-3.1.0-py3-none-any.whl \
    --with dist/apache_airflow_task_sdk-1.1.0-py3-none-any.whl ipython
Python 3.12.7 (main, Oct 16 2024, 07:12:08) [Clang 18.1.8 ]
Type 'copyright', 'credits' or 'license' for more information
IPython 9.4.0 -- An enhanced Interactive Python. Type '?' for help.
Tip: You can use `%hist` to view history, see the options with `%history?`
In [1]: import airflow.sdk._vendor.airflow_common.timezone
In [2]: airflow.sdk._vendor.airflow_common.timezone.__file__
Out[2]:
'/Users/ash/.cache/uv/archive-v0/WWq6r65aPto2eJOyPObEH/lib/python3.12/site-packages/airflow/sdk/_vendor/airflow_common/timezone.py'
```
```diff
diff --git a/airflow-common/src/airflow_common/__init__.py b/airflow-common/src/airflow_common/__init__.py
index 13a83393a9..927b7c6b61 100644
--- a/airflow-common/src/airflow_common/__init__.py
+++ b/airflow-common/src/airflow_common/__init__.py
@@ -14,3 +14,5 @@
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
+
+foo = 1
diff --git a/airflow-common/src/airflow_common/timezone.py b/airflow-common/src/airflow_common/timezone.py
index 340b924c66..58384ef20f 100644
--- a/airflow-common/src/airflow_common/timezone.py
+++ b/airflow-common/src/airflow_common/timezone.py
@@ -36,6 +36,9 @@ _PENDULUM3 = version.parse(metadata.version("pendulum")).major == 3
# - FixedTimezone(0, "UTC") in pendulum 2
utc = pendulum.UTC
+
+from airflow_common import foo
+
TIMEZONE: Timezone
```
On 3 Jul 2025, at 12:43, Jarek Potiuk <ja...@potiuk.com> wrote:
I think both approaches are doable:
1) -> We can very easily prevent bad imports by pre-commit when importing from different distributions, and make sure we are only doing relative imports in the shared modules. We are doing plenty of this already. And yes, it would require relative imports, which we currently do not allow.
2) -> has one disadvantage: someone at some point in time will have to decide to synchronize this, and if it happens just before a release (I bet this is going to happen) it will lead to solving problems that would normally be solved during the PR when you make a change (i.e. a symbolic link has the advantage that whoever modifies shared code will be immediately notified in their PR that they broke something, because static checks or mypy or tests fail).
Ash, do you have an idea of a process (who and when) for the synchronisation in the case of vendoring? Maybe we could solve it if it is done more frequently and with some regularity? We could potentially force re-vendoring at PR time as well, any time shared code changes (and enforce it by pre-commit). I can't think of another place (other than releases) in our development workflow, and that seems to be a bit too late, as it puts the extra effort of fixing potential incompatibilities on the release manager and delays the release. WDYT?
Re: relative imports. I think for a shared library we could potentially relax this and allow them (and actually disallow absolute imports in the pieces of code that are shared; again, enforced by pre-commit). As I recall, the only reason we forbade relative imports is because of how we are (or maybe were) doing DAG parsing and the failures resulting from it, so we decided to just not allow them, to keep consistency. The way Dag parsing works is that when you use importlib to read the Dag from a file, relative imports do not work, as the loader does not know what they should be relative to. But if a relative import is done from an imported package, it should be no problem, I think; otherwise our Dags would not be able to import any library that uses relative imports.
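A few lines show the difference (the file and module names here are made up for illustration): a file loaded from a path via importlib has no parent package, so a relative import inside it fails, while importing a package (such as json, which uses relative imports internally) from that same file works fine.

```python
import importlib.util
import pathlib
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    dag_file = pathlib.Path(tmp) / "my_dag.py"

    # Case 1: the loaded file itself uses a relative import -> fails,
    # because a file loaded from a path has no parent package.
    dag_file.write_text("from . import helpers\n")
    spec = importlib.util.spec_from_file_location("my_dag", dag_file)
    module = importlib.util.module_from_spec(spec)
    relative_import_error = None
    try:
        spec.loader.exec_module(module)
    except ImportError as exc:
        relative_import_error = str(exc)

    # Case 2: the loaded file imports a package (json) that uses relative
    # imports internally -> works, because json is imported as a package.
    dag_file.write_text("import json\nVALUE = json.dumps([1])\n")
    spec = importlib.util.spec_from_file_location("my_dag_2", dag_file)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    package_internal_value = module.VALUE
```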
Of course, consistency might be the reason why we do not want to introduce relative imports. I don't see it as an issue if it is guarded by pre-commit, though.
J.
On Thu, 3 Jul 2025 at 12:11, Ash Berlin-Taylor <a...@apache.org> wrote:
Oh yes, symlinks will work, with one big caveat: it does mean you can’t use absolute imports from one common module to another.
For example
https://github.com/apache/airflow/blob/4c66ebd06/airflow-core/src/airflow/utils/serve_logs.py#L41
where we have
```
from airflow.utils.module_loading import import_string
```
if we want to move serve_logs into this common lib that is then symlinked, then we wouldn’t be able to have `from airflow_common.module_loading import import_string`.
I can think of two possible solutions here.
1) is to allow/require relative imports in this shared lib, so `from .module_loading import import_string`
2) is to use `vendoring`[1] (from the pip maintainers), which will handle import-rewriting for us.
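To give a flavour of what “import-rewriting” means (a rough sketch only; the real `vendoring` tool handles far more cases, and this is not its actual implementation): absolute imports of the shared lib get rewritten to point at its vendored location, while keeping the same names bound.

```python
import re

# Hypothetical target prefix for this sketch.
VENDORED_PREFIX = "airflow.sdk._vendor"


def rewrite_imports(source: str, lib: str) -> str:
    # `import airflow_common` -> `from airflow.sdk._vendor import airflow_common`
    # (keeps the name `airflow_common` bound, as before the rewrite)
    source = re.sub(
        rf"^import {lib}$",
        f"from {VENDORED_PREFIX} import {lib}",
        source,
        flags=re.MULTILINE,
    )
    # `from airflow_common[.sub] import y`
    #   -> `from airflow.sdk._vendor.airflow_common[.sub] import y`
    source = re.sub(
        rf"^from {lib}(\.|\s)",
        rf"from {VENDORED_PREFIX}.{lib}\1",
        source,
        flags=re.MULTILINE,
    )
    return source
```

A regex pass like this is fragile (it ignores multi-line imports, strings, comments), which is exactly why delegating to `vendoring` is attractive.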
I’d entirely forgotten that symlinks in repos were a thing, so I prepared a minimal POC/demo of what the vendoring approach could look like here:
https://github.com/apache/airflow/commit/996817782be6071b306a87af9f36fe1cf2d3aaa3
Now personally I am more than happy with relative imports, but generally as a project we have avoided them, so I think that limits what we could do with a symlink-based approach.
-ash
[1] https://github.com/pradyunsg/vendoring
On 3 Jul 2025, at 10:30, Pavankumar Gopidesu <gopidesupa...@gmail.com> wrote:
Thanks Ash
Yes, I agree: option 2 would be preferred for me, making sure we have all the guardrails to protect against any unwanted behaviour in code sharing and that we execute the right set of tests between the packages.
Agree with others: option 2 would be my preference.
On Thu, Jul 3, 2025 at 10:02 AM Amogh Desai <amoghdesai....@gmail.com> wrote:
Thanks for starting this discussion, Ash.
I would prefer option 2 here, with proper tooling to handle the code duplication at *release* time. It is best to have a dist that has all it needs in itself.
Option 1 could very quickly get out of hand: if we decide to separate the triggerer / dag processor / config etc. into separate packages, back-compat is going to be a nightmare and will bite us harder than we anticipate.
Thanks & Regards,
Amogh Desai
On Thu, Jul 3, 2025 at 1:12 AM Kaxil Naik <kaxiln...@gmail.com> wrote:
I prefer Option 2 as well to avoid matrix of dependencies
On Thu, 3 Jul 2025 at 01:03, Jens Scheffler <j_scheff...@gmx.de.invalid> wrote:
I'd also rather prefer option 2. The reason is that it is rather pragmatic: we do not need to cut another package, and we end up with fewer packages and dependencies.
I remember some time ago I was checking (together with Jarek, I am not sure anymore...) whether the usage of symlinks would be possible: to keep the source in one package but "symlink" it into another. If then, at the point of packaging/release, the files are materialized, we have one set of code. Otherwise, if that is not possible, the redundancy could still be solved by a pre-commit hook; and in Git the files are de-duplicated anyway based on content hash, so this does not hurt.
On 02.07.25 18:49, Shahar Epstein wrote:
I support option 2 with proper automation & CI; the reasoning you've shown for it makes sense to me.
Shahar
On Wed, Jul 2, 2025 at 3:36 PM Ash Berlin-Taylor <a...@apache.org> wrote:
Hello everyone,
As we work on finishing off the code-level separation of Task SDK and Core (scheduler etc), we have come across some situations where we would like to share code between these.
However, it’s not as straightforward as “just put it in a common dist they both depend upon”, because one of the goals of the Task SDK separation was to have 100% complete version independence between the two, ideally even if they are built into the same image and venv. Most of the reason why this isn’t straightforward comes down to backwards compatibility if we make a change to the common/shared distribution.
We’ve listed the options we have thought about in https://github.com/apache/airflow/issues/51545 (but that covers some more things that I don’t want to get into in this discussion, such as possibly separating operators and executors out of a single provider dist.)
To give a concrete example of some code I would like to share:
https://github.com/apache/airflow/blob/84897570bf7e438afb157ba4700768ea74824295/airflow-core/src/airflow/_logging/structlog.py
— the logging config. Another thing we will want to share is the AirflowConfigParser class from airflow.configuration (but notably: only the parser class, _not_ the default config values; again, let’s not dwell on the specifics of that).
So to bring the options listed in the issue here for discussion, broadly speaking there are two high-level approaches:
1. A single shared distribution
2. No shared package; copy/duplicate the code
The advantage of Approach 1 is that we only have the code in one place. However for me, at least in this specific case of the logging config or the AirflowConfigParser class, backwards compatibility is much, much harder.
The main advantage of Approach 2 is that the code is released with/embedded in the dist (i.e. apache-airflow-task-sdk would contain the right version of the logging config and ConfigParser etc). The downside is that either the code will need to be duplicated in the repo, or, better yet, it would live in a single place in the repo but some tooling (TBD) would automatically handle the duplication, either at commit time or, my preference, at release time.
For this kind of shared “utility” code I am very strongly leaning towards option 2 with automation, as otherwise I think the backwards compatibility requirements would make it unworkable (very quickly, over time, the combinations we would have to test would just become unreasonable), and I don’t feel confident we can keep things as stable as we would need to really deliver the version separation/independence I want to deliver with AIP-72.
So unless someone feels very strongly about this, I will come up with a draft PR for further discussion that will implement code sharing via “vendoring” at build time. I have an idea of how I can achieve this so that we have a single version in the repo and it’ll work there, but we vendor it into the shipped dist so it lives at something like `airflow.sdk._vendor` etc.
In terms of repo layout, this likely means we would end up with:
airflow-core/pyproject.toml
airflow-core/src/
airflow-core/tests/
task-sdk/pyproject.toml
task-sdk/src/
task-sdk/tests/
airflow-common/src
airflow-common/tests/
# Possibly no airflow-common/pyproject.toml, as deps would be included in the downstream projects. TBD.
Thoughts and feedback welcomed.
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@airflow.apache.org
For additional commands, e-mail: dev-h...@airflow.apache.org