Thanks Jarek for starting the discussion. I needed to think about it (and
was busy all day), and by the time I read the thread it had already
evolved with a lot of detail!
I assume we also need 1-2 PoCs to work out how to do this: either with
decorators (like @requires[ORM] to signal which dependency is needed,
although this would have to be applied in many places in the codebase...
so a bit un-cool), or by initializing very selectively (based on the
context of execution, as sketched by Jarek below). Maybe still in
conjunction with lazy loading, because we also waste a lot of time per
task just to get a few lines of code going. For example, calling the
classic CLI alone takes 2-3 seconds until --help responds.
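
To make the decorator idea a bit more concrete, here is a minimal sketch of what such a PoC could look like. Everything in it is hypothetical (the `initializer` and `requires` helpers, the subsystem names, the example commands); it uses call syntax `@requires("orm")` rather than `@requires[ORM]`, and it is not existing Airflow code.

```python
import functools

# Hypothetical registry: subsystem name -> initializer. Not an Airflow API.
_INITIALIZERS = {}
_INITIALIZED = set()
INIT_LOG = []  # records the order in which subsystems were initialized


def initializer(name):
    """Register a function as the initializer of a named subsystem."""
    def register(func):
        _INITIALIZERS[name] = func
        return func
    return register


def requires(*names):
    """Initialize the named subsystems the first time the function is called."""
    def decorate(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for name in names:
                if name not in _INITIALIZED:
                    _INITIALIZERS[name]()
                    _INITIALIZED.add(name)
            return func(*args, **kwargs)
        return wrapper
    return decorate


@initializer("config")
def _init_config():
    INIT_LOG.append("config")  # stand-in for real configuration loading


@initializer("orm")
def _init_orm():
    INIT_LOG.append("orm")  # stand-in for real ORM setup


@requires("config", "orm")
def list_dags():
    return "dags listed"


@requires("config")
def show_version():
    return "version shown"
```

In this sketch nothing is initialized at import time; `show_version()` only triggers "config", and a later `list_dags()` adds "orm" without repeating "config".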
I would prefer not to move packages in the codebase. But we need to cut
down what is really loaded - maybe py-spy can also help. And I assume
some extensive thinking is needed. It would also help to check which
code has which dependency - you call it the cyclic dependency - which I
think we have and need to cut off.
Yesterday I thought it might be a breaking change. I agree with Jarek
that much of it may be non-breaking internally, but in the end the
interface between providers and core might need to change - e.g.
initialize the providers manager only from YAML information instead of
loading/importing all classes when plugins_manager starts.
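
As a rough illustration of "metadata first, import later": the sketch below keeps only class paths (as they might come out of a provider YAML file) and imports a class the first time it is requested. The `LazyProviderRegistry` name and the `PROVIDER_INFO` structure are made up for this example, and the class paths point at stdlib classes only so that it runs anywhere.

```python
import importlib

# Hypothetical metadata, as it might look after parsing a provider YAML file.
PROVIDER_INFO = {
    "hooks": {
        "ordered": "collections.OrderedDict",
        "fraction": "fractions.Fraction",
    }
}


class LazyProviderRegistry:
    """Knows what exists from metadata; imports classes only on demand."""

    def __init__(self, info):
        self._paths = dict(info["hooks"])
        self._loaded = {}

    def available(self):
        """Names known from metadata alone; nothing is imported here."""
        return sorted(self._paths)

    def get_class(self, name):
        """Import the class behind `name` on first use and cache it."""
        if name not in self._loaded:
            module_path, _, class_name = self._paths[name].rpartition(".")
            module = importlib.import_module(module_path)
            self._loaded[name] = getattr(module, class_name)
        return self._loaded[name]

    def loaded(self):
        """Names whose classes have actually been imported so far."""
        return sorted(self._loaded)
```

Listing what is available costs no imports here; only `get_class()` pays for the one class that is actually used.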
I'd also be supportive and would like to engage. Realistically we could
have a proper plan for 3.2. Unfortunately I am pretty busy these days, so
if somebody wants to take the lead that would be cool!
Jens
P.S.: Does this mean the bug fix in #52952 is blocked until this is
resolved, or can we agree to keep it as an intermediate solution until
proper support is in place... and support for 3.1 is potentially dropped
in edge3? Until then I assume there is no better way than the ugly code...
On 07.07.25 16:27, Amogh Desai wrote:
+1 for doing this.
I recently went through some pain along similar veins and came to the
conclusion that `import airflow` does a lot!
Not a concrete plan, but starting to investigate airflow/__init__.py and
all the other __init__ files to see what is being initialised (config,
ORM, logging, etc.) would probably be a good starting point.
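
One cheap way to start such an investigation could be to list which modules a fresh interpreter loads for a given import. This is only a sketch: the helper name `modules_pulled_in_by` is made up, and it shows what gets imported, not which side effects (config, ORM, logging setup) actually run.

```python
import json
import subprocess
import sys

# Code run in a child interpreter: snapshot sys.modules, import, print the diff.
_PROBE = """
import json, sys
before = set(sys.modules)
import {module}
print(json.dumps(sorted(set(sys.modules) - before)))
"""


def modules_pulled_in_by(module):
    """Return the modules a fresh interpreter loads when importing `module`."""
    result = subprocess.run(
        [sys.executable, "-c", _PROBE.format(module=module)],
        capture_output=True,
        text=True,
        check=True,
    )
    # Only the last stdout line is parsed, in case the import itself prints.
    return json.loads(result.stdout.strip().splitlines()[-1])
```

For example, `len(modules_pulled_in_by("airflow"))` in an Airflow venv would give a first impression of the size of the problem; `python -X importtime -c "import airflow"` gives per-module timings on top of that.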
We do have decent initialisation code, but it is scattered. In my opinion
we should probably have an `airflow/initialization` module (or similar)
with utils to do the hard work:
- config
- orm
- logging
- plugins etc
Then start hunting down one CLI at a time :). Easier said than done though!
Thanks & Regards,
Amogh Desai
On Mon, Jul 7, 2025 at 6:41 PM Jarek Potiuk<ja...@potiuk.com> wrote:
This might be related, or it might not be, but I think I would also love
it if we moved all of “core” (scheduler, jobs, api server etc) to
airflow_core.* python modules , and out of `airflow.*` entirely. (Meaning
`airflow` would be left for just `airflow.sdk` and `airflow.providers`,
plus some compat shims, possibly installed by apache-airflow-task-sdk
itself). Were you thinking something similar?
I do not have exact details yet, it's more about "changing the philosophy
of initialisation". I think it would need some POC to come up with the
details. Unfortunately such a POC will require quite an investment, and
once done it would be almost complete, as there are so many intertwined
things in our initialization that you only find out stuff after you move
things :). That's my experience from previous attempts. Usually it
started with "hey, I can move this and that here and we will be good",
but after doing it, it turned out that other parts also had to be
touched, and it caused an avalanche of changes ripping through almost
the whole codebase (to the point that I gave up).
But yes that might be one of the ways to achieve that. I am all for trying
it and seeing how it might work out.
J
On Mon, Jul 7, 2025 at 2:49 PM Ash Berlin-Taylor<a...@apache.org> wrote:
Yeah, this has been a long-time bugbear of mine, and I would love to
remove the magic and the side-effects of `import airflow`.
Do you have any plans or thoughts about how to actually achieve this?
This might be related, or it might not be, but I think I would also love
it if we moved all of “core” (scheduler, jobs, api server etc) to
airflow_core.* python modules , and out of `airflow.*` entirely. (Meaning
`airflow` would be left for just `airflow.sdk` and `airflow.providers`,
plus some compat shims, possibly installed by apache-airflow-task-sdk
itself). Were you thinking something similar?
-ash
On 7 Jul 2025, at 12:35, Jarek Potiuk<ja...@potiuk.com> wrote:
I would like to raise another discussion here - about potentially fixing
the excessive initialization pattern in `airflow/__init__.py`. It results
from the
https://github.com/apache/airflow/pull/52952#discussion_r2188492257
discussion.
This is something we have been seeing for quite some time in Airflow 1
and 2, and we still have some problems with it in Airflow 3. I think
that with the completion of the Task Isolation work we have a chance to
straighten it out.
Currently, we just do a LOT of stuff when we do `import airflow` -
initializing configurations, settings, secrets, registering ORM models...
you name it.
This is likely (it has never been documented, so I am guessing the root
cause now) the result of the philosophy that "import airflow" should get
you up and running and everything needed should already be "ready for
use". This allows, for example, opening a Python REPL in the airflow
venv, doing "import airflow", and having everything you would like to do
be possible. It comes from the highly monolithic architecture of Airflow,
where we had just one package. I think we do not have to hold on to this
assumption/expectation.
The thing is that the whole environment is changing in Airflow 3, and it
will change even further when task isolation is completed. We simply no
longer have a monolithic structure of packages: we have several
distributions sharing "airflow", and they might or might not be installed
together, which adds a lot of complexity if we rely on "__init__.py" code
being executed.
While I proposed (years ago) to make separate "top level" packages (for
example "airflow_providers" for providers), this proposal was rejected by
the community and "airflow" became the common "root" package for
everything. As a result, the common "initialization" code is shared - but
not really - because sometimes our distributions are installed together
and sometimes separately, and we need to handle a lot of complexity and
implement some hacks to make this "common" initialization work in all
scenarios.
And it leads to a number of complexities and problems we (and our users)
often experience:
* there are often "module not fully initialized" errors that are
difficult to debug and fix when we are trying to import parts of airflow
from other modules that are "being initialized" (logging and secrets
managers are particularly susceptible to that) - we have a lot of "local
imports" and other ways to deal with it.
* we have a lot of "lazy-loading" implemented - in both production code
and tests - just to handle the conditional nature of some things. For
example, the @provider_configurations_loaded decorator is implemented
specifically to defer initializing providers until they are going to be
used. This is not the "best" pattern, but one that works in the
circumstances of init doing a lot - and it's a direct result of us doing
this heavy initialisation. It could be simplified if we did explicit
initialization of things when needed in specific CLI commands.
* our "plugins" interface that used to be "all-in-one" is now pretty
fragmented across what needs to be initialized where. While Scheduler
needs "timetable" plugins, it does not need "macros" nor "fast_api_apps"
and it should not initialize them, but "webserver" on the other hand
needs "fast_api_apps" and worker also needs "global_operator_links" (this
is a recent change I think - they used to be rendered in web server).
* we have a hard time deciding when we should do certain parts of
initialization. For example, currently the plugins manager is effectively
initialized in "import airflow", which means that the only way to find
out which "cli" command we are running is to look at the interpreter's
arguments, so that we can "guess" whether we are running as worker or
api-server. After the split we are not supposed to always initialize all
plugins, so the current implementation in #52952 is ....weird.... out of
necessity:
    # Load the API endpoint only on api-server (Airflow 3.x) or webserver (Airflow 2.x)
    if AIRFLOW_V_3_0_PLUS:
        RUNNING_ON_APISERVER = sys.argv[1] in ["api-server"] if len(sys.argv) > 1 else False
    else:
        RUNNING_ON_APISERVER = "gunicorn" in sys.argv[0] and "airflow-webserver" in sys.argv
*Now, how to fix it?*
I think the answer is in the Zen of Python: "explicit is better than
implicit". I think we could simplify a lot of code if we drop the
assumption that "import airflow" does everything for you. In fact it
should do pretty much **nothing**. Then whenever a particular CLI of
airflow is run, we should explicitly initialize whatever we need.
Say:
* airflow api-server -> configuration, settings, database, fast_api_server
and main "airflow" app
* celery worker -> configuration, settings, task_sdk, fast_api_server and
"serve_logs" app, "macro plugins", "global_operator_links"
* scheduler -> configuration, settings, database, timetable plugins
etc. etc. Always in the right sequence (this matters a lot, and it is
currently one of the sources of problems: depending on which package you
import first, our lazy loading might work differently), with minimal lazy
loading - i.e. minimal implicitness.
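
A minimal sketch of how such explicit, ordered initialization could be expressed follows. The step names are taken from the list above; the canonical order, the command keys and the helper names (`init_sequence`, `initialize_for`) are assumptions for illustration and not an existing Airflow API.

```python
# One canonical order shared by all commands (assumed for illustration).
CANONICAL_ORDER = (
    "configuration",
    "settings",
    "database",
    "task_sdk",
    "fast_api_server",
    "airflow_app",
    "serve_logs_app",
    "timetable_plugins",
    "macro_plugins",
    "global_operator_links",
)

# What each command explicitly asks for, following the list above.
COMMAND_NEEDS = {
    "api-server": {
        "configuration", "settings", "database", "fast_api_server", "airflow_app",
    },
    "celery-worker": {
        "configuration", "settings", "task_sdk", "fast_api_server",
        "serve_logs_app", "macro_plugins", "global_operator_links",
    },
    "scheduler": {"configuration", "settings", "database", "timetable_plugins"},
}


def init_sequence(command):
    """Return the steps `command` needs, always in the canonical order."""
    needs = COMMAND_NEEDS[command]
    unknown = needs - set(CANONICAL_ORDER)
    if unknown:
        raise ValueError(f"steps without a defined order: {sorted(unknown)}")
    return [step for step in CANONICAL_ORDER if step in needs]


def initialize_for(command, initializers):
    """Call the initializer of every needed step in order; return what ran."""
    ran = []
    for step in init_sequence(command):
        initializers[step]()
        ran.append(step)
    return ran
```

With something like this, each CLI entry point would call `initialize_for("scheduler", ...)` itself, so the sequence no longer depends on which package happens to be imported first, and nothing has to be guessed from `sys.argv`.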
I attempted to do it partially in the past (I guess 3 times) and failed
miserably because of the intermixing of configuration, settings and
database - but with a lot of work being done on task isolation, I think a
lot of the roadblocks there are either being handled or have been handled
already.
Also I think it's not a "breaking" change. We never actually promised that
"import airflow" does all the initialization. If this is relied on, it's
mostly in CI/tests etc. and should be easily remediated by providing
appropriate initialization calls (and an appropriate sequence of those
initializations).
I am happy to lead that effort if we agree this is a good direction. It
might also already be kind of planned (explicitly or implicitly) as part
of the task isolation work - so maybe what I am writing about has already
been taken into account (but I have not seen it explicitly addressed), and
I am happy to help there as well.
I would love to hear your opinions on that.
J.