Yeah, this has been a long-time bugbear of mine, and I would love to remove
the magic and the side effects of `import airflow`.

Do you have any plans or thoughts about how to actually achieve this?

This might be related, or it might not be, but I think I would also love it
if we moved all of “core” (scheduler, jobs, API server, etc.) to
`airflow_core.*` Python modules, and out of `airflow.*` entirely. (Meaning
`airflow` would be left for just `airflow.sdk` and `airflow.providers`, plus
some compat shims, possibly installed by apache-airflow-task-sdk itself.)
Were you thinking something similar?
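
To make the "compat shims" idea a bit more concrete, here is a minimal
sketch of one way such a shim could work, using a module-level
`__getattr__` (PEP 562). Everything in it is an assumption for illustration
only: the module names `airflow_core_demo` and `airflow_legacy_demo`, the
`DemoJob` class and the `_MOVED` table are hypothetical stand-ins, not
existing Airflow code.

```python
import importlib
import sys
import types
import warnings

# Stand-in for a hypothetical new home such as "airflow_core.jobs".
new_home = types.ModuleType("airflow_core_demo")


class DemoJob:
    """Placeholder class; only its identity matters for the demo."""


new_home.DemoJob = DemoJob
sys.modules[new_home.__name__] = new_home

# Hypothetical table of names that moved: old attribute -> new module name.
_MOVED = {"DemoJob": "airflow_core_demo"}


def _compat_getattr(name):
    """Module-level __getattr__ (PEP 562): called only for missing names."""
    target = _MOVED.get(name)
    if target is None:
        raise AttributeError(
            f"module 'airflow_legacy_demo' has no attribute {name!r}"
        )
    warnings.warn(
        f"airflow_legacy_demo.{name} moved to {target}.{name}",
        DeprecationWarning,
        stacklevel=2,
    )
    # The new module is imported only when the old name is actually used.
    return getattr(importlib.import_module(target), name)


# Stand-in for the old location, which stays importable but holds no code.
old_home = types.ModuleType("airflow_legacy_demo")
old_home.__getattr__ = _compat_getattr
sys.modules[old_home.__name__] = old_home
```

With this in place, `old_home.DemoJob` resolves to the class defined in the
new module and emits a DeprecationWarning, while importing the old module by
itself has no side effects at all.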

-ash

> On 7 Jul 2025, at 12:35, Jarek Potiuk <ja...@potiuk.com> wrote:
> 
> I would like to raise another discussion here: potentially fixing the
> excessive initialization pattern in `airflow.__init__.py`. It results from
> this discussion:
> https://github.com/apache/airflow/pull/52952#discussion_r2188492257
> 
> This is something we have been seeing for quite some time in Airflow 1 and
> 2, and we still have some problems with it in Airflow 3. I think that with
> the Task Isolation work being completed, we have a chance to straighten it
> out.
> 
> Currently, we do a LOT of stuff when we do `import airflow`: initializing
> configuration, settings, and secrets, registering ORM models, you name it.
> 
> This is likely (it has never been documented, so I am guessing at the root
> cause) the result of a philosophy that "import airflow" should get you up
> and running, with everything needed already "ready for use". This allows
> you, for example, to open a Python REPL in the Airflow venv, do "import
> airflow", and have everything you would like to do be possible. It comes
> from the highly monolithic architecture of Airflow, where we had just one
> package. I think we do not have to hold on to this assumption/expectation.
> 
> The thing is that the whole environment is changing in Airflow 3, and it
> will change even further when task isolation is completed. We simply do not
> have a monolithic package structure any more: we have several distributions
> sharing "airflow", and they might or might not be installed together, which
> adds a lot of complexity if we rely on "__init__.py" code being executed.
> 
> Years ago I proposed making separate "top level" packages (for example
> "airflow_providers" for providers), but this proposal was rejected by the
> community and "airflow" became the common "root" package for everything. As
> a result, the common "initialization" code is shared, but not really:
> sometimes our distributions are installed together, sometimes separately,
> and we need to handle a lot of complexity and implement some hacks to make
> this "common" initialization work in all scenarios.
> 
> And it leads to a number of complexities and problems we (and our users)
> often experience:
> 
> * there are often "module not fully initialized" errors that are difficult
> to debug and fix when we are trying to import parts of airflow from other
> modules that are "being initialized" (logging, secrets managers are
> particularly susceptible to that) - we have a lot of "local imports" and
> other ways to deal with it.
> 
> * we have a lot of "lazy-loading" implemented, in both production code and
> tests, just to handle the conditional nature of some things. For example,
> the @provider_configurations_loaded decorator is implemented specifically
> to defer initializing providers until they are going to be used. This is
> not the "best" pattern, but one that works in the circumstances of init
> doing a lot, and it is a direct result of this heavy initialization. It
> could be simplified if we did explicit initialization of things when
> needed in specific CLI commands.
> 
> * our "plugins" interface that used to be "all-in-one" is now pretty
> fragmented across what needs to be initialized where. While Scheduler needs
> "timetable" plugins, it does not need "macros" nor "fast_api_apps" and it
> should not initialize them, but "webserver" on the other hand needs
> "fast_api_apps" and worker also needs "global_operator_links" (this is a
> recent change I think - they used to be rendered in web server).
> 
> * we have a hard time deciding when we should do certain parts of
> initialization. For example, the plugins manager is currently initialized
> in "import airflow", effectively, which means that the only way to find out
> which "cli" command we are running is to look at the arguments of the
> interpreter, so that we can "guess" whether we are run as a worker or as
> the api-server. After the split we are not supposed to always initialize
> all plugins, so the current implementation in #52952 is ....weird.... out
> of necessity:
> 
> # Load the API endpoint only on api-server (Airflow 3.x) or webserver (Airflow 2.x)
> if AIRFLOW_V_3_0_PLUS:
>    RUNNING_ON_APISERVER = sys.argv[1] in ["api-server"] if len(sys.argv) > 1 else False
> else:
>    RUNNING_ON_APISERVER = "gunicorn" in sys.argv[0] and "airflow-webserver" in sys.argv
> 
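
The lazy-loading pattern mentioned in the list above can be sketched roughly
as follows. This is an illustration of the general technique only, not
Airflow's actual `provider_configurations_loaded` implementation: the
`DemoProviderRegistry` class, its fields and the sample configuration values
are all made up for the example.

```python
import functools


def configurations_loaded(method):
    """Run the expensive load on first use of any decorated method."""

    @functools.wraps(method)
    def wrapper(self, *args, **kwargs):
        if not self._loaded:
            self._load()
        return method(self, *args, **kwargs)

    return wrapper


class DemoProviderRegistry:
    """Hypothetical registry whose data is expensive to discover."""

    def __init__(self):
        # Deliberately cheap: nothing is discovered at construction time.
        self._loaded = False
        self._configs = {}
        self.load_count = 0

    def _load(self):
        # Placeholder for real discovery work (entry points, files, ...).
        self.load_count += 1
        self._configs = {"demo_provider": {"enabled": True}}
        self._loaded = True

    @configurations_loaded
    def get(self, provider_name):
        return self._configs.get(provider_name)
```

The first call to `get()` triggers the load and later calls reuse it. It
works, but every entry point has to remember the decorator, which is the
kind of implicitness that explicit initialization would avoid.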
> *Now, how to fix it?*
> 
> I think the answer is in the Zen of Python: "explicit is better than
> implicit". I think we could simplify a lot of code if we dropped the
> assumption that "import airflow" does everything for you. In fact it should
> do pretty much **nothing**. Then, whenever a particular airflow CLI command
> is run, we should explicitly initialize whatever we need.
> 
> Say:
> 
> * airflow api-server -> configuration, settings, database, fast_api_server
> and main "airflow" app
> * celery worker -> configuration, settings, task_sdk, fast_api_server and
> "serve_logs" app, "macro plugins", "global_operator_links"
> * scheduler -> configuration, settings, database, timetable plugins
> 
> And so on, always in the right sequence (this matters a lot: one current
> source of problems is that, depending on which package you import first,
> our lazy loading might work differently), with minimal lazy loading, i.e.
> minimal implicitness.
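
A rough sketch of what explicit, per-command initialization in a fixed
sequence could look like is given here. It is only an illustration under
stated assumptions: the step names are taken loosely from the lists above,
and `COMMAND_STEPS`, `make_initializers` and `initialize_for` are
hypothetical names, not existing Airflow functions. Each "initializer" just
records its name so that the ordering is visible.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical mapping of CLI command -> ordered initialization steps.
COMMAND_STEPS: Dict[str, Tuple[str, ...]] = {
    "api-server": ("configuration", "settings", "database", "fast_api_apps"),
    "scheduler": ("configuration", "settings", "database", "timetable_plugins"),
    "celery-worker": (
        "configuration",
        "settings",
        "task_sdk",
        "serve_logs_app",
        "macro_plugins",
        "global_operator_links",
    ),
}


def make_initializers(log: List[str]) -> Dict[str, Callable[[], None]]:
    """Build one placeholder initializer per step; each only records its name."""
    names = {step for steps in COMMAND_STEPS.values() for step in steps}
    return {name: (lambda n=name: log.append(n)) for name in names}


def initialize_for(command: str, log: List[str]) -> None:
    """Run exactly the steps a command needs, in its declared order."""
    if command not in COMMAND_STEPS:
        raise ValueError(f"unknown command: {command!r}")
    initializers = make_initializers(log)
    for step in COMMAND_STEPS[command]:
        initializers[step]()
```

Calling `initialize_for("scheduler", log)` runs the four scheduler steps in
the declared order and nothing else, so the entry point states which
component is running instead of anything having to guess it from `sys.argv`.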
> 
> I attempted to do it partially in the past (I guess 3 times) and failed
> miserably because of the intermixing of configuration, settings and
> database. But with a lot of work being done on task isolation, I think a
> lot of the roadblocks there are either being handled or have been handled
> already.
> 
> Also, I think it's not a "breaking" change. We never actually promised that
> "import airflow" does all the initialization. Where this is relied on, it's
> mostly in CI/tests etc., and it should be easily remediated by providing
> appropriate initialization calls (and an appropriate sequence of those
> initializations).
> 
> I am happy to lead that effort if we agree this is a good direction. It
> might already be kind of planned (explicitly or implicitly) as part of the
> task isolation work, so maybe what I am writing about has already been
> taken into account (but I have not seen it explicitly addressed), and I am
> happy to help there as well.
> 
> I would love to hear your opinions on that.
> 
> J.

