Yeah, this has been a long-time bugbear of mine, and I would love to remove the magic and the side effects of `import airflow`.

Do you have any plans or thoughts about how to actually achieve this?

This might be related, or it might not be, but I think I would also love it if we moved all of “core” (scheduler, jobs, API server, etc.) to `airflow_core.*` Python modules, and out of `airflow.*` entirely. (Meaning `airflow` would be left for just `airflow.sdk` and `airflow.providers`, plus some compat shims, possibly installed by apache-airflow-task-sdk itself.)

Were you thinking something similar?

-ash

> On 7 Jul 2025, at 12:35, Jarek Potiuk <ja...@potiuk.com> wrote:
>
> I would like to raise another discussion here - about potentially fixing
> the excessive initialization pattern in `airflow.__init__.py`. It results
> from the
> https://github.com/apache/airflow/pull/52952#discussion_r2188492257
> discussion.
>
> This is something we have been seeing for quite some time in Airflow 1 and
> 2, and we still have some problems with it in Airflow 3. I think that with
> the completion of the Task Isolation work, we have a chance to straighten
> it out.
>
> Currently, we just do a LOT of stuff when we do `import airflow` -
> initializing configurations, settings, secrets, registering ORM models ..
> you name it..
>
> This is likely (it has never been documented, so I am guessing the root
> cause now) the result of the philosophy that "import airflow" should get
> you up and running and everything needed should already be "ready for use".
> This allows you, for example, to open a Python REPL in the airflow venv, do
> "import airflow" - and everything you would like to do should be possible
> to do. And it comes from the highly monolithic architecture of Airflow,
> where we had just one package. I think we do not have to hold to this
> assumption/expectation.
>
> The thing is that the whole environment is changing in Airflow 3 and it
> will change even further when task isolation is completed.
> We simply do not have a monolithic structure of packages: we have several
> distributions sharing "airflow", and they might or might not be installed
> together, which adds a lot of complexity if we rely on "__init__.py" code
> being executed.
>
> Years ago I proposed making separate "top level" packages (for example
> "airflow_providers" for providers), but this proposal was rejected by the
> community and "airflow" became the common "root" package for everything.
> As a result, the common "initialization" code is shared - but not really -
> because sometimes our distributions can be installed together, sometimes
> separately - and we need to handle a lot of complexity and implement some
> hacks to make this "common" initialization work in all scenarios.
>
> And it leads to a number of complexities and problems we (and our users)
> often experience:
>
> * there are often "module not fully initialized" errors that are difficult
> to debug and fix when we are trying to import parts of airflow from other
> modules that are "being initialized" (logging and secrets managers are
> particularly susceptible to that) - we have a lot of "local imports" and
> other ways to deal with it.
>
> * we have a lot of "lazy-loading" implemented - in both production code and
> tests - just to handle the conditional nature of some things - for
> example the @provider_configurations_loaded decorator is implemented
> specifically to defer initializing providers until they are going to be
> used. This is not the "best" pattern but one that works in the
> circumstances of init doing a lot - and it's a direct result of us doing
> this heavy initialization. It could be simplified if we did explicit
> initialization of things when needed in specific CLI commands.
>
> * our "plugins" interface that used to be "all-in-one" is now pretty
> fragmented across what needs to be initialized where.
> While the scheduler needs "timetable" plugins, it does not need "macros"
> or "fast_api_apps" and it should not initialize them, but the "webserver"
> on the other hand needs "fast_api_apps", and the worker also needs
> "global_operator_links" (this is a recent change I think - they used to be
> rendered in the web server).
>
> * we have a hard time deciding when we should do certain parts of
> initialization - for example, currently the plugins manager is effectively
> initialized in "import airflow" - and it means that the only way to find
> out which "cli" command we are running is to look at the arguments of the
> interpreter - so that we can "guess" if we are run as worker or api_server -
> because after the split, we are not supposed to always initialize all
> plugins - so the current implementation in #52952 is ....weird.... out of
> necessity:
>
>     # Load the API endpoint only on api-server (Airflow 3.x) or webserver (Airflow 2.x)
>     if AIRFLOW_V_3_0_PLUS:
>         RUNNING_ON_APISERVER = sys.argv[1] in ["api-server"] if len(sys.argv) > 1 else False
>     else:
>         RUNNING_ON_APISERVER = "gunicorn" in sys.argv[0] and "airflow-webserver" in sys.argv
>
> *Now, how to fix it?*
>
> I think the answer is in the Zen of Python: "explicit is better than
> implicit". I think we could simplify a lot of code if we drop the
> assumption that "import airflow" does everything for you. In fact it should
> do pretty much **nothing**. Then whenever a particular airflow CLI command
> is run, we should explicitly initialize whatever we need.
>
> Say:
>
> * airflow api_server -> configuration, settings, database, fast_api_server
> and main "airflow" app
> * celery worker -> configuration, settings, task_sdk, fast_api_server and
> "serve_logs" app, "macro plugins", "global_operator_links"
> * scheduler -> configuration, settings, database, timetable plugins
>
> etc. etc.
> Always in the right sequence (this matters a lot, and it is currently one
> of the sources of problems: depending on which package you import first,
> our lazy loading might work differently), with minimal lazy loading - i.e.
> minimal implicitness.
>
> I attempted to do it partially in the past (I guess 3 times) and failed
> miserably because of the intermixing of configuration, settings and
> database - but with a lot of work being done on task isolation, I think a
> lot of the roadblocks there are either being handled or handled already.
>
> Also I think it's not a "breaking" change. We never actually promised that
> "import airflow" does all the initialization. If this is relied on - it's
> mostly in CI/tests etc. and should be easily remediated by providing
> appropriate initialization calls (and an appropriate sequence of those
> initializations).
>
> I am happy to lead that effort if we agree this is a good direction. It
> might already also be kind of planned (explicitly or implicitly) as part of
> the task isolation work - so maybe what I am writing about has already been
> taken into account (but I have not seen it explicitly addressed) and I am
> happy to help there as well.
>
> I would love to hear your opinions on that.
>
> J.
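
For illustration only, the "explicit initialization per CLI command, always in the same sequence" idea from the quoted message could be sketched roughly as below. Everything in it is hypothetical: `STEP_ORDER`, `COMMAND_STEPS`, `initialize_for` and the placeholder step functions are not existing Airflow APIs, and the step names are just taken from the examples in the quoted message.

```python
from typing import Callable, Dict, FrozenSet, List

# Hypothetical canonical order of initialization steps. The point of the
# sketch is that the order is fixed in one place and never depends on which
# module happens to be imported first.
STEP_ORDER: List[str] = [
    "configuration",
    "settings",
    "database",
    "task_sdk",
    "timetable_plugins",
    "macro_plugins",
    "global_operator_links",
    "fast_api_server",
]

# Which steps each command needs (assumed mapping, based on the examples in
# the quoted message; not an authoritative list).
COMMAND_STEPS: Dict[str, FrozenSet[str]] = {
    "api-server": frozenset(
        {"configuration", "settings", "database", "fast_api_server"}
    ),
    "celery worker": frozenset(
        {
            "configuration",
            "settings",
            "task_sdk",
            "fast_api_server",
            "macro_plugins",
            "global_operator_links",
        }
    ),
    "scheduler": frozenset(
        {"configuration", "settings", "database", "timetable_plugins"}
    ),
}

# Records every step that has been executed, in execution order.
INIT_LOG: List[str] = []


def _make_step(name: str) -> Callable[[], None]:
    """Build a placeholder step that only records that it ran."""

    def step() -> None:
        INIT_LOG.append(name)

    return step


# Placeholder implementations; real steps would do the actual setup work.
STEPS: Dict[str, Callable[[], None]] = {
    name: _make_step(name) for name in STEP_ORDER
}


def initialize_for(command: str) -> List[str]:
    """Run only the steps the command needs, always in STEP_ORDER."""
    needed = COMMAND_STEPS[command]
    ran: List[str] = []
    for name in STEP_ORDER:
        if name in needed:
            STEPS[name]()
            ran.append(name)
    return ran
```

With something of this shape, each CLI entry point would call `initialize_for(...)` itself with its own command name, so nothing has to inspect `sys.argv` at import time to guess which component is running.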