+1 for doing this. I recently went through some pain in a similar vein and came to the conclusion that `import airflow` does a lot!
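For anyone who wants to see this for themselves, one rough way to gauge how much a single import drags in is to diff `sys.modules` before and after it. This is only a minimal sketch: the helper name `measure_import` is made up for illustration, and `json` stands in for `airflow` so that it runs anywhere.

```python
import importlib
import sys
import time


def measure_import(module_name: str) -> tuple[int, float]:
    """Import a module and return (newly loaded module count, seconds taken).

    Hypothetical helper for illustration only; not part of Airflow.
    """
    before = set(sys.modules)
    start = time.perf_counter()
    importlib.import_module(module_name)
    elapsed = time.perf_counter() - start
    # Anything that appeared in sys.modules was pulled in by this import.
    new_modules = set(sys.modules) - before
    return len(new_modules), elapsed


if __name__ == "__main__":
    # Swap "json" for "airflow" inside an Airflow venv to see the real numbers.
    count, seconds = measure_import("json")
    print(f"{count} new modules in {seconds:.3f}s")
```

Run it in a fresh interpreter; modules that are already imported are not counted again.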
Not a concrete plan, but investigating airflow/__init__.py and all the other __init__.py files to see what is being initialised (config, ORM, logging, etc.) would probably be a good starting point.

We do have decent initialisation logic, but it is scattered. In my opinion we should probably have an `airflow/initialization` module (or similar) with utils to do the hard work:

- config
- orm
- logging
- plugins, etc.

Then start hunting down one CLI at a time :). Easier said than done though!

Thanks & Regards,
Amogh Desai

On Mon, Jul 7, 2025 at 6:41 PM Jarek Potiuk <ja...@potiuk.com> wrote:

>
> This might be related, or it might not be, but I think I would also love it if we moved all of “core” (scheduler, jobs, api server etc) to airflow_core.* python modules, and out of `airflow.*` entirely. (Meaning `airflow` would be left for just `airflow.sdk` and `airflow.providers`, plus some compat shims, possibly installed by apache-airflow-task-sdk itself). Were you thinking something similar?
>
> I do not have exact details yet, it's more about "changing the philosophy of initialisation". I think it would need some POC to come up with some details (but unfortunately such POC will require quite an investment and when done it would be almost complete - as there are so many intertwined things in our initialization that you only find out stuff after you move things :) ). That's my experience from previous attempts. Usually it started with - hey I can move this and that here and we will be good, but after doing it, it turned out that the other parts have to be also touched and it caused an avalanche of changes ripping through the whole codebase almost (to the point that I gave up).
>
> But yes that might be one of the ways to achieve that. I am all for trying it and seeing how it might work out.
>
> J
>
> On Mon, Jul 7, 2025 at 2:49 PM Ash Berlin-Taylor <a...@apache.org> wrote:
>
> > Yeah, this has been a long time bugbear of mine and would love to remove the magic and the side-effects of `import airflow`.
> >
> > Do you have any plans or thoughts about how to actually achieve this?
> >
> > This might be related, or it might not be, but I think I would also love it if we moved all of “core” (scheduler, jobs, api server etc) to airflow_core.* python modules, and out of `airflow.*` entirely. (Meaning `airflow` would be left for just `airflow.sdk` and `airflow.providers`, plus some compat shims, possibly installed by apache-airflow-task-sdk itself). Were you thinking something similar?
> >
> > -ash
> >
> > > On 7 Jul 2025, at 12:35, Jarek Potiuk <ja...@potiuk.com> wrote:
> > >
> > > I would like to raise another discussion here - about fixing `airflow.__init__.py` excessive initialization pattern - potentially. It results from https://github.com/apache/airflow/pull/52952#discussion_r2188492257 discussion.
> > >
> > > This is something we have been seeing for quite some time in Airflow 1 and 2 and now we still have some problems with it in Airflow 3, and I think with completing Task Isolation work, we have a chance to straighten it out.
> > >
> > > Currently, we just do a LOT of stuff when we do `import airflow` - initializing configurations, settings, secrets, registering ORM models .. you name it..
> > >
> > > This is - likely as it has never been documented so I am guessing the root cause now - result of the philosophy that "import airflow" should get you up and running and everything needed should be already "ready for use". This allows for example to open a REPL in python in airflow venv, do "import airflow" - and everything you would like to do should be possible to do. And it's coming from the highly monolithic architecture of Airflow where we had just one package. And I think we do not have to hold to this assumption/expectation.
> > >
> > > The thing is that the whole environment is changing in Airflow 3 and it will change even further when task isolation is completed. We simply do not have a monolithic structure of packages and we have several distributions sharing "airflow" and they might or might not be installed together which adds a lot of complexity if we rely on "__init__.py" code being executed.
> > >
> > > While (years ago) I proposed in the past to make separate "top level" packages (for example "airflow_providers" for providers) - this proposal has been rejected by the community and "airflow" became the common "root" package for everything. At the same time it causes that the common "initialization" code is shared - but not really - because sometimes our distributions can be installed together, sometimes separately - and we need to handle a lot of complexity and implement some hacks to make this "common" initialization to work in all scenarios.
> > >
> > > And it leads to a number of complexities and problems we (and our users) often experience:
> > >
> > > * there are often "module not fully initialized" errors that are difficult to debug and fix when we are trying to import parts of airflow from other modules that are "being initialized" (logging, secrets managers are particularly susceptible to that) - we have a lot of "local imports" and other ways to deal with it.
> > >
> > > * we have a lot of "lazy-loading" implemented - in both production code and tests - just to handle the conditional nature of some things - for example @provider_configurations_loaded decorator is implemented specifically to defer initializing providers until they are going to be used. This is not the "best" pattern but one that works in the circumstances of init doing a lot - and it's a direct result of us doing this heavy initialisation. It could have been simplified if we do explicit initialization of things when needed in specific CLI commands
> > >
> > > * our "plugins" interface that used to be "all-in-one" is now pretty fragmented across what needs to be initialized where. While Scheduler needs "timetable" plugins, it does not need "macros" nor "fast_api_apps" and it should not initialize them, but "webserver" on the other hand needs "fast_api_apps" and worker also needs "global_operator_links" (this is a recent change I think - they used to be rendered in web server).
> > >
> > > * we have a hard time deciding when we should do certain parts of initialization - for example currently plugins manager is initialized in "import airflow" effectively - and it means that the only way to find out what is the "cli" command we run is to look at the arguments of the interpreter - so that we can "guess" if we are run as worker or api_server - because after the split, we are not supposed to always initialize all plugins - so current implementation in #52952 is ....weird.... out of necessity:
> > >
> > >     # Load the API endpoint only on api-server (Airflow 3.x) or webserver (Airflow 2.x)
> > >     if AIRFLOW_V_3_0_PLUS:
> > >         RUNNING_ON_APISERVER = sys.argv[1] in ["api-server"] if len(sys.argv) > 1 else False
> > >     else:
> > >         RUNNING_ON_APISERVER = "gunicorn" in sys.argv[0] and "airflow-webserver" in sys.argv
> > >
> > > *Now, how to fix it?*
> > >
> > > I think the answer is in Python Zen "explicit is better than implicit". I think we could simplify a lot of code if we drop the assumption that "import airflow" does everything for you. In fact it should do pretty much **nothing**. Then whenever a particular CLI of airflow is run, we should explicitly initialize whatever we need.
> > >
> > > Say:
> > >
> > > * airflow api_server -> configuration, settings, database, fast_api_server and main "airflow" app
> > > * celery worker -> configuration, settings, task_sdk, fast_api_server and "serve_logs" app, "macro plugins", "global_operator_links",
> > > * scheduler -> configuration, settings, database, timetable plugins,
> > >
> > > etc. etc. Always in the right sequence (this matters a lot and it is currently one of the sources of problems that depending which package you import first our lazy loading might work differently), with minimal lazy loading - i.e. minimal implicitness.
> > >
> > > I attempted to do it partially in the past (I guess 3 times) and failed miserably because of intermixing of configuration, settings and database - but with a lot of work being done on task isolation, I think a lot of the roadblocks there are either being handled or handled already.
> > >
> > > Also I think it's not a "breaking" change. We never actually promised that "import airflow" does all the initialization. If this is relied on - it's mostly in CI/ tests etc. and should be easily remediated by providing appropriate initialization calls (and appropriate sequence of those initializations).
> > >
> > > I am happy to lead that effort if we agree this is a good direction. It might already also be kind of planned (explicitly or implicitly) as part of task isolation work - so maybe what I am writing about has already been taken into account (but I have not seen it explicitly addressed) and I am happy to help there as well.
> > >
> > > I would love to hear your opinions on that.
> > >
> > > J.
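
P.S. To make the per-command idea from the original mail a bit more concrete, here is a minimal sketch of explicit, ordered initialization. Everything in it is hypothetical: the step names only loosely follow the "Say:" list quoted above, the steps record that they ran instead of doing real work, and none of these names exist in Airflow today.

```python
from typing import Callable

# Records which steps have run; stands in for real side effects.
initialized: list[str] = []


def _step(name: str) -> Callable[[], None]:
    """Build a placeholder step. A real one would load config, set up the ORM, etc."""
    def run() -> None:
        initialized.append(name)
    return run


# Canonical order: a step may rely only on steps listed before it.
INIT_ORDER = ["configuration", "settings", "database", "timetable_plugins", "fast_api_apps"]
INIT_STEPS = {name: _step(name) for name in INIT_ORDER}

# What each CLI command needs (hypothetical mapping for illustration).
COMMAND_NEEDS = {
    "scheduler": {"configuration", "settings", "database", "timetable_plugins"},
    "api_server": {"configuration", "settings", "database", "fast_api_apps"},
}


def initialize_for(command: str) -> list[str]:
    """Run only the steps the command needs, always in INIT_ORDER."""
    needed = COMMAND_NEEDS[command]
    ran = []
    for name in INIT_ORDER:
        if name in needed:
            INIT_STEPS[name]()
            ran.append(name)
    return ran
```

With this, `initialize_for("scheduler")` returns `["configuration", "settings", "database", "timetable_plugins"]` and never touches `fast_api_apps`. The point of keeping a single `INIT_ORDER` is that the sequence no longer depends on which package happens to be imported first.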