+1 for doing this.

I recently went through some pain in a similar vein and came to the
conclusion that `import airflow` does a lot!

Not a concrete plan, but investigating airflow/__init__.py and all the
other `__init__.py` files to see what is being initialised (config, ORM,
logging, etc.) would probably be a good starting point.
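
One cheap way to start that kind of investigation is to diff `sys.modules`
around an import and see what gets pulled in. A minimal sketch:
`newly_imported_modules` is a name I made up for illustration, not an
existing Airflow helper, and in an Airflow venv you would pass "airflow"
instead of the stand-in module used here.

```python
import importlib
import sys


def newly_imported_modules(module_name: str) -> list[str]:
    """Import `module_name` and return the modules that import pulled in."""
    before = set(sys.modules)
    importlib.import_module(module_name)
    # Anything present now but not before was loaded as a side effect.
    return sorted(set(sys.modules) - before)


if __name__ == "__main__":
    # "json" is only a stand-in so the sketch runs anywhere.
    for name in newly_imported_modules("json"):
        print(name)
```

Running `python -X importtime -c "import airflow"` should give a similar
picture, with per-module import timings on top.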

We do have decent initialisation logic, but it is scattered. In my opinion
we should probably have an `airflow/initialization` module (or similar)
with utils to do the hard work:
- config
- orm
- logging
- plugins etc

Then start hunting down one CLI at a time :). Easier said than done though!

Thanks & Regards,
Amogh Desai


On Mon, Jul 7, 2025 at 6:41 PM Jarek Potiuk <ja...@potiuk.com> wrote:

> > This might be related, or it might not be, but I think I would also love
> > it if we moved all of “core” (scheduler, jobs, api server etc) to
> > airflow_core.* python modules, and out of `airflow.*` entirely. (Meaning
> > `airflow` would be left for just `airflow.sdk` and `airflow.providers`,
> > plus some compat shims, possibly installed by apache-airflow-task-sdk
> > itself). Were you thinking something similar?
>
> I do not have exact details yet, it's more about "changing
> the philosophy of initialisation". I think it would need some POC to come
> up with some details. Unfortunately such a POC will require quite an
> investment, and when done it would be almost complete, as there are so many
> intertwined things in our initialization that you only find out stuff after
> you move things :). That's my experience from previous attempts. Usually
> it started with "hey, I can move this and that here and we will be good",
> but after doing it, it turned out that the other parts also had to be
> touched, and it caused an avalanche of changes ripping through almost the
> whole codebase (to the point that I gave up).
>
> But yes that might be one of the ways to achieve that. I am all for trying
> it and seeing how it might work out.
>
> J
>
> On Mon, Jul 7, 2025 at 2:49 PM Ash Berlin-Taylor <a...@apache.org> wrote:
>
> > Yeah, this has been a long-time bugbear of mine and I would love to remove
> > the magic and the side effects of `import airflow`.
> >
> > Do you have any plans or thoughts about how to actually achieve this?
> >
> > This might be related, or it might not be, but I think I would also love
> > it if we moved all of “core” (scheduler, jobs, api server etc) to
> > airflow_core.* python modules, and out of `airflow.*` entirely. (Meaning
> > `airflow` would be left for just `airflow.sdk` and `airflow.providers`,
> > plus some compat shims, possibly installed by apache-airflow-task-sdk
> > itself). Were you thinking something similar?
> >
> > -ash
> >
> > > On 7 Jul 2025, at 12:35, Jarek Potiuk <ja...@potiuk.com> wrote:
> > >
> > > > I would like to raise another discussion here, about potentially fixing
> > > > the excessive initialization pattern in `airflow.__init__.py`. It
> > > > results from the discussion at
> > > > https://github.com/apache/airflow/pull/52952#discussion_r2188492257
> > >
> > > > This is something we have been seeing for quite some time in Airflow 1
> > > > and 2, and we still have some problems with it in Airflow 3. I think
> > > > that with the completion of the Task Isolation work, we have a chance to
> > > > straighten it out.
> > >
> > > > Currently, we just do a LOT of stuff when we do `import airflow`:
> > > > initializing configurations, settings, secrets, registering ORM models
> > > > ... you name it.
> > >
> > > > This is likely (it has never been documented, so I am guessing the root
> > > > cause now) a result of the philosophy that "import airflow" should get
> > > > you up and running and everything needed should already be "ready for
> > > > use". This allows you, for example, to open a Python REPL in the airflow
> > > > venv, do "import airflow", and everything you would like to do should be
> > > > possible. It comes from the highly monolithic architecture of Airflow,
> > > > where we had just one package. I think we do not have to hold to this
> > > > assumption/expectation.
> > >
> > > > The thing is that the whole environment is changing in Airflow 3 and it
> > > > will change even further when task isolation is completed. We simply do
> > > > not have a monolithic structure of packages, and we have several
> > > > distributions sharing "airflow" that might or might not be installed
> > > > together, which adds a lot of complexity if we rely on "__init__.py"
> > > > code being executed.
> > >
> > > > While (years ago) I proposed making separate "top level" packages (for
> > > > example "airflow_providers" for providers), this proposal was rejected
> > > > by the community and "airflow" became the common "root" package for
> > > > everything. As a result, the common "initialization" code is shared, but
> > > > not really, because sometimes our distributions can be installed
> > > > together, sometimes separately, and we need to handle a lot of
> > > > complexity and implement some hacks to make this "common" initialization
> > > > work in all scenarios.
> > >
> > > > And it leads to a number of complexities and problems we (and our
> > > > users) often experience:
> > >
> > > > * there are often "module not fully initialized" errors that are
> > > > difficult to debug and fix when we are trying to import parts of airflow
> > > > from other modules that are "being initialized" (logging and secrets
> > > > managers are particularly susceptible to that) - we have a lot of "local
> > > > imports" and other ways to deal with it.
> > >
> > > > * we have a lot of "lazy-loading" implemented, in both production code
> > > > and tests, just to handle the conditional nature of some things. For
> > > > example, the @provider_configurations_loaded decorator is implemented
> > > > specifically to defer initializing providers until they are going to be
> > > > used. This is not the "best" pattern but one that works in the
> > > > circumstances of init doing a lot, and it's a direct result of us doing
> > > > this heavy initialisation. It could be simplified if we did explicit
> > > > initialization of things when needed in specific CLI commands.
> > >
> > > > * our "plugins" interface that used to be "all-in-one" is now pretty
> > > > fragmented across what needs to be initialized where. While the
> > > > Scheduler needs "timetable" plugins, it does not need "macros" or
> > > > "fast_api_apps" and it should not initialize them; the "webserver", on
> > > > the other hand, needs "fast_api_apps", and the worker also needs
> > > > "global_operator_links" (this is a recent change I think - they used to
> > > > be rendered in the web server).
> > >
> > > > * we have a hard time deciding when we should do certain parts of
> > > > initialization. For example, the plugins manager is currently
> > > > initialized in "import airflow" effectively, which means that the only
> > > > way to find out which "cli" command we run is to look at the arguments
> > > > of the interpreter, so that we can "guess" whether we are run as worker
> > > > or api_server. After the split we are not supposed to always initialize
> > > > all plugins, so the current implementation in #52952 is ....weird....
> > > > out of necessity:
> > >
> > > > # Load the API endpoint only on api-server (Airflow 3.x) or webserver (Airflow 2.x)
> > > > if AIRFLOW_V_3_0_PLUS:
> > > >     RUNNING_ON_APISERVER = sys.argv[1] in ["api-server"] if len(sys.argv) > 1 else False
> > > > else:
> > > >     RUNNING_ON_APISERVER = "gunicorn" in sys.argv[0] and "airflow-webserver" in sys.argv
> > >
> > > > *Now, how to fix it?*
> > >
> > > > I think the answer is in the Zen of Python: "explicit is better than
> > > > implicit". I think we could simplify a lot of code if we drop the
> > > > assumption that "import airflow" does everything for you. In fact it
> > > > should do pretty much **nothing**. Then whenever a particular airflow
> > > > CLI command is run, we should explicitly initialize whatever we need.
> > >
> > > Say:
> > >
> > > > * airflow api_server -> configuration, settings, database,
> > > > fast_api_server and main "airflow" app
> > > > * celery worker -> configuration, settings, task_sdk, fast_api_server
> > > > and "serve_logs" app, "macro plugins", "global_operator_links"
> > > > * scheduler -> configuration, settings, database, timetable plugins
> > >
> > > > etc. etc. Always in the right sequence (this matters a lot and it is
> > > > currently one of the sources of problems: depending on which package you
> > > > import first, our lazy loading might work differently), with minimal
> > > > lazy loading, i.e. minimal implicitness.
> > >
> > > > I attempted to do it partially in the past (I guess 3 times) and failed
> > > > miserably because of the intermixing of configuration, settings and
> > > > database. But with a lot of work being done on task isolation, I think a
> > > > lot of the roadblocks there are either being handled or handled already.
> > >
> > > > Also I think it's not a "breaking" change. We never actually promised
> > > > that "import airflow" does all the initialization. If this is relied on,
> > > > it's mostly in CI/tests etc. and should be easily remediated by
> > > > providing appropriate initialization calls (and an appropriate sequence
> > > > of those initializations).
> > >
> > > > I am happy to lead that effort if we agree this is a good direction. It
> > > > might also already be kind of planned (explicitly or implicitly) as part
> > > > of the task isolation work, so maybe what I am writing about has already
> > > > been taken into account (but I have not seen it explicitly addressed),
> > > > and I am happy to help there as well.
> > >
> > > I would love to hear your opinions on that.
> > >
> > > J.
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: dev-unsubscr...@airflow.apache.org
> > For additional commands, e-mail: dev-h...@airflow.apache.org
> >
> >
>
