Nope. No need to defer the fix for edge. I think it's good enough for now
as long as we have a clear road ahead :)

Mon, 7 Jul 2025, 21:58 Jens Scheffler
<j_scheff...@gmx.de.invalid> wrote:

> Thanks Jarek for starting the discussion. I needed to think about it (and
> was busy all day), and by the time I read the thread it had already evolved
> with plenty of detail!
>
> I assume we also need to make 1-2 PoCs to work out how this is done: either
> with @decorators (like @requires[ORM]) that signal which dependency is
> needed (this would need to be applied in many places in the codebase, so a
> bit un-cool), or via very selective init only (based on the context of
> execution, as sketched by Jarek below), maybe still in conjunction with lazy
> loading, because we also waste a lot of time per task just to get a few
> lines of code going. For example, the classic CLI alone takes 2-3 seconds
> until --help responds.
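
To make the @requires idea concrete, here is a minimal sketch, not how
Airflow does it today. All names in it (`initializer`, `requires`, `INIT_LOG`,
the "config" and "orm" subsystems and the two commands) are hypothetical and
only illustrate a decorator that initializes a subsystem on first use:

```python
import functools

# Hypothetical registry: subsystem name -> function that initializes it.
_INITIALIZERS = {}
_INITIALIZED = set()
INIT_LOG = []  # records the order in which subsystems were initialized


def initializer(name):
    """Register the decorated function as the initializer of `name`."""
    def register(func):
        _INITIALIZERS[name] = func
        return func
    return register


def requires(*names):
    """Initialize the named subsystems (once, in the given order) before the call."""
    def decorate(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for name in names:
                if name not in _INITIALIZED:
                    _INITIALIZERS[name]()
                    _INITIALIZED.add(name)
            return func(*args, **kwargs)
        return wrapper
    return decorate


@initializer("config")
def _init_config():
    INIT_LOG.append("config")  # stand-in for loading the configuration


@initializer("orm")
def _init_orm():
    INIT_LOG.append("orm")  # stand-in for setting up the ORM


@requires("config")
def show_help():
    return "usage: ..."


@requires("config", "orm")
def list_dags():
    return []
```

With this shape, calling `show_help()` only touches "config", and the ORM is
set up the first time something like `list_dags()` runs. The downside is the
one mentioned above: the decorator has to be applied in many places.
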
>
> I would prefer not to move packages in the codebase. But we need to cut
> down what is really loaded - maybe py-spy can also help. And I assume some
> extensive thinking is needed. It would also help to check which code has
> which dependency - you call it the cyclomatic dependency - which I think we
> have and need to cut off.
>
> Yesterday I thought it might be a breaking change. I agree with Jarek that
> much of it might be non-breaking internally, but in the end the interface
> between providers and core might need to change - e.g. initialise the
> providers manager only from YAML information instead of loading/importing
> all classes at the time plugins_manager starts.
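
A rough sketch of the "metadata first, import later" idea, under stated
assumptions: `PROVIDER_INFO` stands in for whatever would be parsed from the
provider YAML, `LazyRegistry` is a hypothetical name, and standard-library
classes are used as placeholders so the example runs anywhere:

```python
import importlib

# Stand-in for metadata parsed from a provider's YAML file (hypothetical content).
PROVIDER_INFO = {
    "hooks": {
        "json": "json.JSONDecoder",
        "ordered": "collections.OrderedDict",
    },
}


class LazyRegistry:
    """Knows class paths up front, imports a class only when it is requested."""

    def __init__(self, class_paths):
        self._class_paths = dict(class_paths)
        self._loaded = {}

    def names(self):
        # Listing what is available needs no imports at all.
        return sorted(self._class_paths)

    def get(self, name):
        if name not in self._loaded:
            module_path, _, class_name = self._class_paths[name].rpartition(".")
            module = importlib.import_module(module_path)
            self._loaded[name] = getattr(module, class_name)
        return self._loaded[name]
```

`LazyRegistry(PROVIDER_INFO["hooks"]).names()` answers "what exists" from the
metadata alone; only `get("ordered")` triggers an import.
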
>
> I'd also be supportive and would like to engage - realistically, 3.2 would
> be the target for having a proper plan - unfortunately I am pretty busy
> these days - so if somebody wants to take the lead this would be cool!
>
> Jens
>
> P.S.: Does this mean the bug fix in #52952 is blocked until resolution, or
> can we agree to keep it as an intermediate solution until proper support is
> in place... and potentially until support for 3.1 is dropped in edge3? Until
> then I assume there is no better way than the ugly code...
>
> On 07.07.25 16:27, Amogh Desai wrote:
> > +1 for doing this.
> >
> > I recently went through some pain along similar veins and came to the
> > conclusion that `import airflow` does a lot!
> >
> > Not a concrete plan, but starting to investigate airflow/__init__.py and
> > all the other inits to see what is being initialised (config, ORM, logging,
> > etc.) would probably be a good starting point.
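
One cheap way to start that investigation, as a sketch: compare `sys.modules`
before and after an import. `modules_loaded_by` is a hypothetical helper, and
"json" is only a placeholder so the snippet runs without Airflow installed;
the interesting call would be `modules_loaded_by("airflow")` in a fresh
interpreter.

```python
import importlib
import sys
import time


def modules_loaded_by(import_name):
    """Return (newly loaded module names, seconds taken) for one import.

    Modules that were already imported earlier in the same process are not
    reported, so run this in a fresh interpreter for meaningful numbers.
    """
    before = set(sys.modules)
    start = time.perf_counter()
    importlib.import_module(import_name)
    elapsed = time.perf_counter() - start
    return sorted(set(sys.modules) - before), elapsed


if __name__ == "__main__":
    new_modules, seconds = modules_loaded_by("json")
    print(f"{len(new_modules)} new modules in {seconds:.3f}s")
```

CPython's built-in `python -X importtime -c "import airflow"` gives a
per-module timing breakdown of the same thing without any extra code.
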
> >
> > We do have a decent initialise module, but it is scattered; in my opinion
> > we should probably have an `airflow/initialization` module or similar,
> > with utils to do the hard work:
> > - config
> > - orm
> > - logging
> > - plugins etc
> >
> > Then start hunting down one CLI at a time :). Easier said than done though!
> >
> > Thanks & Regards,
> > Amogh Desai
> >
> >
> > On Mon, Jul 7, 2025 at 6:41 PM Jarek Potiuk<ja...@potiuk.com> wrote:
> >
> >>> This might be related, or it might not be, but I think I would also love
> >>> it if we moved all of “core” (scheduler, jobs, api server etc) to
> >>> airflow_core.* python modules, and out of `airflow.*` entirely. (Meaning
> >>> `airflow` would be left for just `airflow.sdk` and `airflow.providers`,
> >>> plus some compat shims, possibly installed by apache-airflow-task-sdk
> >>> itself). Were you thinking something similar?
> >>
> >> I do not have exact details yet, it's more about "changing the philosophy
> >> of initialisation". I think it would need a POC to come up with the
> >> details, but unfortunately such a POC would require quite an investment
> >> and, once done, it would be almost complete - there are so many
> >> intertwined things in our initialization that you only find things out
> >> after you move them :). That's my experience from previous attempts.
> >> Usually it started with "hey, I can move this and that here and we will be
> >> good", but after doing it, it turned out that other parts also had to be
> >> touched, and it caused an avalanche of changes ripping through almost the
> >> whole codebase (to the point that I gave up).
> >>
> >> But yes, that might be one of the ways to achieve that. I am all for
> >> trying it and seeing how it might work out.
> >>
> >> J
> >>
> >> On Mon, Jul 7, 2025 at 2:49 PM Ash Berlin-Taylor<a...@apache.org> wrote:
> >>
> >>> Yeah, this has been a long-time bugbear of mine and I would love to remove
> >>> the magic and the side-effects of `import airflow`.
> >>>
> >>> Do you have any plans or thoughts about how to actually achieve this?
> >>>
> >>> This might be related, or it might not be, but I think I would also love
> >>> it if we moved all of “core” (scheduler, jobs, api server etc) to
> >>> airflow_core.* python modules, and out of `airflow.*` entirely. (Meaning
> >>> `airflow` would be left for just `airflow.sdk` and `airflow.providers`,
> >>> plus some compat shims, possibly installed by apache-airflow-task-sdk
> >>> itself). Were you thinking something similar?
> >>>
> >>> -ash
> >>>
> >>>> On 7 Jul 2025, at 12:35, Jarek Potiuk<ja...@potiuk.com> wrote:
> >>>>
> >>>> I would like to raise another discussion here - about potentially fixing
> >>>> the excessive initialization pattern of `airflow.__init__.py`. It results
> >>>> from the discussion at
> >>>> https://github.com/apache/airflow/pull/52952#discussion_r2188492257
> >>>>
> >>>> This is something we have been seeing for quite some time in Airflow 1 and
> >>>> 2, and we still have some problems with it in Airflow 3. I think that with
> >>>> the completion of the Task Isolation work, we have a chance to straighten
> >>>> it out. Currently, we just do a LOT of stuff when we do `import airflow` -
> >>>> initializing configurations, settings, secrets, registering ORM models...
> >>>> you name it.
> >>>>
> >>>> This is likely the result of the philosophy that "import airflow" should
> >>>> get you up and running, with everything needed already "ready for use"
> >>>> (it has never been documented, so I am guessing at the root cause here).
> >>>> This allows you, for example, to open a Python REPL in the airflow venv,
> >>>> do "import airflow", and have everything you would like to do be possible.
> >>>> It comes from the highly monolithic architecture of Airflow, where we had
> >>>> just one package. And I think we do not have to hold to this
> >>>> assumption/expectation.
> >>>>
> >>>> The thing is that the whole environment is changing in Airflow 3 and it
> >>>> will change even further when task isolation is completed. We simply do
> >>>> not have a monolithic structure of packages: we have several distributions
> >>>> sharing "airflow", and they might or might not be installed together,
> >>>> which adds a lot of complexity if we rely on "__init__.py" code being
> >>>> executed. Years ago I proposed making separate "top level" packages (for
> >>>> example "airflow_providers" for providers), but this proposal was rejected
> >>>> by the community and "airflow" became the common "root" package for
> >>>> everything. As a consequence, the common "initialization" code is shared -
> >>>> but not really - because sometimes our distributions are installed
> >>>> together and sometimes separately, so we need to handle a lot of
> >>>> complexity and implement some hacks to make this "common" initialization
> >>>> work in all scenarios.
> >>>>
> >>>> And it leads to a number of complexities and problems we (and our users)
> >>>> often experience:
> >>>>
> >>>> * there are often "module not fully initialized" errors that are difficult
> >>>> to debug and fix when we try to import parts of airflow from other modules
> >>>> that are "being initialized" (logging and secrets managers are
> >>>> particularly susceptible to that) - we have a lot of "local imports" and
> >>>> other ways to deal with it.
> >>>>
> >>>> * we have a lot of "lazy-loading" implemented - in both production code
> >>>> and tests - just to handle the conditional nature of some things. For
> >>>> example, the @provider_configurations_loaded decorator is implemented
> >>>> specifically to defer initializing providers until they are going to be
> >>>> used. This is not the "best" pattern, but one that works in the
> >>>> circumstances of init doing a lot - and it's a direct result of us doing
> >>>> this heavy initialisation. It could be simplified if we did explicit
> >>>> initialization of things when needed in specific CLI commands.
> >>>> * our "plugins" interface that used to be "all-in-one" is now pretty
> >>>> fragmented across what needs to be initialized where. While the Scheduler
> >>>> needs "timetable" plugins, it does not need "macros" nor "fast_api_apps"
> >>>> and it should not initialize them, but "webserver" on the other hand needs
> >>>> "fast_api_apps", and the worker also needs "global_operator_links" (this
> >>>> is a recent change I think - they used to be rendered in the web server).
> >>>>
> >>>> * we have a hard time deciding when we should do certain parts of
> >>>> initialization. For example, the plugins manager is currently initialized
> >>>> in "import airflow", effectively, which means that the only way to find
> >>>> out which "cli" command we run is to look at the arguments of the
> >>>> interpreter so that we can "guess" whether we are run as worker or
> >>>> api_server - because after the split, we are not supposed to always
> >>>> initialize all plugins. So the current implementation in #52952 is
> >>>> ....weird.... out of necessity:
> >>>>
> >>>> # Load the API endpoint only on api-server (Airflow 3.x) or webserver (Airflow 2.x)
> >>>> if AIRFLOW_V_3_0_PLUS:
> >>>>     RUNNING_ON_APISERVER = sys.argv[1] in ["api-server"] if len(sys.argv) > 1 else False
> >>>> else:
> >>>>     RUNNING_ON_APISERVER = "gunicorn" in sys.argv[0] and "airflow-webserver" in sys.argv
> >>>>
> >>>> *Now, how to fix it?*
> >>>>
> >>>> I think the answer is in the Zen of Python: "explicit is better than
> >>>> implicit". I think we could simplify a lot of code if we drop the
> >>>> assumption that "import airflow" does everything for you. In fact it
> >>>> should do pretty much **nothing**. Then whenever a particular CLI of
> >>>> airflow is run, we should explicitly initialize whatever we need.
> >>>>
> >>>> Say:
> >>>>
> >>>> * airflow api_server -> configuration, settings, database, fast_api_server
> >>>> and main "airflow" app
> >>>> * celery worker -> configuration, settings, task_sdk, fast_api_server and
> >>>> "serve_logs" app, "macro plugins", "global_operator_links"
> >>>> * scheduler -> configuration, settings, database, timetable plugins
> >>>>
> >>>> etc. etc. Always in the right sequence (this matters a lot, and it is
> >>>> currently one of the sources of problems: depending on which package you
> >>>> import first, our lazy loading might work differently), with minimal lazy
> >>>> loading - i.e. minimal implicitness.
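
As a sketch of what "explicit, per command, always in the same sequence"
could look like: the step and command names below are abbreviated from the
list above, while `INIT_ORDER`, `COMMAND_NEEDS` and `initialize_for` are
hypothetical names for illustration and do not exist in Airflow.

```python
# One global order; every command runs its subset of steps in this sequence.
INIT_ORDER = [
    "configuration",
    "settings",
    "database",
    "task_sdk",
    "timetable_plugins",
]

# What each command needs (abbreviated, hypothetical mapping).
COMMAND_NEEDS = {
    "api-server": {"configuration", "settings", "database"},
    "celery-worker": {"configuration", "settings", "task_sdk"},
    "scheduler": {"configuration", "settings", "database", "timetable_plugins"},
}


def initialize_for(command, initializers):
    """Run only the steps `command` needs, in INIT_ORDER; return what ran.

    `initializers` maps a step name to a zero-argument callable.
    """
    needed = COMMAND_NEEDS[command]
    done = []
    for step in INIT_ORDER:
        if step in needed:
            initializers[step]()
            done.append(step)
    return done
```

Because the order lives in one list, the result no longer depends on which
package happens to be imported first; a command that does not list a step
never pays for it.
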
> >>>>
> >>>> I attempted to do it partially in the past (3 times, I guess) and failed
> >>>> miserably because of the intermixing of configuration, settings and
> >>>> database - but with a lot of work being done on task isolation, I think a
> >>>> lot of the roadblocks there are either being handled or have been handled
> >>>> already.
> >>>>
> >>>> Also I think it's not a "breaking" change. We never actually promised that
> >>>> "import airflow" does all the initialization. If this is relied on, it's
> >>>> mostly in CI/tests etc. and should be easily remediated by providing
> >>>> appropriate initialization calls (and an appropriate sequence of those
> >>>> initializations).
> >>>>
> >>>> I am happy to lead that effort if we agree this is a good direction. It
> >>>> might also already be kind of planned (explicitly or implicitly) as part
> >>>> of the task isolation work - so maybe what I am writing about has already
> >>>> been taken into account (but I have not seen it explicitly addressed), and
> >>>> I am happy to help there as well.
> >>>>
> >>>> I would love to hear your opinions on that.
> >>>>
> >>>> J.
