Nope. No need to defer the fix for edge. I think it's good enough for now as long as we have a clear road ahead :)
On Mon, 7 Jul 2025 at 21:58, Jens Scheffler <j_scheff...@gmx.de.invalid> wrote:

> Thanks Jarek for starting the discussion. I needed to think about it
> (and was busy all day), and by the time I read the thread it had
> already evolved with a lot of detail!
>
> I assume we also need to make 1-2 PoCs to explore how this is done:
> either with decorators (like @requires[ORM] to signal which dependency
> is needed, although this would have to be applied in many places in
> the codebase... so a bit un-cool), or via very selective init only
> (based on the context of execution, as sketched by Jarek below), maybe
> still in conjunction with lazy loading, because we also waste a lot of
> time per task just to get a few lines of code going. For example, if
> you call the classic CLI, it takes 2-3 seconds on its own until --help
> responds.
>
> I would prefer not to move packages in the codebase. But we need to
> cut down what is really loaded - maybe py-spy can also help. And I
> assume some extensive thinking is needed. It would also help to check
> which code has which dependency - you call it the cyclomatic
> dependency - which I think we have and need to cut off.
>
> Yesterday I thought it might be a breaking change. I agree with Jarek
> that a lot of it can possibly be non-breaking internally, but in the
> end the interface between providers and core might need to change -
> e.g. init the providers manager only from YAML information, instead
> of loading/importing all classes at the time plugins_manager starts.
>
> I'd also be supportive and would like to engage - realistic would be
> 3.2 for having a proper plan. Unfortunately I am pretty busy these
> days, so if somebody wants to take the lead this would be cool!
>
> Jens
>
> P.S.: Does it mean the bug fix in #52952 is blocked until resolution,
> or can we agree to have this as an intermediate solution until proper
> support is made... and potentially support for 3.1 is dropped in
> edge3?
> Until then I assume there is no better way than the ugly code...
>
> On 07.07.25 16:27, Amogh Desai wrote:
> > +1 for doing this.
> >
> > I recently went through some pain along similar lines and came to
> > the conclusion that `import airflow` does a lot!
> >
> > Not a concrete plan, but starting to investigate
> > airflow/__init__.py and all the other inits to see what is being
> > initialised (config, ORM, logging, etc.) would probably be a good
> > starting point.
> >
> > We do have a decent initialise module but it is scattered. In my
> > opinion we should probably have an `airflow/initialization` module
> > or similar, with utils to do the hard work:
> > - config
> > - orm
> > - logging
> > - plugins etc
> >
> > Then start hunting down one CLI at a time :). Easier said than done
> > though!
> >
> > Thanks & Regards,
> > Amogh Desai
> >
> > On Mon, Jul 7, 2025 at 6:41 PM Jarek Potiuk <ja...@potiuk.com> wrote:
> >
> >>> This might be related, or it might not be, but I think I would
> >>> also love it if we moved all of “core” (scheduler, jobs, api
> >>> server etc) to airflow_core.* python modules, and out of
> >>> `airflow.*` entirely. (Meaning `airflow` would be left for just
> >>> `airflow.sdk` and `airflow.providers`, plus some compat shims,
> >>> possibly installed by apache-airflow-task-sdk itself). Were you
> >>> thinking something similar?
> >>
> >> I do not have exact details yet, it's more about "changing the
> >> philosophy of initialisation". I think it would need some POC to
> >> come up with some details (but unfortunately such a POC will
> >> require quite an investment, and when done it would be almost
> >> complete, as there are so many intertwined things in our
> >> initialization that you only find out stuff after you move things
> >> :) ). That's my experience from previous attempts.
> >> Usually it started with - hey, I can move this and that here and we
> >> will be good - but after doing it, it turned out that the other
> >> parts also had to be touched, and it caused an avalanche of changes
> >> ripping through almost the whole codebase (to the point that I
> >> gave up).
> >>
> >> But yes, that might be one of the ways to achieve that. I am all
> >> for trying it and seeing how it might work out.
> >>
> >> J
> >>
> >> On Mon, Jul 7, 2025 at 2:49 PM Ash Berlin-Taylor <a...@apache.org> wrote:
> >>
> >>> Yeah, this has been a long time bugbear of mine and I would love
> >>> to remove the magic and the side-effects of `import airflow`.
> >>>
> >>> Do you have any plans or thoughts about how to actually achieve
> >>> this?
> >>>
> >>> This might be related, or it might not be, but I think I would
> >>> also love it if we moved all of “core” (scheduler, jobs, api
> >>> server etc) to airflow_core.* python modules, and out of
> >>> `airflow.*` entirely. (Meaning `airflow` would be left for just
> >>> `airflow.sdk` and `airflow.providers`, plus some compat shims,
> >>> possibly installed by apache-airflow-task-sdk itself). Were you
> >>> thinking something similar?
> >>>
> >>> -ash
> >>>
> >>>> On 7 Jul 2025, at 12:35, Jarek Potiuk <ja...@potiuk.com> wrote:
> >>>>
> >>>> I would like to raise another discussion here - about potentially
> >>>> fixing the excessive initialization pattern in
> >>>> `airflow.__init__.py`. It results from the
> >>>> https://github.com/apache/airflow/pull/52952#discussion_r2188492257
> >>>> discussion.
> >>>>
> >>>> This is something we have been seeing for quite some time in
> >>>> Airflow 1 and 2, and now we still have some problems with it in
> >>>> Airflow 3, and I think with completing Task Isolation work, we
> >>>> have a chance to straighten it out. Currently, we just do a LOT
> >>>> of stuff when we do `import airflow` - initializing
> >>>> configurations, settings, secrets, registering ORM models ..
> >>>> you name it..
> >>>>
> >>>> This is - likely, as it has never been documented so I am
> >>>> guessing the root cause now - the result of the philosophy that
> >>>> "import airflow" should get you up and running and everything
> >>>> needed should be already "ready for use". This allows you, for
> >>>> example, to open a REPL in python in the airflow venv, do "import
> >>>> airflow" - and everything you would like to do should be possible
> >>>> to do. And it's coming from the highly monolithic architecture of
> >>>> Airflow where we had just one package. And I think we do not have
> >>>> to hold to this assumption/expectation.
> >>>>
> >>>> The thing is that the whole environment is changing in Airflow 3
> >>>> and it will change even further when task isolation is completed.
> >>>> We simply do not have a monolithic structure of packages, and we
> >>>> have several distributions sharing "airflow" that might or might
> >>>> not be installed together, which adds a lot of complexity if we
> >>>> rely on "__init__.py" code being executed. While (years ago) I
> >>>> proposed making separate "top level" packages (for example
> >>>> "airflow_providers" for providers), this proposal was rejected by
> >>>> the community and "airflow" became the common "root" package for
> >>>> everything. At the same time, this means that the common
> >>>> "initialization" code is shared - but not really - because
> >>>> sometimes our distributions can be installed together, sometimes
> >>>> separately - and we need to handle a lot of complexity and
> >>>> implement some hacks to make this "common" initialization work in
> >>>> all scenarios.
> >>>>
> >>>> And it leads to a number of complexities and problems we (and our
> >>>> users) often experience:
> >>>>
> >>>> * there are often "module not fully initialized" errors that are
> >>>> difficult to debug and fix when we are trying to import parts of
> >>>> airflow from other modules that are "being initialized" (logging
> >>>> and secrets managers are particularly susceptible to that) - we
> >>>> have a lot of "local imports" and other ways to deal with it.
> >>>>
> >>>> * we have a lot of "lazy-loading" implemented - in both production
> >>>> code and tests - just to handle the conditional nature of some
> >>>> things - for example the @provider_configurations_loaded
> >>>> decorator is implemented specifically to defer initializing
> >>>> providers until they are going to be used. This is not the "best"
> >>>> pattern but one that works in the circumstances of init doing a
> >>>> lot - and it's a direct result of us doing this heavy
> >>>> initialisation. It could be simplified if we did explicit
> >>>> initialization of things when needed in specific CLI commands.
> >>>>
> >>>> * our "plugins" interface that used to be "all-in-one" is now
> >>>> pretty fragmented across what needs to be initialized where.
> >>>> While the Scheduler needs "timetable" plugins, it does not need
> >>>> "macros" nor "fast_api_apps" and it should not initialize them,
> >>>> but "webserver" on the other hand needs "fast_api_apps", and the
> >>>> worker also needs "global_operator_links" (this is a recent
> >>>> change I think - they used to be rendered in the web server).
> >>>>
> >>>> * we have a hard time deciding when we should do certain parts of
> >>>> initialization - for example, currently the plugins manager is
> >>>> effectively initialized in "import airflow" - and it means that
> >>>> the only way to find out which "cli" command we run is to look at
> >>>> the arguments of the interpreter - so that we can "guess" if we
> >>>> are run as worker or api_server - because after the split, we are
> >>>> not supposed to always initialize all plugins - so the current
> >>>> implementation in #52952 is ....weird.... out of necessity:
> >>>>
> >>>> # Load the API endpoint only on api-server (Airflow 3.x) or webserver (Airflow 2.x)
> >>>> if AIRFLOW_V_3_0_PLUS:
> >>>>     RUNNING_ON_APISERVER = sys.argv[1] in ["api-server"] if len(sys.argv) > 1 else False
> >>>> else:
> >>>>     RUNNING_ON_APISERVER = "gunicorn" in sys.argv[0] and "airflow-webserver" in sys.argv
> >>>>
> >>>> *Now, how to fix it?*
> >>>>
> >>>> I think the answer is in the Zen of Python: "explicit is better
> >>>> than implicit". I think we could simplify a lot of code if we
> >>>> drop the assumption that "import airflow" does everything for
> >>>> you. In fact it should do pretty much **nothing**. Then whenever
> >>>> a particular CLI of airflow is run, we should explicitly
> >>>> initialize whatever we need.
> >>>>
> >>>> Say:
> >>>>
> >>>> * airflow api_server -> configuration, settings, database,
> >>>> fast_api_server and main "airflow" app
> >>>> * celery worker -> configuration, settings, task_sdk,
> >>>> fast_api_server and "serve_logs" app, "macro plugins",
> >>>> "global_operator_links"
> >>>> * scheduler -> configuration, settings, database, timetable
> >>>> plugins
> >>>>
> >>>> etc. etc.
> >>>> Always in the right sequence (this matters a lot, and it is
> >>>> currently one of the sources of problems that, depending on which
> >>>> package you import first, our lazy loading might work
> >>>> differently), with minimal lazy loading - i.e. minimal
> >>>> implicitness.
> >>>>
> >>>> I attempted to do it partially in the past (I guess 3 times) and
> >>>> failed miserably because of the intermixing of configuration,
> >>>> settings and database - but with a lot of work being done on task
> >>>> isolation, I think a lot of the roadblocks there are either being
> >>>> handled or handled already.
> >>>>
> >>>> Also I think it's not a "breaking" change. We never actually
> >>>> promised that "import airflow" does all the initialization. If
> >>>> this is relied on - it's mostly in CI/tests etc. and should be
> >>>> easily remediated by providing appropriate initialization calls
> >>>> (and an appropriate sequence of those initializations).
> >>>>
> >>>> I am happy to lead that effort if we agree this is a good
> >>>> direction. It might already also be kind of planned (explicitly
> >>>> or implicitly) as part of task isolation work - so maybe what I
> >>>> am writing about has already been taken into account (but I have
> >>>> not seen it explicitly addressed) and I am happy to help there as
> >>>> well.
> >>>>
> >>>> I would love to hear your opinions on that.
> >>>>
> >>>> J.
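
For illustration only, the "explicit initialization per CLI command, always in the right sequence" idea from the original proposal could be sketched roughly as below. Everything here is hypothetical: `INIT_ORDER`, `COMMAND_PROFILES`, `INIT_STEPS` and `initialize_for` do not exist in Airflow, the command keys are just labels, the step bodies are placeholders, and the "airflow" / "serve_logs" apps are left out for brevity. The component lists are the ones named in the thread.

```python
"""Hypothetical sketch of explicit, ordered initialization per CLI command.

None of these names are real Airflow APIs; they only illustrate the idea.
"""
from typing import Callable

# Single source of truth for sequencing: a step listed earlier always runs
# before a step listed later, no matter which command asks for it.
INIT_ORDER: list[str] = [
    "configuration",
    "settings",
    "database",
    "task_sdk",
    "timetable_plugins",
    "macro_plugins",
    "global_operator_links",
    "fast_api_server",
]

# What each command needs (component lists taken from the thread).
COMMAND_PROFILES: dict[str, set[str]] = {
    "api_server": {"configuration", "settings", "database", "fast_api_server"},
    "celery_worker": {
        "configuration",
        "settings",
        "task_sdk",
        "fast_api_server",
        "macro_plugins",
        "global_operator_links",
    },
    "scheduler": {"configuration", "settings", "database", "timetable_plugins"},
}

# Names of the steps that have already run in this process.
_initialized: list[str] = []


def _make_step(name: str) -> Callable[[], None]:
    """Build a placeholder step; real code would load config, bind the ORM, etc."""

    def step() -> None:
        _initialized.append(name)

    return step


INIT_STEPS: dict[str, Callable[[], None]] = {
    name: _make_step(name) for name in INIT_ORDER
}


def initialize_for(command: str) -> list[str]:
    """Run only the steps `command` needs, in INIT_ORDER, skipping finished ones.

    Returns the names of the steps that ran during this call.
    """
    needed = COMMAND_PROFILES[command]
    ran: list[str] = []
    for name in INIT_ORDER:
        if name in needed and name not in _initialized:
            INIT_STEPS[name]()
            ran.append(name)
    return ran
```

With something along these lines, a CLI entry point would call `initialize_for("scheduler")` as its first action, and `import airflow` itself would not need to run any of the steps. Keeping the order in one list is one possible way to avoid the "depends on which package you import first" behaviour described above.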