I am all for accepting it. Naming etc. can be worked out, but I am very much for adding multiple optimisations to Airflow, and this one seems not only easy but also pretty obvious.
Just a comment here. I spent the last two weeks in the Bay Area and at Community Over Code in Denver, and I met a lot of people from our industry who are a lot smarter than me and whose experience and knowledge I tapped into to learn what is happening in our industry: people from the Apache Software Foundation (various data engineering, big data, and ML/AI related projects); my old friend Roman Shaposhnik - VP Legal at the foundation, currently running startups especially in the ML/AI area (that is the hot area, of course); Sid Anand - our PMC member; Greg Czajkowski - ex-VP of Engineering at Snowflake, ex-Google, ex-Sun/Java engineer; and a number of other people. I was also a "Data Engineering" track lead at Community Over Code - two days, 16 talks, with countless discussions and questions around a number of data engineering tools, libraries, and projects.

There is one theme I have been hearing over and over and over again: we are on the brink of getting into "optimization wars". Running ML/LLM/AI workloads - both training and inference related ones - is currently terribly, terribly expensive. It's not sustainable in the long run. And while the ideas about "what solutions should be used for LLM/ML/AI" are already pretty firm, with the transformer architecture pretty much overtaking everything else, everyone is now focusing on running whatever they are running in far more optimized ways: reusing GPUs for multiple parallel workloads so that parts of the models can be loaded to the GPU once and reused; using Arrow as the ubiquitous format to store data in order to exploit Arrow's zero-copy capabilities, and chaining multiple tools through Arrow so the same data is reused not only in memory but, as of recently, on the GPU; co-locating data centers from multiple clouds closer to each other in order to save bandwidth and decrease latency; heck - even running all workloads on K8S in a way that lets you move stable workloads from clouds to a 3x cheaper on-premise K8S cluster at any time. All that and more - "optimizing the cost" (which translates to optimizing all resource use) is all the rage now.

A lot of people are also working on optimizing hardware. NVIDIA is currently the top player, but there are hardware contenders working on specialized hardware that will optimize some of the workloads 100x (for example, dedicated transformer-only-capable hardware). It will take many years for that hardware to get adopted, though - it needs quality APIs, compilers, and adoption by the existing solutions and tools - so for now everyone is trying to optimize their processes at the "data engineering" and software level, by optimizing out all the processing overhead introduced by the pipelines, libraries, and tools they run their workflows with. This will easily be a theme for the next two or three years, and we have a chance to shine as a leader in that space - also by making it very easy and intuitive to optimize the workflows our users currently have.

I think the solution proposed by David is a very intuitive and easy way for users to optimize their workflows. We should not stop there - I will soon start another discussion about one of the possible things we should consider implementing (unicorn executor) - but I think the more options we give our users to easily optimise their workflows in Airflow 3.*, the more suitable it will be for some of the new workflows we have in mind for Airflow 3 (mainly inference related).

J.
On Mon, Oct 7, 2024 at 10:33 AM Blain David <david.bl...@infrabel.be> wrote:

> Exactly Jarek, and yes Daniel, I use that max_active_tis_per_dag parameter
> to control parallelism. The main reason for this solution is not the
> parallelism itself, but being able to loop within one and the same task
> instance without the overhead of XComs and task scheduling, which of course
> makes it (a lot) faster. By default my implementation uses parallelism, but
> sometimes it's necessary not to do so when doing IO while still wanting this
> approach, as the processing will still be faster than when doing it through
> dynamic task mapping.
>
> I was aware that if this feature were added, it would be Airflow 3.x
> anyway. As I mentioned before, I wanted a POC to know if it was easily
> feasible without too many changes, and I also didn't want to wait until we
> have 3.0. That's why I tried to implement this generic approach at our
> company, to see if it:
>
> 1. Would be easily feasible to implement without too many changes in the
> Airflow base code.
> 2. Would offer better performance (you always need to measure to know).
> 3. Would make the DAG code cleaner (less or no custom code required).
>
> Now that I know it's possible and all 3 points were checked off, we could
> think of a better implementation in Airflow 3.x, if this feature were
> accepted of course.
>
> -----Original Message-----
> From: Jarek Potiuk <ja...@potiuk.com>
> Sent: Saturday, October 5, 2024 12:35 AM
> To: dev@airflow.apache.org
> Subject: Re: [PROPOSAL] Add streaming support to PartialOperator
>
> From the earlier discussions with David - this is also (and mainly) about
> optimisation. Those operators do very little, and when you add the total
> overhead that Airflow adds for scheduling and running every task, it turns
> out that looping such an operator's execute in a single interpreter is
> many, many times faster (several orders of magnitude) than running them
> sequentially as tasks.
>
> On Fri, Oct 4, 2024 at 10:16 AM Daniel Standish
> <daniel.stand...@astronomer.io.invalid> wrote:
>
> > Well, it looks like we do have concurrency control for mapped tasks
> > after all.
> >
> > See max_active_tis_per_dagrun, which was added in
> > https://github.com/apache/airflow/pull/29094.
> >
> > So this would allow you to map over your 3000 users in a single run,
> > but process only one at a time (or 5 or 10 at a time). Does that help
> > your use case?
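To make the two approaches being compared in this thread concrete, here are minimal sketches. The DAG and function names, the user list, and the worker counts are illustrative assumptions, not part of the proposal.

First, Daniel's suggestion: dynamic task mapping with per-run concurrency capped via max_active_tis_per_dagrun (available since the PR linked above). Each element becomes its own task instance, so Airflow schedules, retries, and records each one separately.

from airflow.decorators import dag, task
import pendulum


@dag(schedule=None, start_date=pendulum.datetime(2024, 10, 1), catchup=False)
def mapped_with_concurrency_cap():

    @task
    def list_users() -> list[str]:
        # Hypothetical source of work items; David's scenario has ~3000 users.
        return [f"user_{i}" for i in range(3000)]

    @task(max_active_tis_per_dagrun=5)  # at most 5 mapped instances run at once per DAG run
    def process_user(user_id: str) -> None:
        ...  # per-element work; each element is a separate task instance

    process_user.expand(user_id=list_users())


mapped_with_concurrency_cap()

Second, a hedged sketch of the behaviour David describes: a single task instance loops over all the elements itself (in parallel by default, sequentially when IO makes that preferable), trading per-task scheduling and XCom overhead for in-process iteration. This only illustrates the idea; the actual streaming API proposed for PartialOperator may look different.

from concurrent.futures import ThreadPoolExecutor
from airflow.decorators import task


@task
def process_all_users(user_ids: list[str], parallel: bool = True) -> None:
    def handle(user_id: str) -> None:
        ...  # the same per-element work, run inside one task instance

    if parallel:
        # Default: process elements in parallel threads within the single task.
        with ThreadPoolExecutor(max_workers=5) as pool:
            list(pool.map(handle, user_ids))
    else:
        # Sequential fallback, e.g. when IO makes parallelism undesirable.
        for user_id in user_ids:
            handle(user_id)

The trade-off follows the thread: the mapped variant keeps Airflow-level visibility and retries per element, while the in-task loop avoids the scheduling and XCom overhead that dominates when each element does very little work.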