RE: [PROPOSAL] Add streaming support to PartialOperator

Blain David Mon, 07 Oct 2024 01:33:19 -0700

Exactly Jarek, and yes Daniel, I use that max_active_tis_per_dag parameter to 
control parallelism.  The main reason for this solution is not the parallelism 
itself, but to be able to loop within one same task instance without overhead 
of Xcom's and task scheduling, which of course makes it (a lot) faster.  By 
default, it will use parallelism in my implementation, but sometimes it's a 
necessity to not do it when doing IO but still want this approach as the 
processing will still be faster than when doing it through dynamic task mapping.

I was aware that if that feature would be added, it would have been Airflow 3.x 
anyway.  As I mentioned before, I wanted  a POC to know if it was easily 
feasible without too much changes and also I didn't wanted to wait until we 
have 3.0.  That's why I tried to implement this generic approach at our company 
to see if it:

1. Would be easily feasible to implement this without too much changes in 
Airflow base code.
2. Would offer better performance (always need to measure to know it).
3. Would make the DAG code cleaner (less or no custom code required).

Now that I know it's possible and all 3 points were check marked, we could 
think of a better implementation in Airflow 3.x, if this feature would be 
accepted of course.

-----Original Message-----
From: Jarek Potiuk <ja...@potiuk.com>
Sent: Saturday, October 5, 2024 12:35 AM
To: dev@airflow.apache.org
Subject: Re: [PROPOSAL] Add streaming support to PartialOperator

EXTERNAL MAIL: Indien je de afzender van deze e-mail niet kent en deze niet 
vertrouwt, klik niet op een link of open geen bijlages. Bij twijfel, stuur deze 
e-mail als bijlage naar ab...@infrabel.be<mailto:ab...@infrabel.be>.

From the earlier discussions with David - this is also (and mainly) about 
optimisation. Those operators do very little, and when you add total overhead 
that Airflow adds for scheduling and running every task, then it turns out that 
looping such operator's execute in a single interpreter is many, many times 
faster (several orders of magnitude) than running them sequentially as tasks.

On Fri, Oct 4, 2024 at 10:16 AM Daniel Standish 
<daniel.stand...@astronomer.io.invalid> wrote:

> Well, it looks like we do have concurrency control for mapped tasks
> after all.
>
> See max_active_tis_per_dagrun which was added in
> https://github.com/apache/airflow/pull/29094.
>
> So this would allow you to map over your 3000 users in a single run,
> but process only one at a time (or 5 or 10 at a time).  Does that help
> your use case?
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@airflow.apache.org
For additional commands, e-mail: dev-h...@airflow.apache.org

RE: [PROPOSAL] Add streaming support to PartialOperator

Reply via email to