We clearly want to position ourselves as the second group. Not that we don't 
write Python code; in fact we do, but once we see that something could be solved 
in a generic way and contributed back to Airflow, we propose that as well. 
I understand the principle of using hooks, and we do it too, when an operator 
just doesn't fit the use case because of technical difficulties, but as I 
already said, we want to keep custom code as minimal as possible.

The reasoning behind this is readability and maintenance: not everyone who 
writes a DAG is very technical.  The usage of operators is well documented in 
Airflow/Astronomer, so for those people the step to writing a DAG isn't that 
hard.

My first example in the PR used the "stream" functionality (maybe the method 
should be called "iterate" instead) on a PythonOperator, which was actually a 
bad example, as the looping could easily be done within the task-decorated 
method itself.  I then changed the "simplified" example to use the 
MSGraphAsyncOperator.  And there, as Jarek also mentioned, some operators do 
validation and post-processing outside of the hook, which is the case with the 
MSGraphAsyncOperator as with many other operators.  This is not because the 
operator is badly written; it's due to the nature of deferrable tasks, which 
combine an operator and a triggerer into what behaves as one async operator, 
as this is how the deferred task mechanism is designed in Airflow.  If you had 
to implement the same logic in a PythonOperator or a task-decorated method, 
the code would become very messy (as it is async); you would have to reinvent 
the whole post-processing step, e.g. the next_method part the operator 
executes as a result of the triggerer.  That's why I think this functionality 
could be handy: the complexity of handling async code, parallelism, and the 
deferred execution mechanism is handled for you, just as dynamic task mapping 
does it, but executed within the same worker, concurrently or, if you prefer, 
sequentially, as this can be parametrized.  You could also apply the same 
functionality on a task-decorated method and focus on the business code 
instead of solving technical issues.
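To illustrate the kind of boilerplate such a feature would hide, here is a 
minimal, Airflow-free sketch (the fetch_item coroutine and the max_concurrency 
parameter are hypothetical stand-ins, not the actual proposed API) of running 
many small async calls within one worker, either sequentially or concurrently 
depending on a single parameter:

```python
import asyncio


# Hypothetical stand-in for one small async hook call
# (e.g. a single MS Graph request).
async def fetch_item(item: int) -> int:
    await asyncio.sleep(0)  # pretend this is network I/O
    return item * 2


async def run_all(items, max_concurrency: int = 1):
    # max_concurrency=1 behaves sequentially; a higher value runs the
    # calls concurrently within the same worker process.
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(item):
        async with semaphore:
            return await fetch_item(item)

    # gather() preserves input order regardless of completion order.
    return await asyncio.gather(*(bounded(i) for i in items))


results = asyncio.run(run_all(range(5), max_concurrency=3))
print(results)  # [0, 2, 4, 6, 8]
```

This is roughly the plumbing (event loop, bounded concurrency, result 
ordering) that a DAG author would otherwise have to write by hand inside a 
PythonOperator.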

At our company we now have lots of DAGs with custom code to solve those 
issues, and as I said at the beginning, we want to make our DAGs as clean and 
simple as possible, avoid code duplication, and reuse Airflow's capabilities 
to the maximum without reinventing the wheel.  That's why I came up with this 
solution, and I thought it could be a nice feature for Airflow 3.x.  We are 
already using it in our Airflow instances; the performance gain is huge and 
the DAGs are much easier to understand, with less technical code, as the 
problem has been solved in a generic way.  That's also why I implemented it 
with the functionality available in Airflow 2.x as a POC, but as already 
mentioned, it would indeed be better to redo it cleanly in Airflow 3.x.  Now 
that I know it's possible in Airflow 2.x, I wanted to come back to you with 
this proposition.  I also don't think it's a good idea to still implement it 
in Airflow 2.x, as Airflow 3.0 is on its way; I see it rather as a possible 
new feature for Airflow 3.x, which would be an additional argument for 
migrating to 3.x.  Anyway, as an Airflow user and contributor, I'm willing to 
help here.  For us it doesn't matter, as we can already use it, but it would 
be nice if other Airflow users could also benefit from this functionality, as 
I know this topic has been discussed many times.

-----Original Message-----
From: Jarek Potiuk <ja...@potiuk.com> 
Sent: Friday, October 4, 2024 5:52 AM
To: dev@airflow.apache.org
Subject: Re: [PROPOSAL] Add streaming support to PartialOperator

> why not just do things sequentially in a loop inside of a task?

Yes I think you nailed it - and I think it's just the abstraction you use in 
this case.

When you loop in the task to do a small thing many times with one of 
Airflow's integrations, you could use a Hook for that. But, apparently, this 
abstraction is difficult for users to discover, because all they know is 
"operators" and they do not know what Hooks do.

So .. the natural way to interact with external integration for many of our 
users is via Operators - so to allow such looping using operators sounds like 
"follow what is natural for your users".

Basically, we are not telling the users "use hooks"; we are following what 
our users want to do, which in this case is "use operators", as it feels more 
natural to them. Now that I think of it, it simply shows that we have two 
kinds of users in this case:

1) those who know and are happy to write custom operators (they will use
hooks)
2) those who are more comfortable just putting together existing building 
blocks, i.e. operators (all they know are DAGs, operators, and dependencies, 
and when they come to composing things, they think of task flow as the way to 
compose the things they know)

Clearly, allowing the use of operators in task flow in this mode would 
respond to the 2nd group of users.

For me this is the kind of leadership model we should follow: when you, as a 
leader in a space, try to convince others to do things one way, but pretty 
much everyone refuses to follow and stubbornly attempts to use the thing you 
think is wrong, maybe it's a good time to think "well, maybe they are right".

J.


On Thu, Oct 3, 2024 at 8:10 PM Daniel Standish 
<daniel.stand...@astronomer.io.invalid> wrote:

> The thing I'm having trouble with is that the problem the user, David, 
> is trying to solve is basically that Airflow doesn't like super fine-grained
> tasks.   Let's push this to the limit.  I run an ecommerce company
> that has 10M visitors per day and each time they visit we update the 
> visitor table.  I want to run a daily job to process updates.  Should 
> I model my pipeline as 1 task per customer?  Probably not a good idea.
>
> There's a reason e.g. that databases exist and you can do things in a 
> set-based way.  There seems to be an analogy here with David's example.
> That's why I asked why model it so fine-grained.  He does not seem to 
> want to write a custom operator, but it would seem it's probably a 
> good idea here.  One way of thinking about the use case is, I want to 
> do things sequentially in a loop -- why not just do things 
> sequentially in a loop inside of a task?
>
