Hello Ash, First of all, don't apologize for being "harsh" as I don't perceive it that way. I also fully understand that as a maintainer you don't want to add more (and maybe unnecessary) complexity than necessary, especially if it feels like a work around for a specific problem.
As I said, regarding the name, we could indeed rename it as the IteratorOperator for example and rename the stream method the iterate, which would indeed make more sense. I don't have strong opinions on the name because as I said, I got the inspiration from the Java 8 stream API, which there you could also argue if the name they gave to that functionality is the correct one. I kept this name to make the link between the problem that was solved there and the same problem I wanted to solve with operators in Airflow. I agree with you on all you points, except the last one with the for-loop, because there it's a naïve solution to the problem. How are you going to loop over deferrable operators? First of all, you'll need to understand as a DAG author that the operator will raise a TaskDeferred exception, which means the simple for-loop approach won't work. Then you will realise that in fact you need to invoke the trigger within the TaskDeferred exception used by the deferrable operator, but that code is async, so there you'll have to cope with the event loop and finally, you will probably realise that at the end of that, the deferred operator will also need to call the next method (if defined). And even worse, you trigger could raise another TaskDeferred exception, which make it even more complex to handle (now you have recursion to handle). For example, the MSGraphAsyncOperator (which I wrote hence why I know it so well) works like that as it implements the producer/consumer pattern through an async triggerer (I think there is also a proposition for that to have a base implementation for this in Airflow as it didn't exist when I implemented it for the MSGraphOperator) when a REST API returns results through multiple pages, that way you don't block your workers waiting for the next response/page. That's where the StreamedOperator (or IteratorOperator if you prefer) is handy, it encapsulates all this complexity in a generic solution for you which you don't need to worry about and which you can apply on any operator, just like the partial/expand, which will not be possible with custom code in a PythonOperator. The custom for-loop (which isn't concurrent btw, you could implement it yourself of course as I showed with the snippet in slack), you also don't store the state of each processed item in the for-loop, if the PythonOperator fails, then you'll also need to re-execute it completely. I see it as an alternative way of processing multiple inputs over operators, which is more than a simple for-loop in a PythonOperator which loops over some hook invocations for example, as that is the simple use case. Also, by looping over the operators within a PythonOperator, you would also violate this concept (https://lists.apache.org/thread/nflt9h6dc5obzztmyqxlpxfs950rtqsq), which I also implemented the PR (https://github.com/apache/airflow/pull/37937) for once Jarek mentioned it as this was also an issue we had to deal with when reviewing DAG's. And like Joffrey mentioned, as an Airflow infra maintainer, you don't want to see such (very) technical code to solve a technical problem which can't be solved through the expand functionality. Kind regards, David -----Original Message----- From: Ash Berlin-Taylor <a...@apache.org> Sent: Tuesday, 3 December 2024 11:44 To: dev@airflow.apache.org Subject: Re: [PROPOSAL] Add streaming support to PartialOperator EXTERNAL MAIL: Indien je de afzender van deze e-mail niet kent en deze niet vertrouwt, klik niet op een link of open geen bijlages. Bij twijfel, stuur deze e-mail als bijlage naar ab...@infrabel.be<mailto:ab...@infrabel.be>. Hi David, As it stands today I’m -1 to accepting this for a couple of reasons, sorry: First of all: the implementation looks like it is a “parallel implementation” of the scheduler, triggerer. I know that it is in some ways a POC only, but there is more for my reasoning. It feels counter to mapped tasks — the entire point of using mapped tasks is to run a single task (or a group of tasks) over a repeated input, and have each one be able to run independently, be restarted or retried independently and to scale out independently. Right now we loose almost all of those benefits. On the name, “stream” is 100% the wrong word for this concept as it is not streaming to process data as it comes in, but the opposite almost, it’s batching it all up to run in one group. “Iterate” is much better. From chatting to David on slack it feels to me like this entire feature is built to work around a problem with a large number of mapped tasks. So I’m -1 to accepting this in the core project and we should instead spend our maintenance effort on improving mapped tasks. It can be maintained as a separate operator in a provider out of tree, and if it gains traction we could see about bringing it into core as an apache-maintained provider. In this particular case it also feels like it could be achieved with “executor=‘Celery’” + KEDA to scale the worker and get 90% of the same behaviour without any changes at all, or as Daniel suggested earlier, simply do this in a plain old `for` loop inside a task. The other idea might be to be able to have a mapped task go directly into a triggered — that might also gain the performance you want. Please let me know if I’ve misunderstood anything about your proposal, and sorry to be harsh, but one of the hardest things as a maintainer of an open source project is saying no to feature requests. -ash > On 6 Nov 2024, at 11:21, Blain David <david.bl...@infrabel.be> wrote: > > Hello guys, > > First of all, thank you all for taking your time and giving your opinions and > insights regarding my proposal. I also think it would indeed be better to do > an official AIP proposal. I just planted the seed here to see how this > proposal would be received. I will try to do this as soon as possible. > > Kind regards, > David > > From: Constance Martineau <consta...@astronomer.io> > Sent: Wednesday, 16 October 2024 23:06 > To: dev@airflow.apache.org > Cc: Blain David <david.bl...@infrabel.be> > Subject: Re: [PROPOSAL] Add streaming support to PartialOperator > > You don't often get email from > consta...@astronomer.io<mailto:consta...@astronomer.io>. Learn why > this is important<https://aka.ms/LearnAboutSenderIdentification> > > > EXTERNAL MAIL: Indien je de afzender van deze e-mail niet kent en deze niet > vertrouwt, klik niet op een link of open geen bijlages. Bij twijfel, stuur > deze e-mail als bijlage naar ab...@infrabel.be<mailto:ab...@infrabel.be>. > That was a lot to read through, and to be honest, it's hard for me to tell > whether or not Jarek's proposal solves David's problem. However, if the > debate is whether it's worthwhile or not to provide a first-class way for DAG > authors to use Operators as part of TaskFlow Tasks, it is. > > Operators are a major value-add to the Airflow Ecosystem, and we're > implicitly forcing DAG authors to choose whether they value their DAGs being > pythonic and simpler to read and reason (Taskflow), or whether they value > limiting custom code (Traditional Syntax with Operators). There should be a > first class way to have both, and while it's possible to have dependencies > between decorated tasks and traditional tasks (and use hooks within tasks), > you lose a lot of the benefits and it's easier to revert to traditional > syntax. > > On Tue, Oct 15, 2024 at 2:46 PM Jens Scheffler > <j_scheff...@gmx.de.invalid<mailto:j_scheff...@gmx.de.invalid>> wrote: > Hi all, > > thanks for picking-up the discussion. So following the email chain a > bit I would recommend to spin an AIP for the implementation. There > might be one or multiple cases where this is a cool feature. Still it > will add complexity and needs a closer discussion. The best discussion > might be on the AIP itself and then once all questions and details are > described we still can VOTE on it. > > @David, can you follow as described in > https://eur01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fcwik > i.apache.org%2Fconfluence%2Fdisplay%2FAIRFLOW%2FAirflow%2BImprovement% > 2BProposals&data=05%7C02%7Cdavid.blain%40infrabel.be%7C4c609779ca854eb > dd47d08dd138764a6%7Cb82bc314ab8e4d6fb18946f02e1f27f2%7C0%7C0%7C6386881 > 94471380080%7CUnknown%7CTWFpbGZsb3d8eyJFbXB0eU1hcGkiOnRydWUsIlYiOiIwLj > AuMDAwMCIsIlAiOiJXaW4zMiIsIkFOIjoiTWFpbCIsIldUIjoyfQ%3D%3D%7C0%7C%7C%7 > C&sdata=WlsUzIOy5SPV%2F5RFI3zGlGzEwm5U6MlTKRLjxQetzEE%3D&reserved=0 > ? > > (I also have another use case in mind and am courious if the propsal > would also support this) > > Jens > > On 15.10.24 18:24, Daniel Standish wrote: >> RE SLAs there was actually a lot of people who chimed in and >> expressed concerns with the approach, but no one took the step of >> actually down voting it. It's hard to down vote and say no this does not >> seem right. >> And sometimes these things gain a momentum and you don't want to be a >> stick in the mud, particularly if you don't have a better solution >> and someone has spent a lot of time on it. But yeah we should not be >> so timid about saying no that we never do it. >> >> I think I did not really engage with it until substantially later in >> the process, wish I could have engaged earlier. >> >> On the topic of streaming, yeah, I'm trying to do my part to engage >> in this thread. I don't yet see and understand the value so that's >> why I suggested fleshing out the proposal in a doc. I'm not ready to >> give any thumbs up yet cus I'd don't see it. That doesn't mean the >> value isn't there, just I don't see it / understand it yet. >> >> And yeah we're only two people here engaging with this one so, it's >> good if others could consider the proposal also. But people only >> have so much time. And anyway, I think the proposal needs more >> clarity to be efficiently and accurately evaluated -- so formalizing >> it a bit, even if not precisely an AIP, would help in others to chime >> in. Really get into what problem it's solving and why and how. >> >> >> >> >> >> >> >> >> >> On Tue, Oct 15, 2024 at 9:05 AM Jarek Potiuk >> <ja...@potiuk.com<mailto:ja...@potiuk.com>> wrote: >> >>> So I think what David really needs (from you Daniel and others) if >>> is the idaa sounds right, if it does and we agree it is something >>> that should be clarified in detail and there are no major blockers >>> to move in this direction - this can be turned into detailed >>> proposal with the syntax, >>> >>> I think we had a long story of some cases (like SLA) where we asked >>> for detailed AIPs and then after it has been delivered it turned out >>> that the idea from the very beginning was not right, but this >>> feedback has been missing. SLA feature sufferred from late feedback >>> that "the whole idea seems wrong". >>> >>> I think we should avoid such an approach. If we see that the general >>> idea is wrong we should give early feedback - and then engage in >>> detailed discussion - but without the "I have not paid attention >>> before but the whole thing is wrong". >>> >>> I think David is looking for this kind of confirmation, so that he >>> does not spend days and weeks on detailing a proposal then was >>> strangled to death because we did not like the idea in the first >>> place. That's very discouraging. >>> >>> J, >>> >>> On Tue, Oct 15, 2024 at 6:00 PM Jarek Potiuk >>> <ja...@potiuk.com<mailto:ja...@potiuk.com>> wrote: >>> >>>> It's about the same David's proposal is about stream syntax to run >>>> the operators in the task. So those are not two things - this is the "idea" >>>> (run operators in a loop in a task) and implementation detail >>>> (stream syntax). >>>> >>>> I think at this stage I distilled the idea from the syntax >>>> proposal, and what we could do in the future is to make sure that syntax >>>> is good. >>>> >>>> >>>> J. >>>> >>>> >>>> On Tue, Oct 15, 2024 at 4:11 PM Daniel Standish >>>> <daniel.stand...@astronomer.io.invalid<mailto:daniel.stand...@astronomer.io.invalid>> >>>> wrote: >>>> >>>>> I'm still a bit fuzzy on the proposal. It also seems at times >>>>> like you two (David and Jarek) are sorta talking about two >>>>> different things. David: >>>>> "stream" syntax. Jarek: run operator in a task. >>>>> >>>>> I would suggest @David maybe just produce a sort of draft AIP >>>>> maybe in google docs or something and share and interested parties >>>>> can review and understand better and possibly help shape the direction. >>>>> > > --------------------------------------------------------------------- > To unsubscribe, e-mail: > dev-unsubscr...@airflow.apache.org<mailto:dev-unsubscr...@airflow.apac > he.org> For additional commands, e-mail: > dev-h...@airflow.apache.org<mailto:dev-h...@airflow.apache.org> --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@airflow.apache.org For additional commands, e-mail: dev-h...@airflow.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@airflow.apache.org For additional commands, e-mail: dev-h...@airflow.apache.org