Re: [PROPOSAL] Add streaming support to PartialOperator

Jarek Potiuk Tue, 03 Dec 2024 03:15:26 -0800

I also agree: maybe the intentions here are going in the right direction,
but the reason why something like that is needed is neither "streaming" nor
"solving problems with mapped operators".


For me - and maybe I am looking at this proposal through my own looking
glass and completely changing the intention - this might be the beginning of
a discussion on how we could think about "task affinity" - where we want to
run several dependent tasks (task group?) of Airflow on the same machine -
because we want to share in-memory data between all the tasks. And if I am
derailing the discussion too much - I will shut up, because while I think
we need it eventually (it's been raised multiple times in various
discussions before), we have pretty much no capacity to implement it now
while doing Airflow 3 (unless someone who does not do Airflow 3 would like
to make a POC of it without big hopes of a lot of meaningful discussion
till February or so).

While the reasoning and use case David explained initially are different
from that - I think the vision shares interesting properties with what I
think we eventually need: a "sub-workflow (task group?)" of Airflow tasks,
independently visualised in the UI, but running together (sequentially or in
parallel as needed) on a single machine where CPU and GPU memory can be
shared between the tasks. That IMHO will open up a number of optimisations
that will be needed for some of the low-level machine-learning/AI workflows
that modern building blocks of the machine learning "ecosystem" allow
(thanks to standards like Apache Arrow - where different tools operate on
the in-memory data with basically zero-copy overhead). And while it's not
really what David proposed, it might also achieve the same performance gains
that Daniel wants to achieve in his use case (almost as a by-product).
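
To make the zero-copy point concrete, here is a minimal sketch that assumes
only the Python standard library; it does not use Airflow or Apache Arrow
APIs. The function names are hypothetical, and the "tasks" are plain
functions in one process that share a single buffer through memoryview,
which is roughly the property that running dependent tasks on the same
machine would make possible:

```python
# Hypothetical sketch: three "tasks" in one process share one in-memory
# buffer through memoryview, so no bytes are copied between them.
# The names produce, scale_in_place and summarize are illustrative only.
from array import array


def produce(n: int) -> array:
    """First task: build a buffer of n 64-bit floats in memory."""
    return array("d", (float(i) for i in range(n)))


def scale_in_place(view: memoryview, factor: float) -> None:
    """Second task: mutate the shared buffer through a view (no copy)."""
    for i in range(len(view)):
        view[i] = view[i] * factor


def summarize(view: memoryview) -> float:
    """Third task: read the same buffer through a view (no copy)."""
    return sum(view)


data = produce(5)
view = memoryview(data)    # a window onto data's memory, not a copy
scale_in_place(view, 2.0)  # writes land directly in data
total = summarize(view)
```

Tasks that run on different machines would instead have to serialise the
buffer and ship it between workers, which is the overhead that task
affinity would avoid.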

And yes - I strongly agree with Ash that "streaming" is a very bad name
for either of the two cases - the one that David explained and mine.

But again - I might be derailing the discussion, so I will shut up if
that's the case.

J.




On Tue, Dec 3, 2024 at 11:43 AM Ash Berlin-Taylor <a...@apache.org> wrote:

> Hi David,
>
> As it stands today I’m -1 to accepting this for a couple of reasons, sorry:
>
> First of all: the implementation looks like it is a “parallel
> implementation” of the scheduler and triggerer. I know that it is in some
> ways a POC only, but there is more to my reasoning.
>
> It feels counter to mapped tasks — the entire point of using mapped tasks
> is to run a single task (or a group of tasks) over a repeated input, and
> have each one be able to run independently, be restarted or retried
> independently and to scale out independently. Right now we lose almost all
> of those benefits.
>
> On the name, “stream” is 100% the wrong word for this concept as it is not
> streaming to process data as it comes in, but the opposite almost, it’s
> batching it all up to run in one group. “Iterate” is much better.
>
> From chatting to David on slack it feels to me like this entire feature is
> built to work around a problem with a large number of mapped tasks. So I’m
> -1 to accepting this in the core project and we should instead spend our
> maintenance effort on improving mapped tasks. It can be maintained as a
> separate operator in a provider out of tree, and if it gains traction we
> could see about bringing it into core as an apache-maintained provider.
>
> In this particular case it also feels like it could be achieved with
> “executor=‘Celery’” + KEDA to scale the worker and get 90% of the same
> behaviour without any changes at all, or as Daniel suggested earlier,
> simply do this in a plain old `for` loop inside a task.
>
> The other idea might be to be able to have a mapped task go directly into
> a triggered — that might also gain the performance you want.
>
> Please let me know if I’ve misunderstood anything about your proposal, and
> sorry to be harsh, but one of the hardest things as a maintainer of an open
> source project is saying no to feature requests.
>
> -ash
>
> > On 6 Nov 2024, at 11:21, Blain David <david.bl...@infrabel.be> wrote:
> >
> > Hello guys,
> >
> > First of all, thank you all for taking the time to give your opinions
> > and insights regarding my proposal.  I also think it would indeed be
> > better to write an official AIP.  I just planted the seed here to see
> > how this proposal would be received.  I will try to do this as soon as
> > possible.
> >
> > Kind regards,
> > David
> >
> > From: Constance Martineau <consta...@astronomer.io>
> > Sent: Wednesday, 16 October 2024 23:06
> > To: dev@airflow.apache.org
> > Cc: Blain David <david.bl...@infrabel.be>
> > Subject: Re: [PROPOSAL] Add streaming support to PartialOperator
> >
> > That was a lot to read through, and to be honest, it's hard for me to
> > tell whether or not Jarek's proposal solves David's problem. However, if
> > the debate is whether it's worthwhile or not to provide a first-class
> > way for DAG authors to use Operators as part of TaskFlow tasks, it is.
> >
> > Operators are a major value-add to the Airflow ecosystem, and we're
> > implicitly forcing DAG authors to choose whether they value their DAGs
> > being pythonic and simpler to read and reason about (TaskFlow), or
> > whether they value limiting custom code (traditional syntax with
> > Operators). There should be a first-class way to have both, and while
> > it's possible to have dependencies between decorated tasks and
> > traditional tasks (and use hooks within tasks), you lose a lot of the
> > benefits and it's easier to revert to traditional syntax.
> >
> > On Tue, Oct 15, 2024 at 2:46 PM Jens Scheffler
> > <j_scheff...@gmx.de.invalid> wrote:
> > Hi all,
> >
> > thanks for picking up the discussion. Having followed the email chain a
> > bit, I would recommend spinning up an AIP for the implementation. There
> > might be one or multiple cases where this is a cool feature. Still, it
> > will add complexity and needs a closer discussion. The best place for
> > that discussion might be the AIP itself, and then once all questions and
> > details are described we can still VOTE on it.
> >
> > @David, can you follow the process described in
> > https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvement+Proposals
> > ?
> >
> > (I also have another use case in mind and am curious if the proposal
> > would also support this)
> >
> > Jens
> >
> > On 15.10.24 18:24, Daniel Standish wrote:
> >> RE SLAs there were actually a lot of people who chimed in and expressed
> >> concerns with the approach, but no one took the step of actually
> >> downvoting it.  It's hard to downvote and say no, this does not seem
> >> right.  And sometimes these things gain momentum and you don't want to
> >> be a stick in the mud, particularly if you don't have a better solution
> >> and someone has spent a lot of time on it.  But yeah, we should not be
> >> so timid about saying no that we never do it.
> >>
> >> I think I did not really engage with it until substantially later in the
> >> process, wish I could have engaged earlier.
> >>
> >> On the topic of streaming, yeah, I'm trying to do my part to engage in
> >> this thread.  I don't yet see and understand the value, so that's why I
> >> suggested fleshing out the proposal in a doc. I'm not ready to give any
> >> thumbs up yet cus I don't see it.  That doesn't mean the value isn't
> >> there, just that I don't see it / understand it yet.
> >>
> >> And yeah, we're only two people here engaging with this one, so it's
> >> good if others could consider the proposal also.  But people only have
> >> so much time.  And anyway, I think the proposal needs more clarity to
> >> be efficiently and accurately evaluated -- so formalizing it a bit,
> >> even if not precisely as an AIP, would help others to chime in.  Really
> >> get into what problem it's solving and why and how.
> >>
> >> On Tue, Oct 15, 2024 at 9:05 AM Jarek Potiuk <ja...@potiuk.com> wrote:
> >>
> >>> So I think what David really needs (from you, Daniel, and others) is
> >>> to know whether the idea sounds right. If it does, and we agree it is
> >>> something that should be clarified in detail and there are no major
> >>> blockers to moving in this direction - this can be turned into a
> >>> detailed proposal with the syntax.
> >>>
> >>> I think we have a long history of cases (like SLA) where we asked for
> >>> detailed AIPs, and then after they were delivered it turned out that
> >>> the idea was not right from the very beginning, but this feedback had
> >>> been missing. The SLA feature suffered from late feedback that "the
> >>> whole idea seems wrong".
> >>>
> >>> I think we should avoid such an approach. If we see that the general
> >>> idea is wrong we should give early feedback - and then engage in
> >>> detailed discussion - but without the "I have not paid attention
> >>> before but the whole thing is wrong".
> >>>
> >>> I think David is looking for this kind of confirmation, so that he
> >>> does not spend days and weeks on detailing a proposal that then gets
> >>> strangled to death because we did not like the idea in the first
> >>> place. That's very discouraging.
> >>>
> >>> J.
> >>>
> >>> On Tue, Oct 15, 2024 at 6:00 PM Jarek Potiuk <ja...@potiuk.com> wrote:
> >>>
> >>>> It's about the same. David's proposal is about stream syntax to run
> >>>> the operators in the task. So those are not two things - this is the
> >>>> "idea" (run operators in a loop in a task) and the implementation
> >>>> detail (stream syntax).
> >>>>
> >>>> I think at this stage I distilled the idea from the syntax proposal,
> >>>> and what we could do in the future is to make sure that syntax is
> >>>> good.
> >>>>
> >>>>
> >>>> J.
> >>>>
> >>>>
> >>>> On Tue, Oct 15, 2024 at 4:11 PM Daniel Standish
> >>>> <daniel.stand...@astronomer.io.invalid> wrote:
> >>>>
> >>>>> I'm still a bit fuzzy on the proposal.  It also seems at times like
> >>>>> you two (David and Jarek) are sorta talking about two different
> >>>>> things.  David: "stream" syntax.  Jarek: run operator in a task.
> >>>>>
> >>>>> I would suggest @David maybe just produce a sort of draft AIP, maybe
> >>>>> in Google Docs or something, and share it so interested parties can
> >>>>> review, understand it better, and possibly help shape the direction.
> >>>>>
> >
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscr...@airflow.apache.org
> For additional commands, e-mail: dev-h...@airflow.apache.org
>
>
