Thanks a lot for the discussion! I will open a voting thread shortly!
Kostas

On Mon, Aug 24, 2020 at 9:46 AM Kostas Kloudas <kklou...@apache.org> wrote:
>
> Hi Guowei,
>
> Thanks for the insightful comment!
>
> I agree that this can be a limitation of the current runtime, but I
> think that this FLIP can go on as it discusses mainly the semantics
> that the DataStream API will expose when applied on bounded data.
> There will definitely be other FLIPs that will actually handle the
> runtime-related topics.
>
> But it is good to document them nevertheless so that we soon start
> ironing out the remaining rough edges.
>
> Cheers,
> Kostas
>
> On Mon, Aug 24, 2020 at 9:16 AM Guowei Ma <guowei....@gmail.com> wrote:
> >
> > Hi, Klou
> >
> > Thanks for your proposal. It's a very good idea.
> > Just a small comment about the "Batch vs Streaming Scheduling": in the
> > AUTOMATIC execution mode, we may not be able to pick the BATCH execution
> > mode even if all sources are bounded. For example, some applications use
> > the `CheckpointListener`, which is not available in BATCH mode in the
> > current implementation.
> > So maybe we need more checks in the AUTOMATIC execution mode.
> >
> > Best,
> > Guowei
> >
> >
> > On Thu, Aug 20, 2020 at 10:27 PM Kostas Kloudas <kklou...@apache.org> wrote:
> >>
> >> Hi all,
> >>
> >> Thanks for the comments!
> >>
> >> @Dawid: "execution.mode" can be a nice alternative and, from a quick
> >> look, it is not currently used by any configuration option. I will
> >> update the FLIP accordingly.
> >>
> >> @David: Given that having the option to allow timers to fire at the
> >> end of the job is already in the FLIP, I will leave it as is and I
> >> will update the default policy to be "ignore processing time timers
> >> set by the user". This will allow existing DataStream programs to run
> >> on bounded inputs. This update will affect point 2 in the "Processing
> >> Time Support in Batch" section.
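[Editor's note: Guowei's caveat above, that AUTOMATIC should not select BATCH purely because all sources are bounded, can be sketched as an extra check in the mode-selection logic. All names below are illustrative only, not Flink's actual API; the `CheckpointListener` check stands in for whatever "more checks" the implementation would need.]

```java
import java.util.List;

public class ExecutionModeResolver {
    enum Mode { BATCH, STREAMING }

    // Minimal stand-in for whatever metadata the planner has per source.
    static class SourceInfo {
        final boolean bounded;
        SourceInfo(boolean bounded) { this.bounded = bounded; }
    }

    static Mode resolveAutomatic(List<SourceInfo> sources, boolean usesCheckpointListener) {
        boolean allBounded = sources.stream().allMatch(s -> s.bounded);
        // Guowei's point: even with only bounded sources, fall back to
        // STREAMING if the job relies on checkpoint-dependent hooks
        // (e.g. CheckpointListener) that BATCH does not currently support.
        if (allBounded && !usesCheckpointListener) {
            return Mode.BATCH;
        }
        return Mode.STREAMING;
    }

    public static void main(String[] args) {
        List<SourceInfo> bounded = List.of(new SourceInfo(true), new SourceInfo(true));
        System.out.println(resolveAutomatic(bounded, false)); // BATCH
        System.out.println(resolveAutomatic(bounded, true));  // STREAMING
    }
}
```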
> >>
> >> If these changes cover your proposals, then I would like to start a
> >> voting thread tomorrow evening, if that is ok with you.
> >>
> >> Please let me know by then.
> >>
> >> Kostas
> >>
> >> On Tue, Aug 18, 2020 at 3:54 PM David Anderson <da...@alpinegizmo.com> wrote:
> >> >
> >> > Being able to optionally fire registered processing time timers at the
> >> > end of a job would be interesting, and would help in (at least some of)
> >> > the cases I have in mind. I don't have a better idea.
> >> >
> >> > David
> >> >
> >> > On Mon, Aug 17, 2020 at 8:24 PM Kostas Kloudas <kklou...@apache.org> wrote:
> >> >>
> >> >> Hi Kurt and David,
> >> >>
> >> >> Thanks a lot for the insightful feedback!
> >> >>
> >> >> @Kurt: On the topic of checkpointing with batch scheduling, I totally
> >> >> agree with you that it requires a lot more work and careful thinking
> >> >> about the semantics. This FLIP was written under the assumption that if
> >> >> the user wants to have checkpoints on bounded input, he/she will have
> >> >> to go with STREAMING as the scheduling mode. Checkpointing for BATCH
> >> >> can be handled as a separate topic in the future.
> >> >>
> >> >> In the case of MIXED workloads and for this FLIP, the scheduling mode
> >> >> should be set to STREAMING. That is why the AUTOMATIC option sets
> >> >> scheduling to BATCH only if all the sources are bounded. I am not sure
> >> >> what the plans are at the scheduling level, but one could imagine that,
> >> >> in the future, for mixed workloads we first schedule all the bounded
> >> >> subgraphs in BATCH mode and allow only one UNBOUNDED subgraph per
> >> >> application, which is scheduled after all the bounded ones have
> >> >> finished. Essentially, the bounded subgraphs would be used to bootstrap
> >> >> the unbounded one. But I am not aware of any plans in that direction.
> >> >>
> >> >>
> >> >> @David: The handling of processing time timers is a topic that has also
> >> >> been discussed in the community in the past, but unfortunately I do not
> >> >> remember any final conclusion.
> >> >>
> >> >> In the current context and for bounded input, we chose to favor
> >> >> reproducibility of the result, as this is expected in batch processing
> >> >> where the whole input is available in advance. This is why this
> >> >> proposal suggests not allowing processing time timers. But I
> >> >> understand your argument that the user may want to be able to run the
> >> >> same pipeline on batch and streaming, and this is why we added the two
> >> >> options under future work, namely (from the FLIP):
> >> >>
> >> >> ```
> >> >> Future Work: In the future we may consider adding as options the
> >> >> capability of:
> >> >> * firing all the registered processing time timers at the end of a job
> >> >> (at close()) or,
> >> >> * ignoring all the registered processing time timers at the end of a
> >> >> job.
> >> >> ```
> >> >>
> >> >> Conceptually, we are essentially saying that batch execution is assumed
> >> >> to be instantaneous and to refer to a single "point" in time, and any
> >> >> processing-time timers set for the future may fire at the end of
> >> >> execution or be ignored (but not throw an exception). I could also see
> >> >> ignoring the timers in batch as the default, if this makes more sense.
> >> >>
> >> >> By the way, do you have any use cases in mind that would help us better
> >> >> shape our processing time timer handling?
> >> >>
> >> >> Kostas
> >> >>
> >> >> On Mon, Aug 17, 2020 at 2:52 PM David Anderson <da...@alpinegizmo.com> wrote:
> >> >> >
> >> >> > Kostas,
> >> >> >
> >> >> > I'm pleased to see some concrete details in this FLIP.
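[Editor's note: the two "future work" options quoted from the FLIP amount to a simple end-of-input policy for pending processing-time timers. A minimal, self-contained sketch; the names are hypothetical, not a Flink API, and timers are modeled as bare timestamps for illustration:]

```java
import java.util.ArrayList;
import java.util.List;

public class EndOfInputTimers {
    enum TimerPolicy { FIRE_AT_END_OF_INPUT, IGNORE }

    // At the end of bounded input, either fire every pending processing-time
    // timer (in timestamp order) or drop them all; in neither case throw.
    static List<Long> drain(List<Long> pendingTimers, TimerPolicy policy) {
        if (policy == TimerPolicy.FIRE_AT_END_OF_INPUT) {
            List<Long> fired = new ArrayList<>(pendingTimers);
            fired.sort(null); // natural (timestamp) order
            return fired;
        }
        return List.of();
    }

    public static void main(String[] args) {
        List<Long> pending = List.of(30L, 10L, 20L);
        System.out.println(drain(pending, TimerPolicy.FIRE_AT_END_OF_INPUT)); // [10, 20, 30]
        System.out.println(drain(pending, TimerPolicy.IGNORE));               // []
    }
}
```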
> >> >> >
> >> >> > I wonder if the current proposal goes far enough in the direction of
> >> >> > recognizing the need some users may have for "batch" and "bounded
> >> >> > streaming" to be treated differently. If I've understood it
> >> >> > correctly, the section on scheduling allows me to choose STREAMING
> >> >> > scheduling even if I have bounded sources. I like that approach,
> >> >> > because it recognizes that even though I have bounded inputs, I don't
> >> >> > necessarily want batch processing semantics. I think it makes sense
> >> >> > to extend this idea to processing time support as well.
> >> >> >
> >> >> > My thinking is that sometimes in development and testing it's
> >> >> > reasonable to run exactly the same job as in production, except with
> >> >> > different sources and sinks. While it might be a reasonable default,
> >> >> > I'm not convinced that switching a processing time streaming job to
> >> >> > read from a bounded source should always cause it to fail.
> >> >> >
> >> >> > David
> >> >> >
> >> >> > On Wed, Aug 12, 2020 at 5:22 PM Kostas Kloudas <kklou...@apache.org> wrote:
> >> >> >>
> >> >> >> Hi all,
> >> >> >>
> >> >> >> As described in FLIP-131 [1], we are aiming at deprecating the DataSet
> >> >> >> API in favour of the DataStream API and the Table API. After this work
> >> >> >> is done, the user will be able to write a program using the DataStream
> >> >> >> API and it will execute efficiently on both bounded and unbounded
> >> >> >> data. But before we reach this point, it is worth discussing and
> >> >> >> agreeing on the semantics of some operations as we transition from the
> >> >> >> streaming world to the batch one.
> >> >> >>
> >> >> >> This thread and the associated FLIP [2] aim at discussing these issues,
> >> >> >> as these topics are pretty important to users and can lead to
> >> >> >> unpleasant surprises if we do not pay attention.
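[Editor's note: David's point, that STREAMING should remain legal on bounded sources so the same job can run in dev and prod, can be sketched as the mapping from the user-requested mode to the effective one. Enum and method names are hypothetical; only the three-way mode choice comes from the FLIP discussion:]

```java
public class RequestedMode {
    enum Mode { AUTOMATIC, BATCH, STREAMING }
    enum Effective { BATCH, STREAMING }

    static Effective effective(Mode requested, boolean allSourcesBounded) {
        switch (requested) {
            case BATCH:
                // BATCH on unbounded input cannot terminate, so reject it.
                if (!allSourcesBounded) {
                    throw new IllegalStateException("BATCH requires bounded sources");
                }
                return Effective.BATCH;
            case STREAMING:
                // David's point: STREAMING stays legal even on bounded input,
                // e.g. a production streaming job tested against bounded sources.
                return Effective.STREAMING;
            default: // AUTOMATIC
                return allSourcesBounded ? Effective.BATCH : Effective.STREAMING;
        }
    }

    public static void main(String[] args) {
        System.out.println(effective(Mode.STREAMING, true)); // STREAMING
        System.out.println(effective(Mode.AUTOMATIC, true)); // BATCH
    }
}
```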
> >> >> >>
> >> >> >> Let's have a healthy discussion here and I will be updating the FLIP
> >> >> >> accordingly.
> >> >> >>
> >> >> >> Cheers,
> >> >> >> Kostas
> >> >> >>
> >> >> >> [1] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=158866741
> >> >> >> [2] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=158871522