The case you are making is that if a preceding operator in a chain repeatedly
emits the same object, and the succeeding operator gathers those objects,
then there is a problem.

Or are there cases where the system itself repeatedly emits the same
objects?
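
To check that I understand the first case, here is a minimal sketch in plain
Java. It is not Flink code: the class, the Counter record and all method names
are made up for illustration, and the chained call is modelled as a direct
method call with no serialization in between.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class ChainedReuseSketch {

    /** Hypothetical mutable record, e.g. a metrics object. */
    public static final class Counter {
        int value;
        Counter(int value) { this.value = value; }
        Counter copy() { return new Counter(value); }
    }

    /** Preceding operator: mutates ONE instance and emits it n times. */
    public static void emitReused(int n, Consumer<Counter> downstream) {
        Counter reused = new Counter(0);
        for (int i = 1; i <= n; i++) {
            reused.value = i;
            // Chained call: the very same reference is handed downstream.
            downstream.accept(reused);
        }
    }

    /** Succeeding operator that gathers the records by reference. */
    public static List<Integer> gatherByReference(int n) {
        List<Counter> buffer = new ArrayList<>();
        emitReused(n, buffer::add);
        return values(buffer);
    }

    /** Succeeding operator that copies each record before buffering it. */
    public static List<Integer> gatherWithCopy(int n) {
        List<Counter> buffer = new ArrayList<>();
        emitReused(n, c -> buffer.add(c.copy()));
        return values(buffer);
    }

    /** Reads out the values currently held by the buffered records. */
    private static List<Integer> values(List<Counter> buffer) {
        List<Integer> out = new ArrayList<>();
        for (Counter c : buffer) {
            out.add(c.value);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(gatherByReference(3)); // prints [3, 3, 3]
        System.out.println(gatherWithCopy(3));    // prints [1, 2, 3]
    }
}
```

Gathering by reference ends up with three entries that all point at the same
object, so only the last value survives. Copying before the record goes into
the buffer keeps each value.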

On Wed, May 20, 2015 at 3:07 PM, Gyula Fóra <gyf...@apache.org> wrote:

> We are designing a system for stateful stream computations, assuming long
> standing operators that gather and store data as the stream evolves (unlike
> in the DataSet API). Many programs, like windowing, sampling, etc., hold the
> state in the form of past data. And without careful understanding of the
> runtime these programs will break or have unnecessary copies.
>
> This is why I think immutability should be the default so we can have a
> clear dataflow model with immutable streams.
>
> I see absolutely no reason why we can't have the non-copy version as an
> optional setting for the users.
>
>
> On Wed, May 20, 2015 at 2:21 PM, Paris Carbone <par...@kth.se> wrote:
>
> > @stephan I see your point. If we assume that operators do not hold
> > references in their state to any transmitted records it works fine. We
> > therefore need to make this clear to the users. I need to check if that
> > would break semantics in SAMOA or other integrations as well that assume
> > immutability. For example in SAMOA there are often local metric objects
> > that are being constantly mutated and simply forwarded periodically to
> > other (possibly chained) operators that need to evaluate them.
> >
> > ________________________________________
> > From: Gyula Fóra <gyf...@apache.org>
> > Sent: Wednesday, May 20, 2015 2:06 PM
> > To: dev@flink.apache.org
> > Subject: Re: [DISCUSS] Re-add record copy to chained operator calls
> >
> > "Copy before putting it into a window buffer and any other group buffer."
> >
> > Exactly my point. Any stateful operator should be able to implement
> > something like this without having to worry about copying the object (and
> > at this point the user would need to know whether it comes from the
> > network to avoid unnecessary copies), so I don't agree with leaving the
> > copy off.
> >
> > The user can of course specify that the operator is mutable if he wants
> > (and he is worried about the performance), but I still think the default
> > behaviour should be immutable.
> > We cannot force users to not hold object references, and it is also a
> > quite unnatural way of programming in a language like Java.
> >
> >
> > On Wed, May 20, 2015 at 1:39 PM, Stephan Ewen <se...@apache.org> wrote:
> >
> > > I am curious why the copying is actually needed.
> > >
> > > In the batch API, we chain and do not copy and it is rather predictable.
> > >
> > > The cornerstones of that design are the following rules:
> > >
> > >  1) Objects read from the network or any buffer are always new objects.
> > > That comes naturally when they are deserialized as part of that (all
> > > buffers store serialized data).
> > >
> > >  2) After a function returns a record (or gives one to the collector), it
> > > is given to the chain of chained operators, but after it is through the
> > > chain, no one else holds a reference to that object.
> > >      For that, it is crucial that objects are not stored by reference, but
> > > either stored serialized, or a copy is stored.
> > >
> > > This is quite solid in the batch API. How about we follow the same
> > > paradigm in the streaming API? We would need to adjust the following:
> > >
> > > 1) Do not copy between operators (I think this is the case right now)
> > >
> > > 2) Copy before putting it into a window buffer and any other group
> > > buffer.
> > >
> > > On Wed, May 20, 2015 at 1:22 PM, Aljoscha Krettek <aljos...@apache.org>
> > > wrote:
> > >
> > > > Yes, in fact I anticipated this. There is one central place where we
> > > > can insert a copy step, in OperatorCollector in OutputHandler.
> > > >
> > > > On Wed, May 20, 2015 at 11:17 AM, Paris Carbone <par...@kth.se>
> > > > wrote:
> > > > > I guess it was not intended ^^.
> > > > >
> > > > > Chaining should be transparent and not break the correct/expected
> > > > > behaviour.
> > > > >
> > > > >
> > > > > Paris?
> > > > >
> > > > > On 20 May 2015, at 11:02, Márton Balassi <mbala...@apache.org>
> > > > > wrote:
> > > > >
> > > > > +1 for copying.
> > > > > On May 20, 2015 10:50 AM, "Gyula Fóra" <gyf...@apache.org> wrote:
> > > > >
> > > > > Hey,
> > > > >
> > > > > The latest streaming operator rework removed the copying of the
> > > > > outputs before passing them to chained operators. This is a major
> > > > > break for the previous operator semantics which guaranteed
> > > > > immutability.
> > > > >
> > > > > I think this change leads to very nondeterministic program behaviour
> > > > > from the user's perspective, as only non-chained outputs/inputs will
> > > > > be immutable. If we allow this to happen, users will start disabling
> > > > > chaining to get immutability, which defeats the purpose. (Chaining
> > > > > should not affect program behaviour, just increase performance.)
> > > > >
> > > > > In my opinion the default setting for each operator should be
> > > > > immutability, and the user could override this manually if he/she
> > > > > wants.
> > > > >
> > > > > What do you think?
> > > > >
> > > > > Regards,
> > > > > Gyula
> > > > >
> > > > >
> > > >
> > >
> >
>
