It seems I'm one of the few people who run into the mutable-elements trap
in the Batch API from time to time. At the moment I always clone when I'm
not 100% sure, to avoid hunting for the bugs later. So far I was happy to
learn that this is not a problem in Streaming, but that's just me.
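
To make the trap concrete, here is a minimal sketch in plain Java with no
Flink dependency. The Event class and the reusingIterator helper are made up
for illustration; they only imitate a runtime that hands the same object to
user code over and over. Buffering the references without cloning leaves
every buffered entry pointing at the last value.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

class MutableReuseDemo {

    /** Hypothetical mutable record, standing in for a user-defined type. */
    static final class Event {
        int value;

        Event copy() {
            Event e = new Event();
            e.value = value;
            return e;
        }
    }

    /** Hypothetical iterator that returns the SAME Event instance on every call. */
    static Iterator<Event> reusingIterator(final int... values) {
        final Event reused = new Event();
        return new Iterator<Event>() {
            private int pos = 0;

            @Override
            public boolean hasNext() {
                return pos < values.length;
            }

            @Override
            public Event next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                reused.value = values[pos++]; // overwrites the previous element
                return reused;
            }
        };
    }

    /** Buffers all elements, optionally cloning each one, then reads the values back. */
    static List<Integer> collect(Iterator<Event> in, boolean cloneFirst) {
        List<Event> buffered = new ArrayList<>();
        while (in.hasNext()) {
            Event e = in.next();
            buffered.add(cloneFirst ? e.copy() : e);
        }
        List<Integer> out = new ArrayList<>();
        for (Event e : buffered) {
            out.add(e.value);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(collect(reusingIterator(1, 2, 3), false)); // [3, 3, 3]
        System.out.println(collect(reusingIterator(1, 2, 3), true));  // [1, 2, 3]
    }
}
```

Cloning on every element is the defensive habit I described; it costs a copy
per element but keeps the buffered values independent.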

When working with groupby and partition functions, it's easy to forget that
there is one function instance per operator, not one per partition. So if
you write your code with the mindset that each partition gets a separate
object, a reduce at the operator level becomes really annoying.
Since the mapping between partitions and operators is usually hidden, this
makes debugging harder, especially in cases where the test data produces a
single partition per operator and the real deployment does not.
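
A minimal sketch of that second pitfall, again in plain Java rather than
against the real API. LeakySum and LocalSum are hypothetical stand-ins for a
user function; the point is only that a field on the function object lives as
long as the operator instance, so it carries over from one group (or
partition) to the next.

```java
import java.util.Arrays;
import java.util.List;

class SharedOperatorStateDemo {

    /** Hypothetical function that keeps its running sum in a field (the bug). */
    static final class LeakySum {
        private int sum = 0; // shared by every group this instance processes

        int reduceGroup(List<Integer> group) {
            for (int v : group) {
                sum += v;
            }
            return sum;
        }
    }

    /** Same logic, but the accumulator is local to each call. */
    static final class LocalSum {
        int reduceGroup(List<Integer> group) {
            int sum = 0;
            for (int v : group) {
                sum += v;
            }
            return sum;
        }
    }

    public static void main(String[] args) {
        // Two groups handled by ONE instance of each function.
        List<List<Integer>> groups =
                Arrays.asList(Arrays.asList(1, 2), Arrays.asList(10, 20));
        LeakySum leaky = new LeakySum();
        LocalSum local = new LocalSum();
        for (List<Integer> g : groups) {
            System.out.println(leaky.reduceGroup(g) + " vs " + local.reduceGroup(g));
        }
        // prints "3 vs 3" and then "33 vs 30"
    }
}
```

With a single group per instance, as in small test data, both versions agree,
which is why this kind of bug tends to show up only in the real deployment.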

*To summarize:*
I'm not against reusing objects as long as there is something that helps
users avoid the pitfalls. This could be coding guidelines, debugging tools,
or best practices.


On Fri, Oct 2, 2015 at 5:53 PM, Stephan Ewen <se...@apache.org> wrote:

> Hi all!
>
> Now that we are coming to the next release, I wanted to make sure we
> finalize the decision on that point, because it would be nice to not break
> the behavior of the system afterwards.
>
> Right now, when tasks are chained together, the system always copies the
> elements between different tasks in the same chain.
>
> I think this policy was established under the assumption that copies do not
> cost anything, given our own test examples, which mainly use immutable
> types like Strings, boxed primitives, ..
>
> In practice, a lot of data types are actually quite expensive to copy.
>
> For example, a rather common data type in the event analysis of web-sources
> is a JSON object.
> Flink treats this as a generic type. Depending on its concrete
> implementation, Kryo may have to perform a serialization copy, which means
> encoding into bytes (JSON encoding, charset encoding) and decoding again.
>
> This has a massive impact on the out-of-the-box performance of the system.
> Given that, I was wondering whether we should set the default policy to
> "not copying".
>
> That is basically the behavior of the batch API, and there has so far never
> been an issue with that (people running into the trap of overwritten
> mutable elements).
>
> What do you think?
>
> Stephan
>
