Hi all!

Now that we are coming to the next release, I wanted to make sure we
finalize the decision on that point, because it would be nice to not break
the behavior of system afterwards.

Right now, when tasks are chained together, the system copies the elements
always between different tasks in the same chain.

I think this policy was established under the assumption that copies do not
cost anything, given our own test examples, which mainly use immutable
types like Strings, boxed primitives, ..

In practice, a lot of data types are actually quite expensive to copy.

For example, a rather common data type in the event analysis of web-sources
is JSON Object.
Flink treats this as a generic type. Depending on its concrete
implementation, Kryo may have perform a serialization copy, which means
encoding into bytes (JSON encoding, charset encoding) and decoding again.

This has a massive impact on the out-of-the-box performance of the system.
Given that, I was wondering whether we should set to default policy to "not
copying".

That is basically the behavior of the batch API, and there has so far never
been an issue with that (people running into the trap of overwritten
mutable elements).

What do you think?

Stephan

Reply via email to