Hello,
I think if you have multiple keyBy() transformations with identical
parallelism, the partitioning should be "preserved". The second keyBy()
will still go through the partitioning process, but since both the key
and the parallelism are identical, the resulting partitioning should be
identical as well, so no data is shuffled around. We aren't really
preserving the partitioning, but re-creating the original one.
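
A minimal sketch of what I mean (the job, types, and parallelism below
are illustrative, not from an actual pipeline):

import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class DoubleKeyByExample {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(4); // identical parallelism for both keyed stages

        env.fromElements(Tuple2.of("a", 1), Tuple2.of("b", 2), Tuple2.of("a", 3))
                // first keyBy(): hash-partitions records by the key in f0
                .keyBy(t -> t.f0)
                // this map leaves the key field untouched
                .map(t -> Tuple2.of(t.f0, t.f1 * 10))
                .returns(Types.TUPLE(Types.STRING, Types.INT))
                // second keyBy(): same key extractor, same parallelism, so
                // every record hashes to the subtask it is already on; the
                // partitioning runs again, but no record changes subtask
                .keyBy(t -> t.f0)
                .map(t -> Tuple2.of(t.f0, t.f1 + 1))
                .returns(Types.TUPLE(Types.STRING, Types.INT))
                .print();

        env.execute("double-keyBy sketch");
    }
}

Note that the second keyBy() still introduces a partitioning step (and
breaks operator chaining); the point is only that every record hashes
back to the subtask it already lives on.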
Regards,
Chesnay
On 12.04.2017 21:37, Ryan Conway wrote:
Greetings,
Is there a means of maintaining a stream's partitioning after running
it through an operation such as map or filter?
I have a pipeline stage S that operates on a stream partitioned by an
ID field. S flat-maps objects of type A to type B; both types have an
"ID" field, and each instance of B that S outputs has the same ID as
the instance of A it was produced from. I hope to add a pipeline stage
T immediately after S that operates with the same partitioning as S, so
that I can avoid the expense of re-keying the instances of type B.
If I am understanding the DataStream API correctly, this is not
feasible with Flink, as map(), filter(), etc. all return a
SingleOutputStreamOperator rather than a KeyedStream. But I am hoping
that I am missing something.
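
For concreteness, a hedged sketch of the shape of the pipeline; the
types A and B and the stage functions below are stand-ins, not my
actual code:

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RekeySketch {

    // Stand-in POJOs; both carry the ID field the stream is keyed by.
    public static class A { public String id; public A() {} public A(String id) { this.id = id; } }
    public static class B { public String id; public B() {} public B(String id) { this.id = id; } }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<A> input = env.fromElements(new A("x"), new A("y"));

        // Stage S: a keyed flatMap from A to B; every B it emits carries
        // the same ID as the A it was produced from.
        SingleOutputStreamOperator<B> afterS = input
                .keyBy(a -> a.id)
                .flatMap((FlatMapFunction<A, B>) (a, out) -> out.collect(new B(a.id)))
                .returns(B.class);

        // Stage T: afterS is a plain DataStream again, so the only way I
        // see to run T with the same partitioning is a second keyBy(),
        // which is the re-keying expense I would like to avoid.
        afterS.keyBy(b -> b.id)
                .map(b -> "T(" + b.id + ")")
                .returns(Types.STRING)
                .print();

        env.execute("re-key sketch");
    }
}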
Thank you,
Ryan