Looking at the DataSet and DataStream APIs, Aljoscha and I have concluded that a few methods provide the same functionality but are named differently. These are the following:

1. rebalance (batch) / distribute (streaming): Rebalances the data sent to the downstream operators, distributing it equally among them.

2. partitionByHash, partitionCustom (batch) / partitionBy (streaming): Partitioning has only recently been exposed in the streaming API and is not as refined as in the batch one. The streaming partitionBy is actually partitionByHash.

3. union (batch) / merge, connect (streaming): The streaming merge does a union of two streams of the same type. Connect is conceptually different: it provides a way of sharing state between two streams of potentially different types without mapping them to a common type and then merging them. This saves latency and an ugly mapping. Proper operator chaining can offset the latency cost, but the ugly mapping would remain if we did not have connect.

To consolidate the naming I would suggest the following:

1. Rename streaming distribute to rebalance.

2. Rename streaming partitionBy to partitionByHash and file a JIRA issue for custom partitioning support in streaming.

3. Rename streaming merge to union, and leave streaming connect as it is.
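
For anyone less familiar with the difference between items 1 and 2, here is a minimal sketch in plain Java of what rebalancing and hash partitioning do to a sequence of records. It does not use the Flink API at all: the class name PartitioningSketch and the helper newChannels are made up for illustration, the "channels" are just in-memory lists standing in for downstream operator instances, and using hashCode() of the whole element as the key is an assumption made to keep the example short.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: not Flink code, all names here are hypothetical.
public class PartitioningSketch {

    // Round-robin distribution: every channel receives (almost) the same
    // number of elements, regardless of their values.
    static <T> List<List<T>> rebalance(List<T> input, int channels) {
        List<List<T>> out = newChannels(channels);
        int next = 0;
        for (T element : input) {
            out.get(next).add(element);
            next = (next + 1) % channels;
        }
        return out;
    }

    // Hash partitioning: equal elements always land on the same channel,
    // so the load can be skewed if some values are frequent.
    static <T> List<List<T>> partitionByHash(List<T> input, int channels) {
        List<List<T>> out = newChannels(channels);
        for (T element : input) {
            // floorMod keeps the index non-negative for negative hash codes.
            int channel = Math.floorMod(element.hashCode(), channels);
            out.get(channel).add(element);
        }
        return out;
    }

    // Creates one empty list per downstream channel.
    private static <T> List<List<T>> newChannels(int channels) {
        List<List<T>> out = new ArrayList<>();
        for (int i = 0; i < channels; i++) {
            out.add(new ArrayList<>());
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("a", "a", "a", "b", "c", "d");
        System.out.println("rebalance:       " + rebalance(words, 2));
        System.out.println("partitionByHash: " + partitionByHash(words, 2));
    }
}
```

With the skewed input in main, rebalance spreads the six records three and three, while partitionByHash keeps all the "a" records together on one channel. That is the behaviour the names rebalance and partitionByHash are meant to convey in both APIs.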