[DISCUSS] Consolidate method naming between the batch and streaming API

Márton Balassi Mon, 01 Jun 2015 07:13:13 -0700

Looking at the DataSet and DataStream APIs, Aljoscha and I have come to the
conclusion that a few methods provide the same functionality but are named
differently. These are the following:


   1. rebalance (batch) / distribute (streaming): Rebalances the data sent
   to the downstream operators, thus distributing it evenly.
   2. partitionByHash, partitionCustom (batch) / partitionBy (streaming):
   Partitioning has just recently been exposed in the streaming API and is not
   as refined as the batch one. The streaming partitionBy is actually
   partitionByHash.
   3. union (batch) / merge, connect (streaming): The streaming merge does
   a union of two streams with the same type. Connect is conceptually
   different: it provides a way of sharing state between two streams with
   potentially different types without mapping them to a common type and then
   merging them. This saves latency and an ugly mapping. The latency advantage
   could also be obtained through proper operator chaining, but the ugly
   mapping would remain if we did not have connect.
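
To make the semantic differences between these operations concrete, here is a minimal conceptual sketch in plain Python. It does not use the Flink API; the function names mirror the methods discussed above, but the signatures, the list-of-lists model of parallel channels, and the round-robin interpretation of "rebalance" are assumptions made purely for illustration.

```python
from typing import Callable, Hashable, Iterable, List, TypeVar

A = TypeVar("A")
B = TypeVar("B")
R = TypeVar("R")


def rebalance(records: Iterable[A], parallelism: int) -> List[List[A]]:
    """Spread records evenly over the channels (modeled as round-robin)."""
    channels: List[List[A]] = [[] for _ in range(parallelism)]
    for i, record in enumerate(records):
        channels[i % parallelism].append(record)
    return channels


def partition_by_hash(
    records: Iterable[A], key: Callable[[A], Hashable], parallelism: int
) -> List[List[A]]:
    """Send every record with the same key to the same channel."""
    channels: List[List[A]] = [[] for _ in range(parallelism)]
    for record in records:
        channels[hash(key(record)) % parallelism].append(record)
    return channels


def union(first: Iterable[A], second: Iterable[A]) -> List[A]:
    """Combine two streams of the same type into one."""
    return list(first) + list(second)


def connect(
    first: Iterable[A],
    second: Iterable[B],
    on_first: Callable[[A], R],
    on_second: Callable[[B], R],
) -> List[R]:
    """Process two differently typed streams side by side.

    Each input keeps its own type and its own handler, so no mapping to a
    common type is needed before combining. A real stream would interleave
    the inputs; this sketch simply handles the first input, then the second.
    """
    return [on_first(r) for r in first] + [on_second(r) for r in second]
```

In this model, union requires both inputs to share a type, whereas connect takes one handler per input, which is why it avoids the intermediate mapping step described in item 3.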

To consolidate the naming I would suggest the following:

   1. Rename streaming distribute to rebalance.
   2. Rename streaming partitionBy to partitionByHash and file a JIRA for
   custom partitioning support for streaming.
   3. Rename streaming merge to union, leave streaming connect as it is.
