Github user fhueske commented on a diff in the pull request:

    https://github.com/apache/flink/pull/1255#discussion_r43006432

    --- Diff: flink-java/src/main/java/org/apache/flink/api/java/DataSet.java ---
    @@ -1223,6 +1230,51 @@ public long count() throws Exception {
             final TypeInformation<K> keyType = TypeExtractor.getKeySelectorTypes(keyExtractor, getType());
             return new PartitionOperator<T>(this, PartitionMethod.HASH, new Keys.SelectorFunctionKeys<T, K>(clean(keyExtractor), this.getType(), keyType), Utils.getCallLocationName());
         }
    +
    +    /**
    +     * Range-partitions a DataSet using the specified KeySelector.
    +     * <p>
    +     * <b>Important:</b>This operation shuffles the whole DataSet over the network and can take significant amount of time.
    +     *
    +     * @param keySelector The KeySelector with which the DataSet is range-partitioned.
    +     * @return The partitioned DataSet.
    +     *
    +     * @see KeySelector
    +     */
    +    public <K extends Comparable<K>> DataSet<T> partitionByRange(KeySelector<T, K> keySelector) {
    +        final TypeInformation<K> keyType = TypeExtractor.getKeySelectorTypes(keySelector, getType());
    +        String callLocation = Utils.getCallLocationName();
    +
    +        // Extract key from input element by keySelector.
    +        KeyExtractorMapper<T, K> keyExtractorMapper = new KeyExtractorMapper<T, K>(keySelector);
    --- End diff --

It would be good to inject the sampling and partition ID assignment code in the `JobGraphGenerator` and not at the API level. The `JobGraphGenerator` is called after the `Optimizer` and translates the optimized plan into a parallel data flow called `JobGraph`, which is executed by the runtime.

The benefit of injecting the code at this point is that any range partitioning can be handled transparently within the optimizer. This also means that operators other than the explicit `partitionByRange()`, such as Join, CoGroup, and Reduce, can benefit from range partitioning. In addition, this makes the injected code a part of the runtime, which can then be improved more transparently later on.

The downside (for you) is that the job is represented at a much lower level of abstraction here. However, you still have access to the chosen key fields and the type information of all operators. See the `JavaApiPostPass` class to learn how to generate serializers and comparators at this level.
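
To make the two steps mentioned above (sampling and partition ID assignment) concrete, here is a minimal plain-Java sketch of the general idea. It deliberately uses no Flink classes: `RangePartitionSketch`, `computeBoundaries`, and `assignPartition` are hypothetical names for illustration, and the sketch assumes keys are `Comparable`, whereas code injected at the `JobGraphGenerator` level would presumably rely on generated comparators instead.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration only; none of these names exist in Flink.
class RangePartitionSketch {

    // Step 1: derive (numPartitions - 1) split points from a sample of the keys.
    static <K extends Comparable<K>> List<K> computeBoundaries(List<K> sample, int numPartitions) {
        List<K> sorted = new ArrayList<>(sample);
        Collections.sort(sorted);
        List<K> boundaries = new ArrayList<>();
        if (sorted.isEmpty()) {
            return boundaries; // no sample: every key ends up in partition 0
        }
        for (int i = 1; i < numPartitions; i++) {
            // Pick evenly spaced elements of the sorted sample as split points.
            int idx = (int) ((long) i * sorted.size() / numPartitions);
            boundaries.add(sorted.get(idx));
        }
        return boundaries;
    }

    // Step 2: map a key to a partition ID in [0, boundaries.size()].
    // Partition i receives keys k with boundaries[i-1] < k <= boundaries[i].
    // If the sample yields duplicate split points, keys equal to them may
    // land in any of the partitions that share that split point.
    static <K extends Comparable<K>> int assignPartition(K key, List<K> boundaries) {
        int pos = Collections.binarySearch(boundaries, key);
        // A negative result encodes the insertion point as -(insertionPoint) - 1.
        return pos >= 0 ? pos : -(pos + 1);
    }

    public static void main(String[] args) {
        List<Integer> sample = Arrays.asList(80, 10, 70, 20, 60, 30, 50, 40);
        List<Integer> boundaries = computeBoundaries(sample, 4);
        System.out.println("boundaries = " + boundaries);
        for (int key : new int[] {5, 30, 31, 99}) {
            System.out.println(key + " -> partition " + assignPartition(key, boundaries));
        }
    }
}
```

In a real job the sample would come from the data itself and the boundaries would be distributed to all senders before the shuffle; this sketch only shows the boundary computation and the lookup.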