Github user fhueske commented on a diff in the pull request:

    https://github.com/apache/flink/pull/1255#discussion_r43006432
  
    --- Diff: flink-java/src/main/java/org/apache/flink/api/java/DataSet.java 
---
    @@ -1223,6 +1230,51 @@ public long count() throws Exception {
                final TypeInformation<K> keyType = 
TypeExtractor.getKeySelectorTypes(keyExtractor, getType());
                return new PartitionOperator<T>(this, PartitionMethod.HASH, new 
Keys.SelectorFunctionKeys<T, K>(clean(keyExtractor), this.getType(), keyType), 
Utils.getCallLocationName());
        }
    +
    +   /**
    +    * Range-partitions a DataSet using the specified KeySelector.
    +    * <p>
    +    * <b>Important:</b>This operation shuffles the whole DataSet over the 
network and can take significant amount of time.
    +    *
    +    * @param keySelector The KeySelector with which the DataSet is 
range-partitioned.
    +    * @return The partitioned DataSet.
    +    *
    +    * @see KeySelector
    +    */
    +   public <K extends Comparable<K>> DataSet<T> 
partitionByRange(KeySelector<T, K> keySelector) {
    +           final TypeInformation<K> keyType = 
TypeExtractor.getKeySelectorTypes(keySelector, getType());
    +           String callLocation = Utils.getCallLocationName();
    +
    +           // Extract key from input element by keySelector.
    +           KeyExtractorMapper<T, K> keyExtractorMapper = new 
KeyExtractorMapper<T, K>(keySelector);
    --- End diff --
    
    It would be good to inject the sampling and partition ID assignment code in 
the `JobGraphGenerator` rather than at the API level. The `JobGraphGenerator` is 
called after the `Optimizer` and translates the optimized plan into a parallel 
data flow called the `JobGraph`, which is executed by the runtime. The benefit of 
injecting the code at this point is that any range partitioning can be handled 
transparently within the optimizer. This means that operators other than the 
explicit `partitionByRange()`, such as Join, CoGroup, and Reduce, can also 
benefit from range partitioning. In addition, this makes the injected code part 
of the runtime, where it can later be improved transparently to users.
    
    The downside (for you) is that the job is represented at a much lower level 
of abstraction at this point. However, you still have access to the chosen key 
fields and the type information of all operators. See the `JavaApiPostPass` 
class to learn how to generate serializers and comparators at this level.
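
    To make "sampling and partition ID assignment" concrete, here is a rough 
sketch of the two steps in plain Java. It is not Flink code and does not use the 
`JobGraphGenerator` or any other Flink API; the class `RangePartitionSketch` and 
its methods `computeBoundaries` and `assignPartition` are hypothetical names 
chosen only for illustration. It assumes a sample of keys has already been 
collected.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration only: not part of Flink and not the code in this PR.
public class RangePartitionSketch {

    // Derives up to (numPartitions - 1) boundary keys from a sample of keys.
    // An empty sample yields no boundaries, so every key maps to partition 0.
    static <K extends Comparable<K>> List<K> computeBoundaries(List<K> sample, int numPartitions) {
        List<K> sorted = new ArrayList<>(sample);
        Collections.sort(sorted);
        List<K> boundaries = new ArrayList<>();
        if (sorted.isEmpty()) {
            return boundaries;
        }
        for (int i = 1; i < numPartitions; i++) {
            // Pick evenly spaced elements of the sorted sample as range boundaries.
            int idx = (int) ((long) i * sorted.size() / numPartitions);
            boundaries.add(sorted.get(Math.min(idx, sorted.size() - 1)));
        }
        return boundaries;
    }

    // Maps a key to a partition ID in [0, boundaries.size()].
    // A key equal to a boundary goes to the partition that the boundary closes.
    static <K extends Comparable<K>> int assignPartition(K key, List<K> boundaries) {
        int pos = Collections.binarySearch(boundaries, key);
        // A negative result encodes the insertion point as -(insertionPoint) - 1.
        return pos >= 0 ? pos : -(pos + 1);
    }
}
```

    In the setup proposed above, logic of this kind would be injected by the 
`JobGraphGenerator` using the comparators generated for the chosen key fields, 
instead of relying on `Comparable` keys at the API level.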


