Assume you have a groupBy followed by a join. DataSet1 (nor sorted) -> groupBy(A) --> join(1.A == 2.A) ^ DataSet2 (sorted on A) -----------------+
For groupBy(A) of DataSet1 the optimizer can pick hash-grouping or the more expensive sort-based-grouping. If the optimizer pick sort-based-grouping, the join becomes super cheap because if can just perform a merge-join (with the need to sort the data, because both datasets will be sorted on A already). Thus, the overhead of sorting in the group might pay of in the join. -Matthias On 04/15/2016 10:50 PM, CPC wrote: > Hi > > When i look for what kind of optimizations flink does, i found > https://cwiki.apache.org/confluence/display/FLINK/Optimizer+Internals is > it up to date? Also i couldnt understand: > > "Reusing of partitionings and sort orders across operators. If one operator > leaves the data in partitioned fashion (and or sorted order), the next > operator will automatically try and reuse these characteristics. The > planning for this is done holistically and can cause earlier operators to > pick more expensive algorithms, if they allow for better reusing of > sort-order and partitioning." > > Can you give example for "earlier operators to pick more expensive > algorithms" ? > > Regards >
signature.asc
Description: OpenPGP digital signature