Assume you have a groupBy followed by a join.

DataSet1 (nor sorted) -> groupBy(A) --> join(1.A == 2.A)
                                        ^
DataSet2 (sorted on A) -----------------+

For groupBy(A) of DataSet1 the optimizer can pick hash-grouping or the
more expensive sort-based-grouping. If the optimizer pick
sort-based-grouping, the join becomes super cheap because if can just
perform a merge-join (with the need to sort the data, because both
datasets will be sorted on A already). Thus, the overhead of sorting in
the group might pay of in the join.

-Matthias

On 04/15/2016 10:50 PM, CPC wrote:
> Hi
> 
> When i look for what kind of optimizations flink does, i found
> https://cwiki.apache.org/confluence/display/FLINK/Optimizer+Internals  is
> it up to date? Also i couldnt understand:
> 
> "Reusing of partitionings and sort orders across operators. If one operator
> leaves the data in partitioned fashion (and or sorted order), the next
> operator will automatically try and reuse these characteristics. The
> planning for this is done holistically and can cause earlier operators to
> pick more expensive algorithms, if they allow for better reusing of
> sort-order and partitioning."
> 
> Can you give example for "earlier operators to pick more expensive
> algorithms" ?
> 
> Regards
> 

Attachment: signature.asc
Description: OpenPGP digital signature

Reply via email to