Hi all,

I was wondering about the heuristics for CombineHint:

Flink uses SORT by default, but the doc for HASH says that we should
expect it to be faster if the number of keys is less than 1/10th of the
number of records.
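
For reference, this is roughly how the hint is attached on my end
(DataSet API); the input below is just a toy stand-in for my real job:

import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.api.common.operators.base.ReduceOperatorBase.CombineHint;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;

ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

// toy stand-in for my real input: (key, count) pairs
DataSet<Tuple2<String, Long>> events =
    env.fromElements(Tuple2.of("a", 1L), Tuple2.of("b", 1L), Tuple2.of("a", 1L));

DataSet<Tuple2<String, Long>> counts = events
    .groupBy(0)
    .reduce(new ReduceFunction<Tuple2<String, Long>>() {
        @Override
        public Tuple2<String, Long> reduce(Tuple2<String, Long> a,
                                           Tuple2<String, Long> b) {
            return Tuple2.of(a.f0, a.f1 + b.f1);
        }
    })
    .setCombineHint(CombineHint.HASH); // instead of the SORT default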

HASH should be faster when it is able to combine many records, which
happens if multiple events for the same key are present in a data chunk
*that fits into the combine-hashtable* (cf. the handling in
ReduceCombineDriver.java).
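
My (simplified) mental model of that hash path, sketched with a plain
java.util.HashMap and made-up helper parameters (keyOf, reducer,
capacity) rather than the actual managed-memory hashtable:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BinaryOperator;
import java.util.function.Function;

class HashCombineSketch {
    // Records only get combined if two of them with the same key meet
    // while the table still has room; once the table is full it is
    // flushed, so matches across flush boundaries are lost. 'capacity'
    // stands in for "what fits into the combine-hashtable".
    static <K, T> List<T> hashCombine(Iterable<T> input,
                                      Function<T, K> keyOf,
                                      BinaryOperator<T> reducer,
                                      int capacity) {
        Map<K, T> table = new HashMap<>();
        List<T> out = new ArrayList<>();
        for (T record : input) {
            K key = keyOf.apply(record);
            T existing = table.get(key);
            if (existing != null) {
                table.put(key, reducer.apply(existing, record)); // a match
            } else {
                if (table.size() == capacity) {
                    out.addAll(table.values()); // full: flush everything
                    table.clear();
                }
                table.put(key, record);
            }
        }
        out.addAll(table.values());
        return out;
    }
}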

Now, if I have 10 billion events and 100 million keys, but only about 1
million records fit into a hashtable, the number of matches may be
extremely low, so very few events get combined (of course, the same
applies to SORT, since the sorter's memory is bounded, too).
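
To put a number on "extremely low" (back-of-envelope, assuming
uniformly distributed keys):

// With K = 1e8 keys and chunks of n = 1e6 records, how many records in
// a chunk can expect to meet another record with the same key?
long keys = 100_000_000L;   // K: distinct keys
long chunk = 1_000_000L;    // n: records that fit into one hashtable

// expected distinct keys per chunk: K * (1 - (1 - 1/K)^n)
double distinct = keys * (1.0 - Math.pow(1.0 - 1.0 / keys, chunk));
double combined = chunk - distinct; // records absorbed by the combiner

System.out.printf("~%.0f of %d records combined per chunk (%.2f%%)%n",
        combined, chunk, 100.0 * combined / chunk);
// -> roughly 5,000 of 1,000,000 (~0.5%), even though
//    #records/#keys = 100 would suggest a 100x reduction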

Am I correct in assuming that the actual tradeoff is based not only on
the ratio #total records : #keys, but also on #total records : #records
that fit into a single Sorter/Hashtable?

Thanks,
Urs

-- 
Urs Schönenberger - urs.schoenenber...@tngtech.com

TNG Technology Consulting GmbH, Betastr. 13a, 85774 Unterföhring
Managing Directors: Henrik Klagges, Dr. Robert Dahlke, Gerhard Müller
Registered office: Unterföhring * Amtsgericht München * HRB 135082
