Hi Gábor,

thank you very much for your explanation; that makes a lot of sense.

Best regards,
Urs

On 05.09.2017 14:32, Gábor Gévay wrote:
> Hi Urs,
> 
> Yes, the 1/10th ratio is just a very loose rule of thumb. I would
> suggest trying both the SORT and HASH strategies with a workload that
> is as similar as possible to your production workload (similar data,
> similar parallelism, etc.), and seeing which one is faster for your
> specific use case.
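> 
> For reference, here is a minimal sketch of how the hint can be set on
> a reduce, assuming the Java DataSet API and its setCombineHint method;
> the count-per-key job and the sample data are just placeholders for
> your actual workload:
> 
>     import org.apache.flink.api.common.operators.base.ReduceOperatorBase.CombineHint;
>     import org.apache.flink.api.java.DataSet;
>     import org.apache.flink.api.java.ExecutionEnvironment;
>     import org.apache.flink.api.java.tuple.Tuple2;
> 
>     public class CombineHintExample {
>         public static void main(String[] args) throws Exception {
>             ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
> 
>             DataSet<Tuple2<String, Long>> input = env.fromElements(
>                     Tuple2.of("a", 1L), Tuple2.of("b", 2L), Tuple2.of("a", 3L));
> 
>             // Run the same job once with HASH and once with SORT (the
>             // default) on production-like data and compare the runtimes.
>             input.groupBy(0)
>                  .reduce((a, b) -> Tuple2.of(a.f0, a.f1 + b.f1))
>                  .setCombineHint(CombineHint.HASH)   // or CombineHint.SORT
>                  .print();
>         }
>     }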
> 
> An important difference between the HASH and SORT strategies is that
> the sorting combiner stores the original input records, while the
> hash combiner stores only combined records. That is, when an input
> record arrives whose key is already in the hashtable, it won't
> consume additional memory, because it is combined right away.
> Therefore, for example, if you would like your combiner to not emit
> any records prematurely (i.e., combine everything possible, without
> running out of memory), then with the SORT strategy you need combiner
> memory proportional to your input size, while with the HASH strategy
> you need combiner memory proportional only to the number of keys.
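> 
> As a toy illustration of that difference (this is not the actual
> ReduceCombineDriver code, just the two ideas side by side, with a
> simple count-per-key as the combine function):
> 
>     import java.util.ArrayList;
>     import java.util.HashMap;
>     import java.util.List;
>     import java.util.Map;
> 
>     public class CombinerMemorySketch {
>         public static void main(String[] args) {
>             String[] records = {"a", "b", "a", "a", "b", "c"};
> 
>             // Hash-style: combine on arrival, one entry per distinct key.
>             Map<String, Long> hashCombiner = new HashMap<>();
>             for (String key : records) {
>                 hashCombiner.merge(key, 1L, Long::sum);
>             }
>             System.out.println("hash combiner holds " + hashCombiner.size() + " entries"); // 3
> 
>             // Sort-style: buffer every original record; sorting and
>             // combining only happen when the buffer is flushed.
>             List<String> sortBuffer = new ArrayList<>();
>             for (String key : records) {
>                 sortBuffer.add(key);
>             }
>             System.out.println("sort buffer holds " + sortBuffer.size() + " records"); // 6
>         }
>     }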
> 
> You are correct that the performance depends very much on how many
> records fit into a single Sorter/Hashtable. However, I wrote
> #keys/#total records into the documentation because it is easier for
> a user to estimate, and a small value of this ratio correlates with
> the HASH strategy being faster, as explained above.
> 
> Best,
> Gábor
> 
> 
> 
> On Thu, Aug 31, 2017 at 4:02 PM, Aljoscha Krettek <aljos...@apache.org> wrote:
>> Hi,
>>
>> I would say that your assumption is correct and that the COMBINE strategy
>> does in fact also depend on the ratio "#total records/#records that fit
>> into a single Sorter/Hashtable".
>>
>> I'm CC'ing Fabian, just to be sure. He knows that stuff better than I do.
>>
>> Best,
>> Aljoscha
>>
>>> On 31. Aug 2017, at 13:41, Urs Schoenenberger 
>>> <urs.schoenenber...@tngtech.com> wrote:
>>>
>>> Hi all,
>>>
>>> I was wondering about the heuristics for CombineHint:
>>>
>>> Flink uses SORT by default, but the doc for HASH says that we should
>>> expect it to be faster if the number of keys is less than 1/10th of the
>>> number of records.
>>>
>>> HASH should be faster if it is able to combine a lot of records, which
>>> happens if multiple events for the same key are present in a data chunk
>>> *that fits into a combine-hashtable* (cf. the handling in
>>> ReduceCombineDriver.java).
>>>
>>> Now, if I have 10 billion events and 100 million keys, but only about 1
>>> million records fit into a hashtable, the number of matches may be
>>> extremely low, so very few events get combined (of course, this is
>>> similar for SORT, as the sorter's memory is bounded, too).
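>>>
>>> Back-of-the-envelope (assuming, on my part, that keys are uniformly
>>> distributed over the records): a chunk of n = 1 million records drawn
>>> from K = 100 million keys combines roughly n^2 / (2K) =
>>> 10^12 / (2 * 10^8) ≈ 5,000 records, i.e. only about 0.5% of the chunk.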
>>>
>>> Am I correct in assuming that the actual tradeoff is not only based on
>>> the ratio of #total records/#keys, but also on #total records/#records
>>> that fit into a single Sorter/Hashtable?
>>>
>>> Thanks,
>>> Urs
>>>
>>> --
>>> Urs Schönenberger - urs.schoenenber...@tngtech.com
>>>
>>> TNG Technology Consulting GmbH, Betastr. 13a, 85774 Unterföhring
>>> Geschäftsführer: Henrik Klagges, Dr. Robert Dahlke, Gerhard Müller
>>> Sitz: Unterföhring * Amtsgericht München * HRB 135082
>>

-- 
Urs Schönenberger - urs.schoenenber...@tngtech.com

TNG Technology Consulting GmbH, Betastr. 13a, 85774 Unterföhring
Geschäftsführer: Henrik Klagges, Dr. Robert Dahlke, Gerhard Müller
Sitz: Unterföhring * Amtsgericht München * HRB 135082
