Hi Gábor,

thank you very much for your explanation; that makes a lot of sense.
Best regards,
Urs

On 05.09.2017 14:32, Gábor Gévay wrote:
> Hi Urs,
>
> Yes, the 1/10th ratio is just a very loose rule of thumb. I would
> suggest trying both the SORT and HASH strategies with a workload that
> is as similar as possible to your production workload (similar data,
> similar parallelism, etc.), and seeing which one is faster for your
> specific use case.
>
> An important difference between the HASH and SORT strategies is that
> the sorting combiner stores the original input records, while the
> hash combiner stores only combined records. That is, when an input
> record arrives whose key is already in the hash table, the record
> won't consume additional memory, because it is combined right away.
> Therefore, if you would like your combiner to never emit records
> prematurely (i.e., to combine everything possible without running out
> of memory), then with the SORT strategy you need combiner memory
> proportional to your input size, while with the HASH strategy you
> need combiner memory proportional only to the number of keys.
>
> You are correct that the performance depends very much on how many
> records fit into a single Sorter/Hashtable. However, I put
> #keys/#total records into the documentation because that is easier
> for a user to estimate, and a small value of this ratio correlates
> with the HASH strategy being faster, as explained above.
>
> Best,
> Gábor
>
>
> On Thu, Aug 31, 2017 at 4:02 PM, Aljoscha Krettek <aljos...@apache.org> wrote:
>> Hi,
>>
>> I would say that your assumption is correct and that the COMBINE
>> strategy does in fact also depend on the ratio "#total records /
>> #records that fit into a single Sorter/Hashtable".
>>
>> I'm CC'ing Fabian, just to be sure. He knows this stuff better than
>> I do.
>>
>> Best,
>> Aljoscha
>>
>>> On 31. Aug 2017, at 13:41, Urs Schoenenberger
>>> <urs.schoenenber...@tngtech.com> wrote:
>>>
>>> Hi all,
>>>
>>> I was wondering about the heuristics for CombineHint:
>>>
>>> Flink uses SORT by default, but the doc for HASH says that we
>>> should expect it to be faster if the number of keys is less than
>>> 1/10th of the number of records.
>>>
>>> HASH should be faster if it is able to combine a lot of records,
>>> which happens if multiple events for the same key are present in a
>>> data chunk *that fits into a combine hash table* (cf. the handling
>>> in ReduceCombineDriver.java).
>>>
>>> Now, if I have 10 billion events and 100 million keys, but only
>>> about 1 million records fit into a hash table, the number of
>>> matches may be extremely low, so very few events get combined (of
>>> course, this is similar for SORT, as the sorter's memory is
>>> bounded, too).
>>>
>>> Am I correct in assuming that the actual tradeoff is not only based
>>> on the ratio #total records/#keys, but also on #total records /
>>> #records that fit into a single Sorter/Hashtable?
>>>
>>> Thanks,
>>> Urs
>>>
>>> --
>>> Urs Schönenberger - urs.schoenenber...@tngtech.com
>>>
>>> TNG Technology Consulting GmbH, Betastr. 13a, 85774 Unterföhring
>>> Geschäftsführer: Henrik Klagges, Dr. Robert Dahlke, Gerhard Müller
>>> Sitz: Unterföhring * Amtsgericht München * HRB 135082
>>

--
Urs Schönenberger - urs.schoenenber...@tngtech.com

TNG Technology Consulting GmbH, Betastr. 13a, 85774 Unterföhring
Geschäftsführer: Henrik Klagges, Dr. Robert Dahlke, Gerhard Müller
Sitz: Unterföhring * Amtsgericht München * HRB 135082
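
To make the tradeoff discussed in this thread concrete, here is a
back-of-envelope version of the calculation, using the hypothetical
numbers from Urs's original mail:

    N = total input records              = 10^10  (10 billion events)
    K = distinct keys                    = 10^8   (100 million keys)
    M = records fitting in one hashtable = 10^6

    K/N = 10^8 / 10^10 = 1/100, well below 1/10, so the documented
    heuristic suggests HASH.

    But M/K = 10^6 / 10^8 = 1/100: a single hash table can hold combined
    records for at most 1% of the keys, so most arriving records miss
    the table and are emitted uncombined. This is exactly why the second
    ratio, N/M (here 10^4), matters as well.

    To combine everything without premature emits, SORT needs combiner
    memory proportional to N (it stores original input records), while
    HASH needs memory proportional only to K (one combined record per
    key).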
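
The thread never shows the API itself, so here is a minimal,
self-contained sketch of how the two strategies can be selected on a
DataSet reduce via setCombineHint(), as available in the DataSet API of
the Flink versions current at the time of this thread; the input data
and key position below are placeholders:

    import org.apache.flink.api.common.functions.ReduceFunction;
    import org.apache.flink.api.common.operators.base.ReduceOperatorBase.CombineHint;
    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;
    import org.apache.flink.api.java.tuple.Tuple2;

    public class CombineHintComparison {

        // Sums the second field of (key, count) pairs sharing a key.
        private static final ReduceFunction<Tuple2<Long, Long>> SUM =
                (a, b) -> Tuple2.of(a.f0, a.f1 + b.f1);

        public static void main(String[] args) throws Exception {
            ExecutionEnvironment env =
                    ExecutionEnvironment.getExecutionEnvironment();

            // Placeholder input; substitute production-like data (similar
            // key distribution and parallelism) for a meaningful comparison.
            DataSet<Tuple2<Long, Long>> input = env.fromElements(
                    Tuple2.of(1L, 1L), Tuple2.of(2L, 1L), Tuple2.of(1L, 1L));

            // Hash-based combiner: stores one combined record per key, so
            // its memory use is roughly proportional to the number of keys.
            input.groupBy(0).reduce(SUM)
                 .setCombineHint(CombineHint.HASH)
                 .print();

            // Sort-based combiner (the default): buffers original input
            // records, so its memory use grows with the input size.
            input.groupBy(0).reduce(SUM)
                 .setCombineHint(CombineHint.SORT)
                 .print();
        }
    }

Each print() triggers a separate job execution, so timing the two jobs
on representative data gives exactly the comparison Gábor suggests.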