Hi all,

I was wondering about the heuristics for CombineHint:

Flink uses SORT by default, but the docs for HASH say we should expect it to be faster if the number of keys is less than one tenth of the number of records. HASH should be faster if it is able to combine a lot of records, which happens when multiple events for the same key are present in a data chunk *that fits into a combine hash table* (cf. the handling in ReduceCombineDriver.java).

Now, if I have 10 billion events and 100 million keys, but only about 1 million records fit into the hash table, the number of matches may be extremely low: assuming a roughly uniform key distribution, each 1-million-record chunk holds on average only 0.01 records per key, so almost every record is the only one with its key in its chunk, and very few events get combined. (Of course, the situation is similar for SORT, since the sorter's memory is bounded, too.)

Am I correct in assuming that the actual tradeoff is based not only on the ratio of #total records to #keys, but also on the ratio of #total records to #records that fit into a single sorter/hash table?

Thanks,
Urs

--
Urs Schönenberger - urs.schoenenber...@tngtech.com
TNG Technology Consulting GmbH, Betastr. 13a, 85774 Unterföhring
Managing Directors: Henrik Klagges, Dr. Robert Dahlke, Gerhard Müller
Registered office: Unterföhring * Amtsgericht München * HRB 135082
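
P.S.: For concreteness, here is a minimal sketch of the kind of job I have in mind. The class name, the toy input, and the (key, count) schema are made up for illustration; setCombineHint() and the CombineHint enum are the actual DataSet API:

    import org.apache.flink.api.common.operators.base.ReduceOperatorBase.CombineHint;
    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;
    import org.apache.flink.api.java.tuple.Tuple2;

    public class CombineHintSketch {
        public static void main(String[] args) throws Exception {
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

            // Toy stand-in for the real 10-billion-event source:
            // (key, count) pairs, keys drawn from a large key space.
            DataSet<Tuple2<Long, Long>> events = env.fromElements(
                    Tuple2.of(1L, 1L), Tuple2.of(2L, 1L), Tuple2.of(1L, 1L));

            DataSet<Tuple2<Long, Long>> counts = events
                    .groupBy(0)  // group on the key field
                    .reduce((a, b) -> Tuple2.of(a.f0, a.f1 + b.f1))
                    // Force the hash-based combiner instead of the SORT default;
                    // this is the hint whose heuristics I am asking about.
                    .setCombineHint(CombineHint.HASH);

            counts.print();
        }
    }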
Flink uses SORT by default, but the doc for HASH says that we should expect it to be faster if the number of keys is less than 1/10th of the number of records. HASH should be faster if it is able to combine a lot of records, which happens if multiple events for the same key are present in a data chunk *that fits into a combine-hashtable* (cf handling in ReduceCombineDriver.java). Now, if I have 10 billion events and 100 million keys, but only about 1 million records fit into a hashtable, the number of matches may be extremely low, so very few events are getting combined (of course, this is similar for SORT as the sorter's memory is bounded, too). Am I correct in assuming that the actual tradeoff is not only based on the ratio of #total records/#keys, but also on #total records/#records that fit into a single Sorter/Hashtable? Thanks, Urs -- Urs Schönenberger - urs.schoenenber...@tngtech.com TNG Technology Consulting GmbH, Betastr. 13a, 85774 Unterföhring Geschäftsführer: Henrik Klagges, Dr. Robert Dahlke, Gerhard Müller Sitz: Unterföhring * Amtsgericht München * HRB 135082