Gopal V created HIVE-20624:
------------------------------

             Summary: Vectorization: Fast Hash table should not double after certain size, instead should grow
                 Key: HIVE-20624
                 URL: https://issues.apache.org/jira/browse/HIVE-20624
             Project: Hive
          Issue Type: Bug
          Components: Vectorization
            Reporter: Gopal V
The reason to use power-of-two sizes is to simplify the slot computation in the hash function's inner loop, but this becomes a significant memory issue for the somewhat larger hashtables in TPC-DS, such as the customer and customer_address tables. Doubling is particularly bad when the load factor is 0.75 and the expected key count is 65M (customer/customer_address):

{code}
long worstCaseNeededSlots = 1L << DoubleMath.log2(numRows / hashTableLoadFactor, RoundingMode.UP);
{code}

That estimate does match the actual implementation, but after 65M items have accumulated in a single array, the next rehash requires a temporary growth to 65M + 128M while the rehash is in progress, all to fit exactly 65M entries back into it.

Fixing the estimate to match the implementation produced a number of regressions in query runtimes, so the part that actually needs fixing is the doubling implementation. The obvious solution is to add 4M more slots every time and index with a modulo, or with Lemire's multiply + shift operation [1] - but more on that in the comments.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
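The multiply + shift reduction mentioned above can be sketched as follows. This is only an illustration of the technique, not the Hive patch: the class and method names here are hypothetical, and the constant 0x9E3779B9 is just an arbitrary sample hash. The trick is that for a 32-bit hash h and table size n, the high 32 bits of the 64-bit product h * n land uniformly in [0, n), so n no longer has to be a power of two.

```java
// Sketch of Lemire's "fastrange" multiply+shift reduction (hypothetical
// names; not the actual Hive implementation).
public class FastRange {

    // Map a 32-bit hash uniformly into [0, n) without a modulo.
    // (hash * n) >>> 32 keeps the high 32 bits of the 64-bit product.
    static int fastRange(int hash, int n) {
        return (int) (((hash & 0xffffffffL) * (long) n) >>> 32);
    }

    public static void main(String[] args) {
        // A grow-by-4M table size: not a power of two, yet still indexable.
        int n = 65_000_000 + 4_000_000;
        int slot = fastRange(0x9E3779B9, n);
        assert slot >= 0 && slot < n;

        // For comparison, the power-of-two sizing above: 65M keys at a
        // 0.75 load factor needs ceil(log2(65M / 0.75)) = 27 bits,
        // i.e. 2^27 = 134,217,728 (128M) slots.
        System.out.println(1 << 27);
    }
}
```

Unlike `hash % n`, the reduction compiles to one multiply and one shift, which is why it is a plausible replacement for the power-of-two mask in a hot probe loop.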