Gopal V created HIVE-20624:
------------------------------

             Summary: Vectorization: Fast Hash table should not double after certain size, instead should grow
                 Key: HIVE-20624
                 URL: https://issues.apache.org/jira/browse/HIVE-20624
             Project: Hive
          Issue Type: Bug
          Components: Vectorization
            Reporter: Gopal V


The reason to use power-of-two sizes is to simplify the inner loop of the hash function (a mask instead of a modulo), but this becomes a significant memory issue when dealing with somewhat larger hashtables like the customer or customer_address hashtables in TPC-DS.

This doubling is pretty bad when the hashtable load factor is 0.75 and the expected key count is 65M (for customer/customer_address).

{code}
    long worstCaseNeededSlots = 1L << DoubleMath.log2(numRows / hashTableLoadFactor, RoundingMode.UP);
{code}
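For concreteness, here is a minimal plain-Java sketch of that estimate (replacing Guava's DoubleMath.log2 with equivalent bit arithmetic; the class and method names are illustrative, not Hive's): at 65M rows and a 0.75 load factor, only ~86.7M slots are needed, but the power-of-two rounding allocates 2^27 = 134,217,728.

```java
public class SlotEstimate {
    // Power-of-two slot estimate: ceil(numRows / loadFactor), rounded up
    // to the next power of two (same result as 1L << log2(x, RoundingMode.UP)).
    static long worstCaseNeededSlots(long numRows, double loadFactor) {
        long needed = (long) Math.ceil(numRows / loadFactor);
        int log2Up = 64 - Long.numberOfLeadingZeros(needed - 1);
        return 1L << log2Up;
    }

    public static void main(String[] args) {
        // 65M keys at 0.75 load factor need ~86.7M slots,
        // but power-of-two rounding allocates 134,217,728.
        System.out.println(worstCaseNeededSlots(65_000_000L, 0.75));
    }
}
```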

That estimate actually matches the implementation, but after accumulating 65M items in a single array, rehashing requires temporarily holding both the old and the doubled array - roughly 65M + 128M slots - all to fit exactly the same 65M items back in.

Fixing the estimate to match the implementation produced a number of regressions in query runtimes; the part that actually needs fixing is the doubling implementation.

The obvious solution is to add 4M more slots every time and use a modulo function or Lemire's multiply + shift operation[1], but more on that in the comments.
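The multiply + shift trick can be sketched as follows (an illustrative standalone snippet, not Hive code): it maps a 32-bit hash into [0, n) with one multiply and one shift, so the table size n no longer has to be a power of two and can grow by a fixed increment.

```java
public class FastRange {
    // Lemire's "fastrange": treat the hash as an unsigned 32-bit value,
    // multiply by n, and keep the high 32 bits of the 64-bit product.
    // The result is uniformly spread over [0, n) for any n, no modulo needed.
    static int fastRange(int hash, int n) {
        return (int) (((hash & 0xffffffffL) * (long) n) >>> 32);
    }

    public static void main(String[] args) {
        int n = 65_000_000 + 4_000_000;  // a non-power-of-two table size
        System.out.println(fastRange(0xdeadbeef, n));
    }
}
```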



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)