[jira] [Updated] (HIVE-6430) MapJoin hash table has large memory overhead

Sergey Shelukhin (JIRA) Wed, 16 Apr 2014 18:08:24 -0700

     [ 
https://issues.apache.org/jira/browse/HIVE-6430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Sergey Shelukhin updated HIVE-6430:
-----------------------------------

    Attachment: HIVE-6430.07.patch

This patch fixes several issues. The main change is that the Murmur hash from 
Guava is now used; the previous hash code method distributed keys very badly, 
and performance suffered a lot as a result.
There was also an issue with the previously used expand method. To make expand 
fast, the full hash is now stored with each key. It is not needed for anything 
else, so this is a tradeoff: more memory (+4 bytes per key) versus an 
expensive rehash. We may revisit this later.
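
To illustrate the tradeoff described above, here is a minimal sketch of an 
open-addressing table that keeps the full 32-bit hash next to each key, so 
that expand can re-slot entries without rehashing. This is not the code from 
the patch: the class name StoredHashTable, the long keys, the linear probing 
and the 3/4 load factor are all assumptions made for illustration, and a 
hand-written Murmur3 finalizer stands in for the Guava Murmur hash so that the 
example has no dependencies.

```java
/** Illustrative sketch only; not the HIVE-6430 implementation. */
final class StoredHashTable {
    private long[] keys;     // stand-in for references to serialized keys
    private int[] hashes;    // full hash per slot: the "+4 bytes per key"
    private boolean[] used;  // slot occupancy
    private int size;

    StoredHashTable(int capacity) {
        if (capacity < 2 || Integer.bitCount(capacity) != 1) {
            throw new IllegalArgumentException("capacity must be a power of two >= 2");
        }
        keys = new long[capacity];
        hashes = new int[capacity];
        used = new boolean[capacity];
    }

    /** Murmur3 32-bit finalizer, used here as a dependency-free stand-in. */
    static int mix(long key) {
        int h = (int) (key ^ (key >>> 32));
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return h;
    }

    int size() {
        return size;
    }

    void put(long key) {
        // Grow before the table is more than 3/4 full, so probing always
        // finds an empty slot.
        if ((size + 1) * 4 > keys.length * 3) {
            expand();
        }
        int hash = mix(key);
        int mask = keys.length - 1;
        int slot = hash & mask;
        while (used[slot]) {
            if (hashes[slot] == hash && keys[slot] == key) {
                return; // already present
            }
            slot = (slot + 1) & mask;
        }
        used[slot] = true;
        keys[slot] = key;
        hashes[slot] = hash;
        size++;
    }

    boolean contains(long key) {
        int hash = mix(key);
        int mask = keys.length - 1;
        int slot = hash & mask;
        while (used[slot]) {
            // Comparing the stored hash first avoids most key comparisons.
            if (hashes[slot] == hash && keys[slot] == key) {
                return true;
            }
            slot = (slot + 1) & mask;
        }
        return false;
    }

    /** Doubles the capacity, re-slotting from stored hashes without rehashing. */
    private void expand() {
        long[] oldKeys = keys;
        int[] oldHashes = hashes;
        boolean[] oldUsed = used;
        int newCapacity = oldKeys.length * 2;
        keys = new long[newCapacity];
        hashes = new int[newCapacity];
        used = new boolean[newCapacity];
        int mask = newCapacity - 1;
        for (int i = 0; i < oldKeys.length; i++) {
            if (!oldUsed[i]) {
                continue;
            }
            int slot = oldHashes[i] & mask; // the key itself is never re-read
            while (used[slot]) {
                slot = (slot + 1) & mask;
            }
            used[slot] = true;
            keys[slot] = oldKeys[i];
            hashes[slot] = oldHashes[i];
        }
    }
}
```

In this sketch, dropping the hashes array would save 4 bytes per key, but 
expand would then have to recompute every hash from the key data.
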
Fast paths were added to WriteBuffers for the majority of cases, where the 
operation falls entirely within one buffer. There is still a bug in there that 
causes some queries to fail; I'll investigate. I wanted to upload the patch 
with what is done so far: the queries with large map joins that do work now 
run approximately as fast as before (I will measure more precisely later) in a 
fraction of the memory.
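
The single-buffer fast path mentioned above can be sketched roughly as 
follows. This is only an illustration of the general idea and not the actual 
WriteBuffers code: the class ChunkedBytes, its methods, and the big-endian int 
layout are hypothetical. A read that fits inside one chunk indexes the array 
directly, and only a read that straddles a chunk boundary goes byte by byte.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; not Hive's WriteBuffers. */
final class ChunkedBytes {
    private final int chunkSize;
    private final List<byte[]> chunks = new ArrayList<>();
    private long length;

    ChunkedBytes(int chunkSize) {
        if (chunkSize < 1) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        this.chunkSize = chunkSize;
    }

    void writeByte(int b) {
        int pos = (int) (length % chunkSize);
        if (pos == 0) {
            chunks.add(new byte[chunkSize]); // previous chunk is full (or none yet)
        }
        chunks.get(chunks.size() - 1)[pos] = (byte) b;
        length++;
    }

    /** Writes an int in big-endian order; it may span two chunks. */
    void writeInt(int v) {
        writeByte(v >>> 24);
        writeByte(v >>> 16);
        writeByte(v >>> 8);
        writeByte(v);
    }

    byte readByte(long offset) {
        return chunks.get((int) (offset / chunkSize))[(int) (offset % chunkSize)];
    }

    int readInt(long offset) {
        if (offset < 0 || offset + 4 > length) {
            throw new IndexOutOfBoundsException("offset " + offset);
        }
        int pos = (int) (offset % chunkSize);
        if (pos + 4 <= chunkSize) {
            // Fast path: all four bytes live in the same chunk.
            byte[] c = chunks.get((int) (offset / chunkSize));
            return ((c[pos] & 0xff) << 24) | ((c[pos + 1] & 0xff) << 16)
                    | ((c[pos + 2] & 0xff) << 8) | (c[pos + 3] & 0xff);
        }
        // Slow path: the value straddles a chunk boundary.
        int v = 0;
        for (int i = 0; i < 4; i++) {
            v = (v << 8) | (readByte(offset + i) & 0xff);
        }
        return v;
    }
}
```

With a realistically large chunk size, almost every read takes the first 
branch, which is why a fast path of this kind tends to pay off.
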

> MapJoin hash table has large memory overhead
> --------------------------------------------
>
>                 Key: HIVE-6430
>                 URL: https://issues.apache.org/jira/browse/HIVE-6430
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Sergey Shelukhin
>            Assignee: Sergey Shelukhin
>         Attachments: HIVE-6430.01.patch, HIVE-6430.02.patch, 
> HIVE-6430.03.patch, HIVE-6430.04.patch, HIVE-6430.05.patch, 
> HIVE-6430.06.patch, HIVE-6430.07.patch, HIVE-6430.patch
>
>
> Right now, in some queries, I see that storing e.g. 4 ints (2 for key and 2 
> for row) can take several hundred bytes, which is ridiculous. I am reducing 
> the size of MJKey and MJRowContainer in other JIRAs, but in general we don't 
> need to have a Java hash table there. We can either use a primitive-friendly 
> hashtable like the one from HPPC (Apache-licensed), or some variation, to map 
> primitive keys to a single row storage structure without an object per row 
> (similar to vectorization).



--
This message was sent by Atlassian JIRA
(v6.2#6252)
