[jira] [Updated] (HIVE-6430) MapJoin hash table has large memory overhead

Sergey Shelukhin (JIRA) Wed, 23 Apr 2014 19:56:24 -0700

     [ 
https://issues.apache.org/jira/browse/HIVE-6430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Sergey Shelukhin updated HIVE-6430:
-----------------------------------

    Attachment: HIVE-6430.09.patch

This replaces the Guava murmurhash with an inline one, and adds an (untested)
serialization bypass for serdes (when testing a fast query, the hash and the
byte copies in serdes are the most prominent differences in my profiled runs).
Unfortunately, for the latter I've discovered that the keys given to us are
serialized using BinarySortableSerDe, because they come from
ReduceSinkOperator. I will need to sync with Gunther tomorrow on this. The most
likely outcome is that we'll change the Tez hashtable output to lazy serde, so
that we can just copy bytes. An alternative would be to change key
serialization to binarysortable, but that's ugly because values would stay on
lazybinary, so we would have two paths. Plus, a bunch of changes would be
required in binarysortable to avoid byte copies again, and to use
RandomAccessOutput instead of its OutputBuffer thing. Yet another alternative
is to do the bypass only for values, not keys.
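
For illustration only, here is a minimal sketch of what an inline 32-bit murmur
hash over a byte range might look like. It is not the code from the attached
patch: the class name InlineMurmur3 and the method hash32 are hypothetical, and
it assumes the MurmurHash3 x86_32 variant with little-endian block reads. The
point is that hashing directly on (array, offset, length) avoids allocating a
hasher object or copying the bytes first.

```java
// Hypothetical sketch: MurmurHash3 (x86, 32-bit) computed in place on a byte range.
final class InlineMurmur3 {
    private static final int C1 = 0xcc9e2d51;
    private static final int C2 = 0x1b873593;

    static int hash32(byte[] data, int offset, int length, int seed) {
        int h1 = seed;
        int end = offset + (length & ~3); // end of the last full 4-byte block

        // Body: mix one little-endian 4-byte block at a time.
        for (int i = offset; i < end; i += 4) {
            int k1 = (data[i] & 0xff)
                    | ((data[i + 1] & 0xff) << 8)
                    | ((data[i + 2] & 0xff) << 16)
                    | ((data[i + 3] & 0xff) << 24);
            k1 *= C1;
            k1 = Integer.rotateLeft(k1, 15);
            k1 *= C2;
            h1 ^= k1;
            h1 = Integer.rotateLeft(h1, 13);
            h1 = h1 * 5 + 0xe6546b64;
        }

        // Tail: fold in the remaining 1 to 3 bytes, if any.
        int k1 = 0;
        int tail = length & 3;
        if (tail == 3) {
            k1 ^= (data[end + 2] & 0xff) << 16;
        }
        if (tail >= 2) {
            k1 ^= (data[end + 1] & 0xff) << 8;
        }
        if (tail >= 1) {
            k1 ^= (data[end] & 0xff);
            k1 *= C1;
            k1 = Integer.rotateLeft(k1, 15);
            k1 *= C2;
            h1 ^= k1;
        }

        // Finalization: force all bits of the hash to avalanche.
        h1 ^= length;
        h1 ^= h1 >>> 16;
        h1 *= 0x85ebca6b;
        h1 ^= h1 >>> 13;
        h1 *= 0xc2b2ae35;
        h1 ^= h1 >>> 16;
        return h1;
    }
}
```

A caller holding serialized keys in one large buffer could hash each key with
hash32(buffer, keyOffset, keyLength, seed) without creating any intermediate
objects.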

Regardless, I think we should commit this patch soon (even if off by
default), and do additional improvements in separate jiras.
The patch is growing too big.

> MapJoin hash table has large memory overhead
> --------------------------------------------
>
>                 Key: HIVE-6430
>                 URL: https://issues.apache.org/jira/browse/HIVE-6430
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Sergey Shelukhin
>            Assignee: Sergey Shelukhin
>         Attachments: HIVE-6430.01.patch, HIVE-6430.02.patch, 
> HIVE-6430.03.patch, HIVE-6430.04.patch, HIVE-6430.05.patch, 
> HIVE-6430.06.patch, HIVE-6430.07.patch, HIVE-6430.08.patch, 
> HIVE-6430.09.patch, HIVE-6430.patch
>
>
> Right now, in some queries, I see that storing e.g. 4 ints (2 for key and 2 
> for row) can take several hundred bytes, which is ridiculous. I am reducing 
> the size of MJKey and MJRowContainer in other jiras, but in general we don't 
> need to have a Java hash table there.  We can either use a primitive-friendly 
> hashtable like the one from HPPC (Apache-licensed), or some variation, to map 
> primitive keys to a single row storage structure without an object per row 
> (similar to vectorization).
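
To make the idea in the description concrete, here is a minimal sketch of a
primitive open-addressing map from a long key to an int row offset. It is not
HPPC's API and not the structure from the patches: the class
LongToIntOpenHashMap, its methods, and the packKey helper are hypothetical. It
assumes a two-int key packed into one long, and a value that is an offset into
some separate flat row storage, so no object is allocated per entry.

```java
// Hypothetical sketch: linear-probing hash map backed by primitive arrays.
final class LongToIntOpenHashMap {
    private long[] keys;
    private int[] values;
    private boolean[] used;
    private int size;

    LongToIntOpenHashMap(int expectedSize) {
        int capacity = 16;
        while (capacity < expectedSize * 2) {
            capacity <<= 1; // power-of-two capacity, load factor kept at or below 0.5
        }
        keys = new long[capacity];
        values = new int[capacity];
        used = new boolean[capacity];
    }

    // Packs two int key columns into a single long key.
    static long packKey(int hi, int lo) {
        return ((long) hi << 32) | (lo & 0xFFFFFFFFL);
    }

    private static int mix(long key) {
        key *= 0x9E3779B97F4A7C15L; // multiplicative scramble of the key bits
        return (int) (key ^ (key >>> 32));
    }

    void put(long key, int value) {
        if ((size + 1) * 2 > keys.length) {
            resize();
        }
        int mask = keys.length - 1;
        int i = mix(key) & mask;
        while (used[i]) {
            if (keys[i] == key) {
                values[i] = value; // overwrite the existing entry
                return;
            }
            i = (i + 1) & mask; // linear probing
        }
        used[i] = true;
        keys[i] = key;
        values[i] = value;
        size++;
    }

    int get(long key, int missingValue) {
        int mask = keys.length - 1;
        int i = mix(key) & mask;
        while (used[i]) {
            if (keys[i] == key) {
                return values[i];
            }
            i = (i + 1) & mask;
        }
        return missingValue;
    }

    int size() {
        return size;
    }

    private void resize() {
        long[] oldKeys = keys;
        int[] oldValues = values;
        boolean[] oldUsed = used;
        keys = new long[oldKeys.length * 2];
        values = new int[oldKeys.length * 2];
        used = new boolean[oldKeys.length * 2];
        size = 0;
        for (int i = 0; i < oldKeys.length; i++) {
            if (oldUsed[i]) {
                put(oldKeys[i], oldValues[i]); // reinsert into the larger table
            }
        }
    }
}
```

With this layout each entry costs one long, one int and one boolean slot (times
the inverse load factor), rather than a boxed key, a boxed value and an entry
object per row.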



--
This message was sent by Atlassian JIRA
(v6.2#6252)
