[ https://issues.apache.org/jira/browse/HIVE-6430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sergey Shelukhin updated HIVE-6430:
-----------------------------------
Attachment: HIVE-6430.07.patch
This patch fixes several issues. The main change is that the Murmur hash from
Guava is now used; the previous hash code method distributed keys very poorly,
and performance suffered badly as a result.
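
To illustrate why the choice of hash matters for an open hash table, here is a
minimal sketch in plain Java. It is not the code from the patch and it does not
call Guava; it only applies the 32-bit finalization step of Murmur3 (fmix32) to
integer keys. The class name HashSpread, the key pattern (multiples of 1024)
and the bucket count are assumptions chosen to make the effect visible.

```java
import java.util.HashSet;
import java.util.Set;

public class HashSpread {
    // Murmur3 32-bit finalization mix (fmix32): spreads every input bit
    // across the whole 32-bit result.
    static int fmix32(int h) {
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return h;
    }

    // Hypothetical key set: n multiples of 1024, hashed either with an
    // identity-style hash (mix == false) or with fmix32 (mix == true).
    static int[] sampleHashes(int n, boolean mix) {
        int[] hashes = new int[n];
        for (int i = 0; i < n; i++) {
            int key = i * 1024;
            hashes[i] = mix ? fmix32(key) : key;
        }
        return hashes;
    }

    // Counts distinct buckets hit when the table masks off the low bits.
    // buckets must be a power of two.
    static int bucketsUsed(int[] hashes, int buckets) {
        Set<Integer> used = new HashSet<>();
        for (int h : hashes) {
            used.add(h & (buckets - 1));
        }
        return used.size();
    }

    public static void main(String[] args) {
        int buckets = 1024;
        System.out.println("identity hash buckets used: "
            + bucketsUsed(sampleHashes(1000, false), buckets));
        System.out.println("fmix32 hash buckets used:   "
            + bucketsUsed(sampleHashes(1000, true), buckets));
    }
}
```

With the identity-style hash every key lands in the same bucket, because the
low 10 bits of each key are zero; the mixed hash spreads the same keys over
many buckets.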
There was also an issue with the previously used expand method. To make expand
fast, the full hash is now stored with each key. Nothing else needs it, so it
is a tradeoff: more memory (+4 bytes per key) versus an expensive rehash. We
may revisit this later.
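
The tradeoff above can be sketched as follows. This is an illustration under
assumptions, not the hash table from the patch: StoredHashTable, its fields,
the long keys, the linear probing and the 0.75 load factor are all
hypothetical. The point it shows is that expand() re-places entries using the
stored 32-bit hash and never calls the hash function again.

```java
import java.util.Arrays;

public class StoredHashTable {
    private static final int EMPTY = -1;

    private int[] slotHash;   // full 32-bit hash per slot (the +4 bytes per key)
    private int[] slotKey;    // index into keys, or EMPTY
    private long[] keys = new long[8];
    private int size = 0;
    public int hashCalls = 0; // counts how often keys are actually hashed

    public StoredHashTable(int capacityPowerOfTwo) {
        slotHash = new int[capacityPowerOfTwo];
        slotKey = new int[capacityPowerOfTwo];
        Arrays.fill(slotKey, EMPTY);
    }

    private int hash(long key) {
        hashCalls++;
        return Long.hashCode(key * 0x9E3779B97F4A7C15L);
    }

    public boolean add(long key) {
        // Keep the load factor at or below 0.75 so probing always terminates.
        if ((size + 1) * 4 > slotKey.length * 3) {
            expand();
        }
        int h = hash(key);
        int mask = slotKey.length - 1;
        int slot = h & mask;
        while (slotKey[slot] != EMPTY) {
            if (slotHash[slot] == h && keys[slotKey[slot]] == key) {
                return false; // already present
            }
            slot = (slot + 1) & mask;
        }
        if (size == keys.length) {
            keys = Arrays.copyOf(keys, size * 2);
        }
        keys[size] = key;
        slotHash[slot] = h;
        slotKey[slot] = size++;
        return true;
    }

    public boolean contains(long key) {
        int h = hash(key);
        int mask = slotKey.length - 1;
        for (int slot = h & mask; slotKey[slot] != EMPTY; slot = (slot + 1) & mask) {
            if (slotHash[slot] == h && keys[slotKey[slot]] == key) {
                return true;
            }
        }
        return false;
    }

    // Doubles the slot arrays and re-places entries from their stored hashes;
    // the keys themselves are never read or rehashed here.
    private void expand() {
        int[] oldHash = slotHash;
        int[] oldKey = slotKey;
        int newCapacity = oldKey.length * 2;
        slotHash = new int[newCapacity];
        slotKey = new int[newCapacity];
        Arrays.fill(slotKey, EMPTY);
        int mask = newCapacity - 1;
        for (int i = 0; i < oldKey.length; i++) {
            if (oldKey[i] == EMPTY) {
                continue;
            }
            int slot = oldHash[i] & mask;
            while (slotKey[slot] != EMPTY) {
                slot = (slot + 1) & mask;
            }
            slotHash[slot] = oldHash[i];
            slotKey[slot] = oldKey[i];
        }
    }

    public int capacity() {
        return slotKey.length;
    }
}
```

Starting from a small capacity and adding many keys forces several expansions,
yet hashCalls ends up equal to the number of add() calls. Without the stored
hash, each expansion would have to recompute the hash of every key.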
Fast paths were added to WriteBuffers for the majority of cases, where the
whole operation falls within a single buffer. There is still a bug in there
that causes some queries to fail; I'll investigate. I wanted to upload the
patch with what is done so far. The queries with large map joins that do work
now run approximately as fast as before (I will measure more precisely later),
in a fraction of the memory.
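
The single-buffer fast path can be sketched roughly like this. The code below
is not Hive's WriteBuffers and does not reproduce its API; ChunkedBytes, its
methods and the counters are hypothetical names used only to show the idea:
when all bytes of a value lie in one chunk, read them directly, and fall back
to a byte-by-byte path only when the value straddles a chunk boundary.

```java
import java.util.ArrayList;
import java.util.List;

public class ChunkedBytes {
    private final int chunkSize;
    private final List<byte[]> chunks = new ArrayList<>();
    private long length = 0;
    public int fastReads = 0; // reads served from a single chunk
    public int slowReads = 0; // reads that crossed a chunk boundary

    public ChunkedBytes(int chunkSize) {
        this.chunkSize = chunkSize;
    }

    public void writeByte(int b) {
        int offset = (int) (length % chunkSize);
        if (offset == 0) {
            chunks.add(new byte[chunkSize]); // current chunk is full or absent
        }
        chunks.get(chunks.size() - 1)[offset] = (byte) b;
        length++;
    }

    // Writes a big-endian 32-bit value, possibly across two chunks.
    public void writeInt(int v) {
        writeByte(v >>> 24);
        writeByte(v >>> 16);
        writeByte(v >>> 8);
        writeByte(v);
    }

    public int readByte(long pos) {
        return chunks.get((int) (pos / chunkSize))[(int) (pos % chunkSize)] & 0xff;
    }

    public int readInt(long pos) {
        int index = (int) (pos / chunkSize);
        int offset = (int) (pos % chunkSize);
        if (offset + 4 <= chunkSize) {
            // Fast path: all four bytes live in the same chunk.
            fastReads++;
            byte[] c = chunks.get(index);
            return ((c[offset] & 0xff) << 24)
                | ((c[offset + 1] & 0xff) << 16)
                | ((c[offset + 2] & 0xff) << 8)
                | (c[offset + 3] & 0xff);
        }
        // Slow path: the value straddles a chunk boundary.
        slowReads++;
        int v = 0;
        for (int i = 0; i < 4; i++) {
            v = (v << 8) | readByte(pos + i);
        }
        return v;
    }

    public static void main(String[] args) {
        ChunkedBytes buf = new ChunkedBytes(6); // tiny chunks to force a split
        buf.writeInt(1);
        buf.writeInt(-2);
        buf.writeInt(0x12345678);
        System.out.println(buf.readInt(0) + " " + buf.readInt(4) + " "
            + Integer.toHexString(buf.readInt(8)));
        System.out.println("fast=" + buf.fastReads + " slow=" + buf.slowReads);
    }
}
```

With realistically large chunks almost every read takes the fast path, which
matches the observation above that most operations stay within one buffer.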
> MapJoin hash table has large memory overhead
> --------------------------------------------
>
> Key: HIVE-6430
> URL: https://issues.apache.org/jira/browse/HIVE-6430
> Project: Hive
> Issue Type: Improvement
> Reporter: Sergey Shelukhin
> Assignee: Sergey Shelukhin
> Attachments: HIVE-6430.01.patch, HIVE-6430.02.patch,
> HIVE-6430.03.patch, HIVE-6430.04.patch, HIVE-6430.05.patch,
> HIVE-6430.06.patch, HIVE-6430.07.patch, HIVE-6430.patch
>
>
> Right now, in some queries, I see that storing e.g. 4 ints (2 for key and 2
> for row) can take several hundred bytes, which is ridiculous. I am reducing
> the size of MJKey and MJRowContainer in other JIRAs, but in general we don't
> need to have a Java hash table there. We can either use a primitive-friendly
> hash table like the one from HPPC (Apache-licensed), or some variation, to
> map primitive keys to a single row storage structure without an object per
> row (similar to vectorization).
--
This message was sent by Atlassian JIRA
(v6.2#6252)