[ https://issues.apache.org/jira/browse/HIVE-6430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13942504#comment-13942504 ]

Sergey Shelukhin commented on HIVE-6430:
----------------------------------------

Finally fixed the last glitches and got some memory numbers. Next, I will try 
some queries on a real cluster...

On standard tables (the over10k data file), we join the entire table with 7k 
rows of the same table on one column, which yields only 407 unique keys. Each 
row contains 3 columns from the joined table.
Note that the "from" case already uses LazyFlatRowContainer, so this is on top 
of the gain from HIVE-6418.

The memory usage goes from:

|Class|Objects|Shallow Size|Retained Size|
|org.apache.hadoop.hive.ql.exec.persistence.HashMapWrapper|1|32|*880632*|
|java.util.HashMap|2|96|880560|
|java.util.HashMap$Entry[]|2|65632|880464|
|java.util.HashMap$Entry|407|13024|814832|
|java.lang.Object[]|810|101008|785488|
|org.apache.hadoop.hive.ql.exec.persistence.LazyFlatRowContainer|405|9720|775768|
|org.apache.hadoop.io.Text|7000|168000|394760|
|byte[]|7001|226776|226776|
|org.apache.hadoop.hive.serde2.io.DoubleWritable|7000|168000|168000|
|org.apache.hadoop.io.IntWritable|7000|112000|112000|
|org.apache.hadoop.hive.ql.exec.persistence.MapJoinKeyObject|405|6480|25920|
|org.apache.hadoop.io.LongWritable|405|9720|9720|
|java.lang.String|2|64|120|
|char[]|2|56|56|
|org.apache.hadoop.hive.serde2.ByteStream$Output|1|24|40|

To:
|Class|Objects|Shallow Size|Retained Size|
|org.apache.hadoop.hive.ql.exec.persistence.MapJoinBytesTableContainer|1|32|*340664*|
|org.apache.hadoop.hive.ql.exec.persistence.BytesBytesMultiHashMap|1|48|340392|
|java.util.ArrayList|4|96|209344|
|java.lang.Object[]|6|152|209304|
|org.apache.hadoop.hive.serde2.WriteBuffers|1|56|209256|
|byte[]|1|209152|209152|
|long[]|1|131088|131088|
|org.apache.hadoop.hive.ql.exec.persistence.MapJoinBytesTableContainer$KeyValueWriter|1|40|200|

*That is a 61% reduction* on top of HIVE-6418.
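
For context on why the flat structure wins: all keys and values are serialized 
into one shared byte[] (held by the WriteBuffers above), and the hash table 
itself appears as the long[] of offsets into it, so there is almost no 
per-entry object cost. Below is a minimal, hypothetical Java sketch of that 
layout; the class name, record format, and sizing are invented for 
illustration, and this is not the actual BytesBytesMultiHashMap code:

{code:java}
import java.nio.ByteBuffer;
import java.util.Arrays;

// Sketch only: every record is appended to one shared byte[], and the
// "hash table" is a long[] of offsets, so adding an entry allocates no
// Entry/key/value objects at all.
public class FlatMultiMapSketch {
  private byte[] heap = new byte[1 << 16]; // serialized key/value records
  private int used = 0;
  private long[] buckets = new long[1 << 10]; // offset + 1 of newest record; 0 = empty
  // Record layout: [int keyLen][key][int valLen][value][long next record offset + 1]

  public void put(byte[] key, byte[] value) {
    int recLen = 4 + key.length + 4 + value.length + 8;
    while (used + recLen > heap.length) heap = Arrays.copyOf(heap, heap.length * 2);
    int slot = (Arrays.hashCode(key) & 0x7fffffff) % buckets.length;
    ByteBuffer bb = ByteBuffer.wrap(heap, used, recLen);
    bb.putInt(key.length).put(key).putInt(value.length).put(value);
    bb.putLong(buckets[slot]); // chain to the previous record in this bucket
    buckets[slot] = used + 1;
    used += recLen;
  }

  /** Returns the most recently inserted value for key, or null. */
  public byte[] getFirst(byte[] key) {
    int slot = (Arrays.hashCode(key) & 0x7fffffff) % buckets.length;
    for (long off = buckets[slot]; off != 0; ) {
      int pos = (int) off - 1;
      ByteBuffer bb = ByteBuffer.wrap(heap, pos, used - pos);
      byte[] k = new byte[bb.getInt()]; bb.get(k);
      byte[] v = new byte[bb.getInt()]; bb.get(v);
      if (Arrays.equals(k, key)) return v;
      off = bb.getLong(); // walk the chain of records in this bucket
    }
    return null;
  }
}
{code}

Note how this matches the second table: a single byte[] plus a long[] replace 
the thousands of HashMap$Entry, Writable, and row container objects from the 
first table.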

If the join is on 4 columns (to increase the number of unique keys to 7000, one 
row per key), the usage goes from:
|Class|Objects|Shallow Size|Retained Size|
|org.apache.hadoop.hive.ql.exec.persistence.HashMapWrapper|1|32|*2196624*|
|java.util.HashMap|2|96|2196552|
|java.util.HashMap$Entry[]|2|65632|2196456|
|java.util.HashMap$Entry|7002|224064|2130824|
|java.lang.Object[]|13999|447968|1626656|
|org.apache.hadoop.hive.ql.exec.persistence.LazyFlatRowContainer|7000|168000|1066760|
|org.apache.hadoop.hive.ql.exec.persistence.MapJoinKeyObject|6999|111984|839880|
|org.apache.hadoop.io.Text|7000|168000|394760|
|byte[]|7001|226776|226776|
|org.apache.hadoop.io.IntWritable|13999|223984|223984|
|org.apache.hadoop.hive.serde2.io.DoubleWritable|7000|168000|168000|
|org.apache.hadoop.io.LongWritable|6999|167976|167976|
|org.apache.hadoop.hive.serde2.io.ByteWritable|6999|111984|111984|
|org.apache.hadoop.hive.serde2.io.ShortWritable|6999|111984|111984|
|java.lang.String|2|64|120|
|char[]|2|56|56|
|org.apache.hadoop.hive.serde2.ByteStream$Output|1|24|40|


To:
|Class|Objects|Shallow Size|Retained Size|
|org.apache.hadoop.hive.ql.exec.persistence.MapJoinBytesTableContainer|1|32|*452976*|
|org.apache.hadoop.hive.ql.exec.persistence.BytesBytesMultiHashMap|1|48|452688|
|java.util.ArrayList|4|96|321648|
|java.lang.Object[]|6|168|321616|
|org.apache.hadoop.hive.serde2.WriteBuffers|1|56|321552|
|byte[]|1|321448|321448|
|long[]|1|131088|131088|
|org.apache.hadoop.hive.ql.exec.persistence.MapJoinBytesTableContainer$KeyValueWriter|1|40|216|

*That is a 79% reduction* on top of HIVE-6418, or roughly 5 times smaller 
(though this is a rather favorable case).
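
A back-of-the-envelope check (assuming a 64-bit JVM with compressed oops, i.e. 
a 12-byte object header and 8-byte alignment) shows where the old layout's 
memory goes; the arithmetic below reproduces two shallow-size rows from the 
4-column table above:

{code:java}
public class OverheadCheck {
  public static void main(String[] args) {
    // IntWritable: 12-byte header + 4-byte int field = 16 bytes per object,
    // i.e. 4x overhead for 4 bytes of actual data.
    System.out.println(13999 * 16); // 223984, the IntWritable shallow size above
    // HashMap$Entry: 12-byte header + int hash + 3 compressed references,
    // padded to 32 bytes per entry.
    System.out.println(7002 * 32); // 224064, the HashMap$Entry shallow size above
    // In the byte[]-backed layout the same int costs only its serialized bytes.
  }
}
{code}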


> MapJoin hash table has large memory overhead
> --------------------------------------------
>
>                 Key: HIVE-6430
>                 URL: https://issues.apache.org/jira/browse/HIVE-6430
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Sergey Shelukhin
>            Assignee: Sergey Shelukhin
>         Attachments: HIVE-6430.01.patch, HIVE-6430.02.patch, 
> HIVE-6430.03.patch, HIVE-6430.04.patch, HIVE-6430.05.patch, HIVE-6430.patch
>
>
> Right now, in some queries, I see that storing e.g. 4 ints (2 for the key and 
> 2 for the row) can take several hundred bytes, which is ridiculous. I am 
> reducing the size of MJKey and MJRowContainer in other jiras, but in general 
> we don't need to have a Java hash table there. We can either use a 
> primitive-friendly hash table like the one from HPPC (Apache-licensed), or 
> some variation, to map primitive keys to a single-row storage structure 
> without an object per row (similar to vectorization).
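
As an aside, here is a hedged sketch of the "primitive-friendly hashtable" idea 
from the description above: open addressing over parallel primitive arrays, so 
a long key maps to an int offset with no Entry or boxed-key objects. The class 
name and sizing policy are invented for illustration; this is not HPPC's actual 
API:

{code:java}
// Sketch only: a long -> int map backed by parallel primitive arrays.
public class LongToOffsetMap {
  private long[] keys = new long[1024];
  private int[] offsets = new int[1024];
  private boolean[] filled = new boolean[1024];
  private int size = 0;

  public void put(long key, int offset) {
    if (size * 2 >= keys.length) grow(); // keep the load factor under 0.5
    int i = slot(key);
    while (filled[i] && keys[i] != key) i = (i + 1) % keys.length; // linear probing
    if (!filled[i]) { filled[i] = true; keys[i] = key; size++; }
    offsets[i] = offset;
  }

  /** Returns the offset stored for key, or -1 if absent. */
  public int get(long key) {
    for (int i = slot(key); filled[i]; i = (i + 1) % keys.length) {
      if (keys[i] == key) return offsets[i];
    }
    return -1;
  }

  private int slot(long key) {
    return (int) ((key ^ (key >>> 32)) & 0x7fffffff) % keys.length;
  }

  private void grow() { // double the arrays and reinsert all live entries
    long[] oldKeys = keys; int[] oldOffsets = offsets; boolean[] oldFilled = filled;
    keys = new long[oldKeys.length * 2];
    offsets = new int[oldOffsets.length * 2];
    filled = new boolean[oldFilled.length * 2];
    size = 0;
    for (int i = 0; i < oldKeys.length; i++) {
      if (oldFilled[i]) put(oldKeys[i], oldOffsets[i]);
    }
  }
}
{code}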



