[ https://issues.apache.org/jira/browse/HIVE-20873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16677804#comment-16677804 ]
Teddy Choi commented on HIVE-20873:
-----------------------------------

In my case, TPC-H query 21 and TPC-DS query 16 seem to be related to this issue. TPC-H query 21 uses a map join, and TPC-DS query 16 uses a group by. Both of them use VectorHashKeyWrapperBatch, which uses VectorHashKeyWrapperSingleLong, which in turn uses HashCodeUtil.calculateLongHashCode.

There are other hash algorithms as well, but Murmur3 is already used in Hadoop and Hive. See org.apache.hive.common.util.Murmur3 and org.apache.hadoop.util.hash.MurmurHash. So I think it would be safe to use Murmur3 instead of benchmarking other hash algorithms.

> Use Murmur hash for VectorHashKeyWrapperTwoLong to reduce hash collision
> ------------------------------------------------------------------------
>
>                 Key: HIVE-20873
>                 URL: https://issues.apache.org/jira/browse/HIVE-20873
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Teddy Choi
>            Assignee: Teddy Choi
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: HIVE-20873.1.patch, HIVE-20873.2.patch
>
>
> VectorHashKeyWrapperTwoLong is implemented with a few bit-shift and XOR
> operators, which keeps computation time short but causes more hash
> collisions. As a result, group by operations become very slow on large
> data sets. It needs Murmur hash or a better hash function to reduce
> hash collisions.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
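
To illustrate the collision concern in the issue description, here is a minimal sketch. It is not the Hive code: `shiftXorHash` is a hypothetical stand-in for a cheap shift-and-XOR hash over two longs, and `mixedHash` is one assumed way of combining two longs with the 64-bit finalizer (fmix64) from MurmurHash3. It is not the full Murmur3 algorithm, nor what `org.apache.hive.common.util.Murmur3` or the attached patches necessarily do. The class and method names are made up for this example.

```java
import java.util.HashSet;
import java.util.Set;

class HashCollisionSketch {

    /** Hypothetical interface for a hash over a two-long key. */
    interface TwoLongHash {
        int hash(long a, long b);
    }

    /** Illustrative shift-and-XOR hash (an assumption, not Hive's implementation). */
    static int shiftXorHash(long a, long b) {
        // Folds each long to 32 bits and XORs them together.
        return (int) (a ^ (a >>> 32) ^ b ^ (b >>> 32));
    }

    /** The 64-bit finalizer (fmix64) from MurmurHash3. */
    static long fmix64(long k) {
        k ^= k >>> 33;
        k *= 0xff51afd7ed558ccdL;
        k ^= k >>> 33;
        k *= 0xc4ceb9fe1a85ec53L;
        k ^= k >>> 33;
        return k;
    }

    /** Assumed combination: mix the first long, XOR in the second, mix again. */
    static int mixedHash(long a, long b) {
        return (int) fmix64(fmix64(a) ^ b);
    }

    /** Counts distinct hash codes over all keys (a, b) with 0 <= a, b < n. */
    static int countDistinct(TwoLongHash h, int n) {
        Set<Integer> seen = new HashSet<>();
        for (long a = 0; a < n; a++) {
            for (long b = 0; b < n; b++) {
                seen.add(h.hash(a, b));
            }
        }
        return seen.size();
    }

    public static void main(String[] args) {
        int n = 256;
        System.out.println("keys:               " + (n * n));
        System.out.println("shift/XOR distinct: " + countDistinct(HashCollisionSketch::shiftXorHash, n));
        System.out.println("fmix64 distinct:    " + countDistinct(HashCollisionSketch::mixedHash, n));
    }
}
```

For small key values the shift-and-XOR variant reduces to `a ^ b`, so the 65,536 keys in this grid map to only 256 distinct hash codes, and (a, b) always collides with (b, a). The fmix64-based variant should yield a distinct count close to the number of keys. Real workloads would need a proper benchmark; this only shows why low-entropy keys are a problem for the cheap hash.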