[ https://issues.apache.org/jira/browse/HIVE-20873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16678852#comment-16678852 ]
Gopal V commented on HIVE-20873: -------------------------------- [~bslim]: Teddy & I have a UDF for the hash function, which we use to calculate skews. I've merged Teddy's changes into it https://github.com/t3rmin4t0r/long-hash-udf {code} select long2hash(i_item_sk, 1) & 255, count(1) from item group by long2hash(i_item_sk, 1) & 255 order by count(1) desc ; 0 65536 2 65536 3 65536 1 65535 5 37857 {code} So there's a bit-skew in the old hash function, instead of generating 256 unique bit-patterns, but it skews the low-bits by the 2nd arg to the long2 hash. {code} select long2murmur(i_item_sk, 1) & 255, count(1) from item group by long2murmur(i_item_sk, 1) & 255 order by count(1) desc ; 170 1274 37 1264 220 1254 110 1253 152 1241 5 1235 56 1232 179 1231 231 1228 168 1228 149 1228 84 1222 ... 156 1082 Time taken: 1.727 seconds, Fetched: 256 row(s) {code} > Use Murmur hash for VectorHashKeyWrapperTwoLong to reduce hash collision > ------------------------------------------------------------------------ > > Key: HIVE-20873 > URL: https://issues.apache.org/jira/browse/HIVE-20873 > Project: Hive > Issue Type: Improvement > Reporter: Teddy Choi > Assignee: Teddy Choi > Priority: Major > Labels: pull-request-available > Attachments: HIVE-20873.1.patch, HIVE-20873.2.patch > > > VectorHashKeyWrapperTwoLong is implemented with few bit shift operators and > XOR operators for short computation time, but more hash collision. Group by > operations become very slow on large data sets. It needs Murmur hash or a > better hash function for less hash collision. -- This message was sent by Atlassian JIRA (v7.6.3#76005)