[ https://issues.apache.org/jira/browse/HIVE-7074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13999580#comment-13999580 ]
Gopal V commented on HIVE-7074:
-------------------------------

For comparison, here is the TPC-DS 200-scale inventory table distributed over 200 buckets.

{code}
hive> select inv_item_sk%200 as bucket, count(1) as total from inventory group by inv_item_sk % 200 order by total desc limit 1;
199	225120
Time taken: 11.112 seconds, Fetched: 1 row(s)
{code}

versus 199 buckets.

{code}
hive> select inv_item_sk%199 as bucket, count(1) as total from inventory group by inv_item_sk % 199 order by total desc limit 1;
31	190428
Time taken: 10.377 seconds, Fetched: 1 row(s)
{code}

With 200 buckets, the slowest (largest) bucket holds 18% more rows than with 199 buckets (225120 vs. 190428), even though it uses an extra reducer.

> The reducer parallelism should be a prime number for better stride protection
> -----------------------------------------------------------------------------
>
>                 Key: HIVE-7074
>                 URL: https://issues.apache.org/jira/browse/HIVE-7074
>             Project: Hive
>          Issue Type: Improvement
>          Components: Statistics
>            Reporter: Gopal V
>            Assignee: Gopal V
>         Attachments: HIVE-7074.1.patch
>
>
> The current Hive reducer parallelism results in stride issues with key
> distribution.
> A JOIN generating even numbers will get strided onto only some of the
> reducers.
> The probability of distribution skew is controlled by the number of common
> factors shared by the hashcode of the key and the number of buckets.
> Using a prime number within the reducer estimation will cut that probability
> down by a significant amount.
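
The stride effect described in the issue can be reproduced with a small simulation. This is only a sketch under simplifying assumptions: integer keys are taken to hash to themselves, a key is routed to bucket `hash % num_buckets`, and the workload (100,000 even keys) is hypothetical rather than taken from the TPC-DS data. The helper names `bucket_loads` and `summarize` are made up for the illustration and are not part of Hive.

```python
from collections import Counter


def bucket_loads(keys, num_buckets):
    """Count keys per bucket, assuming bucket = key % num_buckets."""
    return Counter(key % num_buckets for key in keys)


def summarize(keys, num_buckets):
    """Return (number of non-empty buckets, size of the largest bucket)."""
    loads = bucket_loads(keys, num_buckets)
    return len(loads), max(loads.values())


# Hypothetical workload: 100,000 even keys, standing in for a JOIN
# that only ever emits even numbers.
even_keys = range(0, 200_000, 2)

used_200, max_200 = summarize(even_keys, 200)
used_199, max_199 = summarize(even_keys, 199)

# 200 shares the factor 2 with every key, so only even buckets are hit.
print(f"200 buckets: {used_200} used, largest holds {max_200} keys")
# 199 is prime and shares no factor with the stride, so all buckets are hit.
print(f"199 buckets: {used_199} used, largest holds {max_199} keys")
```

Under these assumptions, 200 buckets leave half of the reducers idle and put 1000 keys on each busy one, while 199 buckets use every reducer with at most 503 keys each. Real key distributions, such as the inventory table above, are less regular, so the skew is smaller but follows the same mechanism.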