[ https://issues.apache.org/jira/browse/HIVE-23356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ashutosh Chauhan updated HIVE-23356: ------------------------------------ Fix Version/s: 4.0.0 Resolution: Fixed Status: Resolved (was: Patch Available) Pushed to master. Thanks, Qiang.Kang! > Hash aggregation is always disabled while processing querys with grouping > sets expressions. > ------------------------------------------------------------------------------------------- > > Key: HIVE-23356 > URL: https://issues.apache.org/jira/browse/HIVE-23356 > Project: Hive > Issue Type: Bug > Components: Hive, Operators, Query Processor > Affects Versions: All Versions > Reporter: Qiang.Kang > Assignee: Qiang.Kang > Priority: Critical > Labels: performance > Fix For: 4.0.0 > > Attachments: HIVE-23356.01.patch, HIVE-23356.02.patch, > HIVE-23356.03.patch > > > Hi, we find that many MapReduce map attempts disable hash aggregation while > processing GroupBy Operator with grouping sets expressions. Logs are just > like this: > `Disable Hash Aggr: #hash table = 104427 #total = 100000 reduction = 1.0 > minReduction = 0.5`, which means hashTable size is even larger than the > number of input rows. > Dig deeper we can find that GroupBy operator disables hash aggregation just > based on the number of input rows instead of rows actually processed. For > GroupBy Operator with grouping sets (such as: roll up, cube, etc.), > `numberRowsProcessed` equals `groupingSets.size()*numberRowsInput`, causing > hashTableSize much larger than querys without grouping sets expressions. > Let's consider a example below: > - Query: > SELECT col1, col2, MAX(col0) FROM tbl GROUP BY col1, col2 GROUPING SETS > (col1,col2); > - Input data: > col0, col1, col2 > [0,-1,-1] > [1,-1,-1] > [2,-1,-1] > [3,3,3] > [4,4,4] > - Configuration: > hive.groupby.mapaggr.checkinterval = 1; > minReductionHashAggr = 0.5; > - Currently, there are 10 rows in partial aggregate result. (hash aggregation > is disabled very early): > [-1, null, 0] > [null, -1, 0] > [-1, null, 1] > [null, -1, 1] > [-1, null, 2] > [null, -1, 2] > [3, null, 3] > [null, 3, 3] > [4, null, 4] > [null, 4, 4] > The patch below disables hash aggregation based on the number of rows > actually processed, reducing partial result entries a lot and improving the > performance of aggregation. There are only 6 rows in partial aggregate result: > [-1, null, 2] > [null, -1, 2] > [null, 3, 3] > [4, null, 4] > [null, 4, 4] > [3, null, 3] -- This message was sent by Atlassian Jira (v8.3.4#803005)