[ https://issues.apache.org/jira/browse/HIVE-2621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13174730#comment-13174730 ]
Phabricator commented on HIVE-2621:
-----------------------------------

kevinwilfong has commented on the revision "HIVE-2621 [jira] Allow multiple group bys with the same input data and spray keys to be run on the same reducer.".

INLINE COMMENTS

ql/src/test/queries/clientpositive/groupby7_noskew_multi_single_reducer.q:12
I actually had it in one MR job originally, but the addition of the limits caused the addition of an extra job for each insert. The code to do that is at line 6379 of the SemanticAnalyzer. I assume it does this to ensure all the rows go to a single reducer, so the limit can be enforced.

As for making hive.multigroupby.singlereducer true by default: currently map aggregation is set to true by default, and setting this new variable causes the map aggregation variable to be ignored for any group by that can go to the same reducer as at least one other. I'll turn it on by default; I just wanted to make that clear. Also, in that case this test isn't needed, as I just copied another test and turned on hive.multigroupby.singlereducer.

ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java:6211
This code looks at all the group bys in the query block and allows for only some of the distinct expressions and group by keys to be the same. The new code looks for sets of group by subqueries where all of the distinct expressions and group by keys are the same. I'll go back and see if I reused as much of the definitions of the two functions getCommonDistinctExprs and getCommonGroupByKeys as I could have, but I don't think this code here could be simplified using my new code.

ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java:6273
The above if block guarantees the job can be performed using only 2 MR jobs provided that there are no filter operators, and the distinct keys are the same across all subqueries.
If my new code were used, there would be no hard limit on the number of MR jobs, because it depends on how many variations there are on the group by keys. I think it is possible that we could first look at the set of subqueries without filters, see how they split on both distinct and group by keys, and for any of these groups where two or more have the same distinct keys run this other method on them. However, this could have been done much more easily before (without the comparison, just group queries with the same distinct keys together), and it wasn't, so I suspect the gains are not that great. I'd much rather do this as part of a separate patch once we have queries that would benefit from it.

I'm not sure whether or not the code is broken, but I would like to leave it for now, to be fixed or modified as I described above as part of a separate patch.

REVISION DETAIL
  https://reviews.facebook.net/D567

> Allow multiple group bys with the same input data and spray keys to be run on the same reducer.
> -----------------------------------------------------------------------------------------------
>
> Key: HIVE-2621
> URL: https://issues.apache.org/jira/browse/HIVE-2621
> Project: Hive
> Issue Type: New Feature
> Reporter: Kevin Wilfong
> Assignee: Kevin Wilfong
> Attachments: HIVE-2621.1.patch.txt, HIVE-2621.D567.1.patch, HIVE-2621.D567.2.patch, HIVE-2621.D567.3.patch
>
>
> Currently, when a user runs a query, such as a multi-insert, where each insertion subclause consists of a simple query followed by a group by, the group by for each clause is run on a separate reducer. This requires writing the data for each group by clause to an intermediate file, and then reading it back. This uses a significant amount of the total CPU consumed by the query for an otherwise simple query.
> If the subclauses are grouped by their distinct expressions and group by keys, with all of the group by expressions for a group of subclauses run on a single reducer, this would reduce the amount of reading/writing to intermediate files for some queries.
> To do this, for each group of subclauses, in the mapper we would execute the filters for each subclause 'or'd together (provided each subclause has a filter) followed by a reduce sink. In the reducer, the child operators would be each subclause's filter followed by the group by and any subsequent operations.
> Note that this would require turning off map aggregation, so we would need to make using this type of plan configurable.
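
To make the plan in the description concrete, here is a rough sketch in Python rather than Hive's Java. It is a toy model only: Subclause, group_by_spray_key, map_side and reduce_side are hypothetical names invented for illustration, not anything in SemanticAnalyzer, and count(*) stands in for whatever aggregation each subclause performs. It assumes no map-side aggregation, in line with the note above.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Subclause:
    """One insert subclause of a multi-insert (hypothetical model, not a Hive class)."""
    name: str
    group_by_keys: tuple
    distinct_exprs: tuple
    row_filter: Optional[Callable[[dict], bool]] = None  # None: no filter


def group_by_spray_key(subclauses):
    """Bucket subclauses whose group by keys and distinct expressions all match."""
    groups = defaultdict(list)
    for sc in subclauses:
        groups[(sc.group_by_keys, sc.distinct_exprs)].append(sc)
    return list(groups.values())


def map_side(rows, group):
    """Emit the rows that pass the OR of the group's filters.

    The combined filter is applied only when every subclause has a filter;
    otherwise at least one subclause needs every row, so nothing is dropped.
    """
    filters = [sc.row_filter for sc in group]
    if all(f is not None for f in filters):
        return [r for r in rows if any(f(r) for f in filters)]
    return list(rows)


def reduce_side(shuffled_rows, group):
    """Re-apply each subclause's own filter, then count rows per group by key."""
    results = {}
    for sc in group:
        counts = defaultdict(int)
        for r in shuffled_rows:
            if sc.row_filter is None or sc.row_filter(r):
                counts[tuple(r[k] for k in sc.group_by_keys)] += 1
        results[sc.name] = dict(counts)
    return results


rows = [
    {"key": "a", "value": 1},
    {"key": "a", "value": 5},
    {"key": "a", "value": 7},
    {"key": "b", "value": 3},
    {"key": "b", "value": 9},
    {"key": "c", "value": 4},
]
subclauses = [
    Subclause("dest1", ("key",), (), lambda r: r["value"] < 4),
    Subclause("dest2", ("key",), (), lambda r: r["value"] > 4),
    Subclause("dest3", ("value",), ()),  # different keys, so its own group
]

groups = group_by_spray_key(subclauses)
shared = groups[0]                    # dest1 and dest2 share one reducer
shuffled = map_side(rows, shared)     # the value == 4 row is dropped here
per_dest = reduce_side(shuffled, shared)
```

In this toy run dest1 and dest2 end up in the same group because their group by keys and distinct expressions are identical, so the input is shuffled once for both; dest3 groups on a different key and gets its own group, which is why the number of jobs depends on how many distinct key combinations the query has.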