[ 
https://issues.apache.org/jira/browse/HIVE-2621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13174730#comment-13174730
 ] 

Phabricator commented on HIVE-2621:
-----------------------------------

kevinwilfong has commented on the revision "HIVE-2621 [jira] Allow multiple 
group bys with the same input data and spray keys to be run on the same 
reducer.".

INLINE COMMENTS
  ql/src/test/queries/clientpositive/groupby7_noskew_multi_single_reducer.q:12 
I actually had it in one MR job originally, but the addition of the limits 
caused the addition of an extra job for each insert.  The code to do that is at 
line 6379 of the SemanticAnalyzer.  I assume it does this to ensure all the 
rows go to a single reducer, so the limit can be enforced.

  As for making hive.multigroupby.singlereducer true by default, currently map 
aggregation is set to true by default, and setting this new variable causes the 
map aggregation variable to be ignored for any group by that can go to the same 
reducer as at least one other.

  I'll turn it on by default, I just wanted to make that clear.

  Also, in that case this test isn't needed as I just copied another test and 
turned on hive.multigroupby.singlereducer
  ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java:6211 This 
code looks at all the group bys in the query block and allows for only some of 
the distinct expressions and group by keys to be the same.

  The new code looks for sets of group by subqueries where all of the distinct 
expressions and group by keys are the same.

  I'll go back and see if I reused as much of the definitions of the two 
functions getCommonDistinctExprs and getCommonGroupByKeys as I could have, but 
I don't think this code here could be simplified using my new code.
  ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java:6273 The 
above if block guarantees the job can be performed using only 2 MR jobs 
provided that there are no filter operators, and the distinct keys are the same 
across all subqueries.

  If my new code were used, there is no hard limit on the number of MR jobs 
because it depends on how many variations there are on the group by keys.

  I think it is possible that we could first look at the set of subqueries 
without filters, see how they split on both distinct and group by keys, and for 
any of these groups where two or more have the same distinct keys run this 
other method on them.  However, this could have been done much more easily 
before (without the comparison, just group queries with the same distinct keys 
together), and it wasn't, so I suspect the gains are not that great, so I'd 
much rather do this as part of a separate patch once we have queries that would 
benefit from it.

  I'm not sure whether or not the code is broken, but I would like to leave it 
for now, to be fixed, or modified as I described above as part of a separate 
patch.

REVISION DETAIL
  https://reviews.facebook.net/D567

                
> Allow multiple group bys with the same input data and spray keys to be run on 
> the same reducer.
> -----------------------------------------------------------------------------------------------
>
>                 Key: HIVE-2621
>                 URL: https://issues.apache.org/jira/browse/HIVE-2621
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Kevin Wilfong
>            Assignee: Kevin Wilfong
>         Attachments: HIVE-2621.1.patch.txt, HIVE-2621.D567.1.patch, 
> HIVE-2621.D567.2.patch, HIVE-2621.D567.3.patch
>
>
> Currently, when a user runs a query, such as a multi-insert, where each 
> insertion subclause consists of a simple query followed by a group by, the 
> group bys for each clause are run on a separate reducer.  This requires 
> writing the data for each group by clause to an intermediate file, and then 
> reading it back.  This uses a significant amount of the total CPU consumed by 
> the query for an otherwise simple query.
> If the subclauses are grouped by their distinct expressions and group by 
> keys, with all of the group by expressions for a group of subclauses run on a 
> single reducer, this would reduce the amount of reading/writing to 
> intermediate files for some queries.
> To do this, for each group of subclauses, in the mapper we would execute a 
> the filters for each subclause 'or'd together (provided each subclause has a 
> filter) followed by a reduce sink.  In the reducer, the child operators would 
> be each subclauses filter followed by the group by and any subsequent 
> operations.
> Note that this would require turning off map aggregation, so we would need to 
> make using this type of plan configurable.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Reply via email to