[jira] [Commented] (HIVE-2340) optimize orderby followed by a groupby

Ashutosh Chauhan (JIRA) Thu, 17 Jan 2013 11:36:28 -0800

    [ 
https://issues.apache.org/jira/browse/HIVE-2340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13556513#comment-13556513
 ]


Ashutosh Chauhan commented on HIVE-2340:
----------------------------------------

Yeah, correct: JOIN-GBY and GBY-GBY are taken care of in ysmart as well. It's the 
group-by followed by order-by case that is also of interest to me, and this 
patch already covers it. 

Besides the scenarios covered by these two patches, I am also comparing the 
approaches they take. I have only briefly looked at this patch, but the 
fundamental difference I can make out between this approach and the ysmart 
approach is that here a redundant RS is deduplicated, that is, completely 
removed from the operator pipeline wherever possible (i.e. when the keys of 
the subsequent RS are a superset of those of the earlier one), thus fusing 
multiple MR jobs. Ysmart, on the other hand, replaces the second RS with a new 
operator it introduces (LocalSimulatedReduceSink?), which fakes the RS but 
doesn't let the plan split into 2 MR jobs, thus generating one MR job. I 
haven't thought this through completely, but on an initial pass the approach 
of this patch seems better than ysmart's because:
* Here you don't need a new operator.
* Here you are simplifying the plan by eliminating operators, as opposed to 
ysmart, which replaces the operator and thereby increases the complexity of 
the plan (by adding a new type of operator).
* Ysmart currently serializes and deserializes the data passing through that 
new operator, thereby introducing an unnecessary performance penalty. 
Granted, this could be improved, but the problem doesn't exist in the patch 
proposed on this jira to begin with. 
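
To make the RS deduplication condition above concrete, here is a minimal 
sketch in Python (not Hive code). Op, dedup_reduce_sinks and count_mr_jobs are 
hypothetical names made up for this illustration. It models only the 
key-superset test described above and assumes one MR job per remaining RS; it 
ignores sort order, reducer counts and everything else a real optimizer would 
have to check. Which of the two RS operators the patch actually keeps is not 
stated here, so dropping the later one is an assumption.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Op:
    """Hypothetical stand-in for a plan operator (not a Hive class)."""
    kind: str                    # e.g. "TS", "RS", "GBY", "FS"
    keys: Tuple[str, ...] = ()   # partition keys; only meaningful for "RS"


def dedup_reduce_sinks(pipeline: List[Op]) -> List[Op]:
    """Drop a later RS whose keys are a superset of the kept earlier RS keys.

    Assumption: rows that agree on the superset keys also agree on the
    earlier keys, so they were already routed to the same reducer.
    """
    result: List[Op] = []
    kept_rs_keys = None
    for op in pipeline:
        if op.kind == "RS":
            if kept_rs_keys is not None and set(op.keys) >= set(kept_rs_keys):
                continue  # redundant shuffle: remove it from the pipeline
            kept_rs_keys = op.keys
        result.append(op)
    return result


def count_mr_jobs(pipeline: List[Op]) -> int:
    """Simplification: each RS is one shuffle boundary, i.e. one MR job."""
    return max(1, sum(1 for op in pipeline if op.kind == "RS"))


# group-by on "a" followed by order-by on "a"
plan = [Op("TS"), Op("RS", ("a",)), Op("GBY", ("a",)), Op("RS", ("a",)), Op("FS")]
fused = dedup_reduce_sinks(plan)
print(count_mr_jobs(plan), count_mr_jobs(fused))
```

In this toy model the second RS disappears and the plan collapses into a 
single MR job without any new operator type being added, which is the point of 
the first two bullets above.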

There are certainly other scenarios that ysmart covers and this patch does not 
(Yin, can you list those?), but for the scenarios common to both, this 
approach seems to be better. 

There might be other differences between the approaches; please feel free to 
raise them.
                
> optimize orderby followed by a groupby
> --------------------------------------
>
>                 Key: HIVE-2340
>                 URL: https://issues.apache.org/jira/browse/HIVE-2340
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Query Processor
>            Reporter: Navis
>            Assignee: Navis
>            Priority: Minor
>              Labels: perfomance
>         Attachments: ASF.LICENSE.NOT.GRANTED--HIVE-2340.D1209.1.patch, 
> ASF.LICENSE.NOT.GRANTED--HIVE-2340.D1209.2.patch, 
> ASF.LICENSE.NOT.GRANTED--HIVE-2340.D1209.3.patch, 
> ASF.LICENSE.NOT.GRANTED--HIVE-2340.D1209.4.patch, 
> ASF.LICENSE.NOT.GRANTED--HIVE-2340.D1209.5.patch, HIVE-2340.1.patch.txt
>
>
> Before implementing an optimizer for JOIN-GBY, try to implement an RS-GBY 
> optimizer (cluster-by following group-by).

