[ https://issues.apache.org/jira/browse/HIVE-2340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13556513#comment-13556513 ]
Ashutosh Chauhan commented on HIVE-2340:
----------------------------------------

Yeah, correct: JOIN-GBY and GBY-GBY are taken care of in ysmart also. It's the group-by followed by order-by case that is also of interest to me, and this patch already covers it.

Besides the scenarios covered by these two patches, I am also comparing the approaches taken in the two. I have only briefly looked at this patch, but the fundamental difference I can make out between this approach and the ysmart approach is that here the RS is deduplicated, that is, completely removed from the operator pipeline wherever possible (i.e. when the keys of the subsequent RS are a superset of those of the earlier one), thus fusing multiple MR jobs. Ysmart, on the other hand, replaces the second RS with a new operator it introduces (LocalSimulatedReduceSink?), which fakes the RS but doesn't let the plan split into 2 MR jobs, thus generating one MR job.

I haven't thought this through completely, but on an initial pass it seems the approach of this patch is better than ysmart's because:

* Here you don't need a new operator.
* Here you are simplifying the plan by eliminating operators, as opposed to ysmart, which replaces the operator and thereby increases the complexity of the plan (by adding a new type of operator).
* In that new operator, ysmart currently serializes and deserializes the data, thereby introducing an unnecessary performance penalty. Granted, this could be improved, but the problem doesn't exist in the patch proposed on this jira to begin with.

There are certainly other scenarios that ysmart can cover (Yin, can you list those?) and this patch does not, but for the scenarios the two have in common this approach seems to be better. There might be other differences in the approach; please feel free to raise those.
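
To make the deduplication idea concrete, here is a minimal toy sketch in Python, not Hive code. It models a linear operator pipeline and drops a later RS whenever its keys are a superset of the earlier RS's keys, which is the only condition the comment states. The names `Op`, `dedup_reduce_sinks` and `count_shuffles` are hypothetical. Letting the surviving RS take the wider key list is an assumption of this sketch, as is counting one shuffle (roughly one MR job) per remaining RS. The actual patch may check more than this.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Op:
    """Hypothetical stand-in for a plan operator."""
    kind: str                    # e.g. "TS", "RS", "GBY", "FS"
    keys: Tuple[str, ...] = ()   # only meaningful when kind == "RS"


def dedup_reduce_sinks(plan: List[Op]) -> List[Op]:
    """Remove a later RS when its keys are a superset of the earlier RS's keys."""
    out: List[Op] = []
    last_rs: Optional[int] = None  # index in `out` of the most recent surviving RS
    for op in plan:
        if op.kind == "RS":
            if last_rs is not None and set(out[last_rs].keys) <= set(op.keys):
                # Assumption of this sketch: the surviving RS takes the wider
                # key list, and the later RS disappears from the pipeline.
                out[last_rs] = Op("RS", op.keys)
                continue
            last_rs = len(out)
        out.append(op)
    return out


def count_shuffles(plan: List[Op]) -> int:
    """Each remaining RS is treated as one shuffle boundary."""
    return sum(1 for op in plan if op.kind == "RS")


# Group-by on k followed by order-by on k: two RS operators before the rewrite.
gby_then_orderby = [Op("TS"), Op("RS", ("k",)), Op("GBY"), Op("RS", ("k",)), Op("FS")]
fused = dedup_reduce_sinks(gby_then_orderby)

# Unrelated keys: nothing can be fused.
unrelated = [Op("TS"), Op("RS", ("a",)), Op("GBY"), Op("RS", ("b",)), Op("FS")]
untouched = dedup_reduce_sinks(unrelated)
```

In this model the first plan goes from two RS operators to one, while the second keeps both because neither key set contains the other.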
> optimize orderby followed by a groupby
> --------------------------------------
>
>                 Key: HIVE-2340
>                 URL: https://issues.apache.org/jira/browse/HIVE-2340
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Query Processor
>            Reporter: Navis
>            Assignee: Navis
>            Priority: Minor
>              Labels: perfomance
>         Attachments: ASF.LICENSE.NOT.GRANTED--HIVE-2340.D1209.1.patch,
> ASF.LICENSE.NOT.GRANTED--HIVE-2340.D1209.2.patch,
> ASF.LICENSE.NOT.GRANTED--HIVE-2340.D1209.3.patch,
> ASF.LICENSE.NOT.GRANTED--HIVE-2340.D1209.4.patch,
> ASF.LICENSE.NOT.GRANTED--HIVE-2340.D1209.5.patch, HIVE-2340.1.patch.txt
>
>
> Before implementing optimizer for JOIN-GBY, try to implement RS-GBY
> optimizer (cluster-by following group-by).