[ https://issues.apache.org/jira/browse/HIVE-15474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15782893#comment-15782893 ]
Jesus Camacho Rodriguez commented on HIVE-15474: ------------------------------------------------ [~lirui], sorry for taking some time replying, I was away for a few days. I have updated the description of this issue to try to explain better what I am trying to do. I think initial description was too brief and vague. Given the physical plan you provide, we will propagate the limit from RS4 into RS2 (observe the explain plan above). RS2 produces the _top N_ keys for each partition; thus, GBY3 operator produces only results for those keys. Observe in the patch that there is no change for the GBY operator. Concerning Spark. My current understanding is that the chain of operators is the same. But I was thinking further about it, and this optimization should not pose any problem in that context, since GBY logic has not changed. If Spark chooses to ignore RS2 since it is not sorting the input for GBY3, that should be fine: the limit is in the RS2 operator, not in GBY3. Spark will not benefit from the optimization, but it still remains correct. {quote} Actually I'm also wondering: if we use parallel order by (to use a range partitioner rather than a hash partitioner in RS2), we can do the groupBy and orderBy in a single stage, which may improve performance in some cases. {quote} It might be beneficial in some cases indeed. However, it is a complex cost-based decision which would need multiple extensions, as I can think on multiple factors that would influence it, e.g., data skew, number of records for the top N groups, the limit of records itself, etc. > Extend limit propagation for chain of RS-GB-RS operators > -------------------------------------------------------- > > Key: HIVE-15474 > URL: https://issues.apache.org/jira/browse/HIVE-15474 > Project: Hive > Issue Type: Bug > Components: Physical Optimizer > Affects Versions: 2.2.0 > Reporter: Jesus Camacho Rodriguez > Assignee: Jesus Camacho Rodriguez > Attachments: HIVE-15474.patch > > > The goal is to extend the work started in HIVE-14002. > For instance, given the following query: > {code:sql} > explain > select key, value, count(key + 1) as agg1 from src > group by key, value > order by key, value, agg1 limit 20; > {code} > We generate the following physical plan: > {{TS1 - GBY2 - RS3 - GBY4 - RS5 - SEL6 - LIM7 - FS8}} > We can push the limit to RS3 operator, as we will generate records for the > _top N_ keys, and thus, GBY4 will produce the _top N_ results. However, > currently we do not do it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)