[ 
https://issues.apache.org/jira/browse/HIVE-15474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15782893#comment-15782893
 ] 

Jesus Camacho Rodriguez commented on HIVE-15474:
------------------------------------------------

[~lirui], sorry for the delay in replying; I was away for a few days.

I have updated the description of this issue to better explain what I am 
trying to do. I think the initial description was too brief and vague.

Given the physical plan you provided, we will propagate the limit from RS4 into 
RS2 (observe the explain plan above). RS2 emits records only for the _top N_ 
keys of each partition; thus, the GBY3 operator produces results only for those 
keys. Observe in the patch that there is no change to the GBY operator.
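
To illustrate why this is safe, here is a minimal sketch in Python (not Hive 
code; the function names and the sample data are hypothetical, and a simple 
"keep the N smallest keys" filter stands in for the limit in the RS operator). 
Every record belonging to a global _top N_ key survives the local filter, so 
the final ordered and limited result is unchanged, even though groups outside 
the _top N_ may end up with incomplete counts before they are cut off.

```python
from collections import Counter


def top_n_filter(records, n):
    """Stand-in for an RS with a limit: keep only the records whose key is
    among the n smallest distinct keys seen in this partition."""
    keep = set(sorted({key for key, _ in records})[:n])
    return [(key, partial) for key, partial in records if key in keep]


def group_and_limit(partitions, n, push_limit):
    """Merge partial counts per key (the GBY, unchanged in both modes),
    then apply the final ORDER BY key ... LIMIT n."""
    merged = Counter()
    for part in partitions:
        rows = top_n_filter(part, n) if push_limit else part
        for key, partial in rows:
            merged[key] += partial
    return sorted(merged.items())[:n]


# Hypothetical partial aggregates (key, partial count) from three partitions.
partitions = [
    [("a", 2), ("c", 1), ("d", 5)],
    [("b", 1), ("c", 4), ("e", 3)],
    [("a", 1), ("e", 2), ("f", 7)],
]

baseline = group_and_limit(partitions, 2, push_limit=False)
optimized = group_and_limit(partitions, 2, push_limit=True)
print(baseline == optimized)
```

With the limit pushed down, fewer records reach the GBY, but the rows that 
survive the final limit are the same in both runs.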

Concerning Spark: my current understanding is that the chain of operators is 
the same. Thinking further about it, this optimization should not pose any 
problem in that context, since the GBY logic has not changed. If Spark chooses 
to ignore RS2 because it is not sorting the input for GBY3, that should be 
fine: the limit is in the RS2 operator, not in GBY3. Spark will not benefit 
from the optimization, but the result remains correct.

{quote}
Actually I'm also wondering: if we use parallel order by (to use a range 
partitioner rather than a hash partitioner in RS2), we can do the groupBy and 
orderBy in a single stage, which may improve performance in some cases.
{quote}
It might indeed be beneficial in some cases. However, it is a complex 
cost-based decision that would need multiple extensions, as I can think of 
several factors that would influence it, e.g., data skew, the number of records 
for the top N groups, the limit itself, etc.

> Extend limit propagation for chain of RS-GB-RS operators
> --------------------------------------------------------
>
>                 Key: HIVE-15474
>                 URL: https://issues.apache.org/jira/browse/HIVE-15474
>             Project: Hive
>          Issue Type: Bug
>          Components: Physical Optimizer
>    Affects Versions: 2.2.0
>            Reporter: Jesus Camacho Rodriguez
>            Assignee: Jesus Camacho Rodriguez
>         Attachments: HIVE-15474.patch
>
>
> The goal is to extend the work started in HIVE-14002.
> For instance, given the following query:
> {code:sql}
> explain
> select key, value, count(key + 1) as agg1 from src 
> group by key, value
> order by key, value, agg1 limit 20;
> {code}
> We generate the following physical plan:
> {{TS1 - GBY2 - RS3 - GBY4 - RS5 - SEL6 - LIM7 - FS8}}
> We can push the limit to the RS3 operator, as we will generate records only 
> for the _top N_ keys, and thus, GBY4 will produce the _top N_ results. 
> However, currently we do not do it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
