[ https://issues.apache.org/jira/browse/HIVE-17896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16496033#comment-16496033 ]
Gopal V commented on HIVE-17896: -------------------------------- [~teddy.choi]: the pushdown is too aggressive at the moment. The operator impl looks good, so it is the optimizer cases which I'm concerned about and would like you to split the multi-op push-down into a different patch. The best way to get the operator level changes committed right now would be to restrict the push-down to a simple case & let it bake-through for a few test cycles (specifically, the pushdown through GBY case in the JIRA: GBY->RS(TopN) => TNK->GBY->RS). For example. {code} tez/topnkey.q.out SELECT src1.key, src2.value FROM src src1 JOIN src src2 ON (src1.key = src2.key) ORDER BY src1.key LIMIT 5 {code} produces {code} + <-Reducer 2 [SIMPLE_EDGE] + SHUFFLE [RS_10] + Select Operator [SEL_9] (rows=809 width=178) + Output:["_col0","_col1"] + Top N Key Operator [TNK_18] (rows=809 width=178) + keys:_col0,sort order:+,top n:5 + Merge Join Operator [MERGEJOIN_21] (rows=809 width=178) + Conds:RS_6._col0=RS_7._col0(Inner),Output:["_col0","_col2"] + <-Map 1 [SIMPLE_EDGE] + SHUFFLE [RS_6] + PartitionCols:_col0 + Select Operator [SEL_2] (rows=500 width=87) ... + Filter Operator [FIL_16] (rows=500 width=87) + predicate:key is not null + Top N Key Operator [TNK_19] (rows=500 width=87) + keys:key,sort order:+,top n:5 + TableScan [TS_0] (rows=500 width=87) + default@src,src1,Tbl:COMPLETE,Col:COMPLETE,Output:["key"] {code} The TNK_18 is good, but the TNK_19 is likely to give incorrect results due to the nature of the join filter case. I'm happy to review these as two changes (i.e new operator & aggressive push-down) and get them in one by one. > TopNKey: Create a standalone vectorizable TopNKey operator > ---------------------------------------------------------- > > Key: HIVE-17896 > URL: https://issues.apache.org/jira/browse/HIVE-17896 > Project: Hive > Issue Type: New Feature > Components: Operators > Affects Versions: 3.0.0 > Reporter: Gopal V > Assignee: Teddy Choi > Priority: Major > Attachments: HIVE-17896.1.patch, HIVE-17896.3.patch, > HIVE-17896.4.patch, HIVE-17896.5.patch, HIVE-17896.6.patch, > HIVE-17896.7.patch, HIVE-17896.8.patch, HIVE-17896.9.patch > > > For TPC-DS Query27, the TopN operation is delayed by the group-by - the > group-by operator buffers up all the rows before discarding the 99% of the > rows in the TopN Hash within the ReduceSink Operator. > The RS TopN operator is very restrictive as it only supports doing the > filtering on the shuffle keys, but it is better to do this before breaking > the vectors into rows and losing the isRepeating properties. > Adding a TopN Key operator in the physical operator tree allows the following > to happen. > GBY->RS(Top=1) > can become > TNK(1)->GBY->RS(Top=1) > So that, the TopNKey can remove rows before they are buffered into the GBY > and consume memory. > Here's the equivalent implementation in Presto > https://github.com/prestodb/presto/blob/master/presto-main/src/main/java/com/facebook/presto/operator/TopNOperator.java#L35 > Adding this as a sub-feature of GroupBy prevents further optimizations if the > GBY is on keys "a,b,c" and the TopNKey is on just "a". -- This message was sent by Atlassian JIRA (v7.6.3#76005)