[ https://issues.apache.org/jira/browse/HIVE-3562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13747626#comment-13747626 ]
Phabricator commented on HIVE-3562:
-----------------------------------

ashutoshc has requested changes to the revision "HIVE-3562 [jira] Some limit can be pushed down to map stage".

INLINE COMMENTS

common/src/java/org/apache/hadoop/hive/conf/HiveConf.java:528
I think we don't need a boolean flag. If the user has set hive.limit.pushdown.memory.usage=0, it is pretty clear he wants the optimization disabled.

ql/src/java/org/apache/hadoop/hive/ql/exec/GroupByOperator.java:1187
I didn't fully follow the example here. What are the keys/values/rows here?

ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:154
"select key from src limit 0;" implies the user doesn't want any data, so the RS doesn't need to emit anything; the optimization is still valid and super useful here, isn't it? So this should be limit < 0 instead of limit <= 0.

ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:157
Can you add a comment explaining why you subtract limit * 64 in this calculation of the threshold?

ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:171
I don't think we need to special-case limit = 0 by null'ing the collector. As I said above, it is better to make the RSHash null in that case. This avoids a null check on the collector in processOp().

ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:325
As I noted above, let's not have this null check. Apart from your null'ing the collector above, could there ever be a case where it is null? I don't think so; in that case let's remove this if-branch from the inner loop.

ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:340
Should this instead be collect(keyWritable, value)?

ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:343
I didn't get your comment about the retry here. Can you add more detail to the comment?

ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:346
collect() instead of out.collect()?
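The two points above about :154 and :157 (only a negative limit should disable the optimization, and the memory threshold reserves limit * 64 bytes) might be sketched roughly as below. This is a hedged illustration of the reviewer's suggestion, not the patch's actual code: the class and parameter names, and the 64-bytes-per-entry estimate, are assumptions for the example.

```java
// Illustrative sketch only: no separate on/off boolean is needed, because a
// memory-usage fraction of 0 (or a negative limit, meaning "no limit") already
// expresses "optimization disabled". Names here are hypothetical, not Hive's.
public class TopNConfig {

    /** Returns the byte threshold for the top-N buffer, or -1 when disabled. */
    public static long computeThreshold(float memUsageFraction, int limit, long maxHeapBytes) {
        // limit == 0 is still a valid query ("select key from src limit 0"),
        // so only limit < 0 turns the optimization off, per the review comment.
        if (memUsageFraction <= 0 || limit < 0) {
            return -1;
        }
        // Reserve a rough fixed per-entry overhead (assumed 64 bytes here, as
        // the patch apparently does) so the buffer's own bookkeeping does not
        // push memory use over the configured budget.
        return (long) (maxHeapBytes * memUsageFraction) - (long) limit * 64;
    }
}
```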
ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:434-436
transient?

ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:471
It would be good to add a LOG.DEBUG() here saying that ReduceSinkHash is being disabled.

ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:477
I didn't get your TODO here. Can you explain?

ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:495
Better name: remove?

ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:498
Add a comment here saying that values[index] can be null if flush() was called after this value was added but before remove() was called.

ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:535
You are using PriorityQueue almost like a sorted set. I think java.util.TreeSet will suffice for what you need here.

ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:566
I don't see you making use of any feature of the Map interface. Looks like TreeSet would be sufficient here as well. If we can use TreeSet in both places, then we don't need these two different implementations of ReduceSinkHash. That would simplify the code quite a bit.

ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:253-264
This handling of the distinct-keys case can also be refactored to use ReduceSinkHash. It would be a good idea to take this up in a follow-up jira.

ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:535
Also need to mark it transient.

ql/src/java/org/apache/hadoop/hive/ql/optimizer/LimitPushdownOptimizer.java:91
If we are able to use one data structure in ReduceSinkHash (TreeSet or something else) for both cases, we won't need this boolean.

ql/src/java/org/apache/hadoop/hive/ql/optimizer/LimitPushdownOptimizer.java:46
How are you enforcing the condition that this optimization applies only to the last RS? Can you add comments about this?
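The TreeSet suggestion in the :535/:566 comments above amounts to a bounded top-N keeper: insert each key, and when the set exceeds the limit, evict the largest. A minimal sketch, assuming ascending sort order and String keys for simplicity; this is not the actual ReduceSinkHash, and note a plain TreeSet collapses duplicate keys, which the real operator (carrying values per key) would still have to handle.

```java
import java.util.TreeSet;

// Minimal sketch of a TreeSet-backed top-N keeper, illustrating the
// reviewer's point that one sorted-set structure could serve both
// ReduceSinkHash variants. Class and method names are illustrative.
public class TopNKeys {

    private final int limit;
    private final TreeSet<String> keys = new TreeSet<>();

    public TopNKeys(int limit) {
        this.limit = limit;
    }

    /** Offers a key; returns true iff it belongs in the current top-N. */
    public boolean offer(String key) {
        keys.add(key);
        if (keys.size() > limit) {
            // Evict the largest key: for an ascending ORDER BY ... LIMIT n,
            // only the n smallest keys can survive into the reducer.
            String evicted = keys.pollLast();
            return !evicted.equals(key); // false if the new key itself was evicted
        }
        return true;
    }

    public TreeSet<String> current() {
        return keys;
    }
}
```

With limit = 2, offering "b", "a", then "c" leaves the set holding {"a", "b"} and reports "c" as rejected, which is exactly the pruning the RS would apply before shuffling.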
ql/src/test/queries/clientpositive/limit_pushdown.q:14
Can you also add tests for the following queries with this optimization on:
explain select distinct(key) from src limit 20;
explain select count(distinct(key)) from src group by key limit 20;
select distinct(key) from src limit 20;
select count(distinct(key)) from src group by key limit 20;

ql/src/test/queries/clientpositive/limit_pushdown.q:43
It would be good to separate the queries where the optimization will not fire into a separate .q file.

REVISION DETAIL
  https://reviews.facebook.net/D5967

BRANCH
  HIVE-3562

ARCANIST PROJECT
  hive

To: JIRA, tarball, ashutoshc, navis
Cc: njain

> Some limit can be pushed down to map stage
> ------------------------------------------
>
>          Key: HIVE-3562
>          URL: https://issues.apache.org/jira/browse/HIVE-3562
>      Project: Hive
>   Issue Type: Bug
>     Reporter: Navis
>     Assignee: Navis
>     Priority: Trivial
>  Attachments: HIVE-3562.D5967.1.patch, HIVE-3562.D5967.2.patch, HIVE-3562.D5967.3.patch, HIVE-3562.D5967.4.patch, HIVE-3562.D5967.5.patch, HIVE-3562.D5967.6.patch
>
> Queries with a limit clause (with a reasonable number), for example
> {noformat}
> select * from src order by key limit 10;
> {noformat}
> produce the operator tree
> TS-SEL-RS-EXT-LIMIT-FS
> But the LIMIT can be partially calculated in the RS, reducing the size of the shuffle:
> TS-SEL-RS(TOP-N)-EXT-LIMIT-FS

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira