[ https://issues.apache.org/jira/browse/HIVE-6819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13958423#comment-13958423 ]
Harish Butani commented on HIVE-6819:
-------------------------------------

The problem is:
- Column pruning does not prune past the Limit Operator. So the first MR job has a FilterOp whose inputSchema has all the columns, but whose outputSchema is the pruned column list.
- But at runtime we assume the Operators on the Reduce side have the same schema as the ExtractOperator, unless you hit a SelectOperator. So in this case the FilterOp and the subsequent FileSink output rows with all the columns.
- So you end up in a state where the 2nd job is expecting a pruned row, but gets rows with all columns.

This is probably because we didn't expect to see queries with OrderBy/Limit in SubQuery blocks, so ColumnPruner is not handling this case correctly. Investigating some more on this point...

> Correctness issue with Hive limit operator & predicate push down
> ----------------------------------------------------------------
>
>                 Key: HIVE-6819
>                 URL: https://issues.apache.org/jira/browse/HIVE-6819
>             Project: Hive
>          Issue Type: Bug
>          Components: Query Processor
>    Affects Versions: 0.12.0
>            Reporter: Laljo John Pullokkaran
>            Assignee: Laljo John Pullokkaran
>             Fix For: 0.13.0
>
>
> The following query produces 0 rows with the Predicate Push Down optimization turned on; the same query produces 130 rows with predicate push down turned off.
> select t2.c_int from (select key, value, c_float, c_int from t1 order by key,value,c_float,c_int limit 10)t1 join t2 on t1.c_int=t2.c_int and t1.c_float=t2.c_float where t2.c_int>=1;
> I could reproduce this on Apache Trunk.
> Haven't checked if previous releases have the same issue.
> hive> desc t1;
> Query ID = jpullokkaran_20140401191515_36e441c6-074b-45ae-aff6-489e13a6f401
> OK
> key                     string
> value                   string
> c_int                   int
> c_float                 float
> c_boolean               boolean
> Time taken: 0.077 seconds, Fetched: 5 row(s)
> hive> select distinct key, value, c_float, c_int from t1;
> OK
>  1       1      1.0     1
>  1      1       1.0     1
> 1        1      1.0     1
> 1       1       1.0     1
> null    null    NULL    NULL
> Time taken: 0.062 seconds, Fetched: 5 row(s)
> hive> desc t2;
> Query ID = jpullokkaran_20140401191616_dfbd14bb-b5b8-4165-8d01-e9a61a7f1c33
> OK
> key                     string
> value                   string
> c_int                   int
> c_float                 float
> c_boolean               boolean
> Time taken: 0.062 seconds, Fetched: 5 row(s)
> hive> select distinct key, value, c_float, c_int from t2;
> OK
>  1       1      1.0     1
>  1      1       1.0     1
> 1        1      1.0     1
> 1       1       1.0     1
> 2       2       2.0     2
> null    null    NULL    NULL
> Time taken: 4.698 seconds, Fetched: 6 row(s)
> hive> select t2.c_int from (select key, value, c_float, c_int from t1 order by key,value,c_float,c_int limit 10)t1 join t2 on t1.c_int=t2.c_int and t1.c_float=t2.c_float where t2.c_int>=1;
> MapredLocal task succeeded
> OK
> Time taken: 13.029 seconds
> hive>
> hive> select t2.c_int from (select key, value, c_float, c_int from t1 order by key,value,c_float,c_int limit 10)t1 join t2 on t1.c_int=t2.c_int and t1.c_float=t2.c_float where t2.c_int>=1;
> MapredLocal task succeeded
> OK
> ...
> 1
> 1
> 1
> 1
> 1
> 1
> 1
> 1
> 1
> 1
> 1
> Time taken: 9.317 seconds, Fetched: 130 row(s)
> hive>

--
This message was sent by Atlassian JIRA
(v6.2#6252)