[ https://issues.apache.org/jira/browse/HIVE-10996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14593260#comment-14593260 ]
Jesus Camacho Rodriguez commented on HIVE-10996: ------------------------------------------------ [~jpullokkaran], thanks for the comments. 1) you are right, the solution should not be tailored towards FIL operator only, as this could happen in other cases; I will change the patch accordingly, 2) the error is produced because the SEL operator above the FIL operator is removed by IdentityProjectRemoval optimization (as SEL and FIL have the same schema, this is what IdentityProjectRemoval is supposed to do). But actually that SEL operator was pruning columns out of the input tuples. In the specific example provided in this case, after the SEL operator there is a JOIN operator, that was then joining on wrong columns and thus, not producing the right results. > Aggregation / Projection over Multi-Join Inner Query producing incorrect > results > -------------------------------------------------------------------------------- > > Key: HIVE-10996 > URL: https://issues.apache.org/jira/browse/HIVE-10996 > Project: Hive > Issue Type: Bug > Components: Hive > Affects Versions: 1.0.0, 1.2.0, 1.1.0, 1.3.0, 2.0.0 > Reporter: Gautam Kowshik > Assignee: Jesus Camacho Rodriguez > Priority: Critical > Attachments: HIVE-10996.01.patch, HIVE-10996.02.patch, > HIVE-10996.03.patch, HIVE-10996.patch, explain_q1.txt, explain_q2.txt > > > We see the following problem on 1.1.0 and 1.2.0 but not 0.13 which seems like > a regression. > The following query (Q1) produces no results: > {code} > select s > from ( > select last.*, action.st2, action.n > from ( > select purchase.s, purchase.timestamp, max (mevt.timestamp) as > last_stage_timestamp > from (select * from purchase_history) purchase > join (select * from cart_history) mevt > on purchase.s = mevt.s > where purchase.timestamp > mevt.timestamp > group by purchase.s, purchase.timestamp > ) last > join (select * from events) action > on last.s = action.s and last.last_stage_timestamp = action.timestamp > ) list; > {code} > While this one (Q2) does produce results : > {code} > select * > from ( > select last.*, action.st2, action.n > from ( > select purchase.s, purchase.timestamp, max (mevt.timestamp) as > last_stage_timestamp > from (select * from purchase_history) purchase > join (select * from cart_history) mevt > on purchase.s = mevt.s > where purchase.timestamp > mevt.timestamp > group by purchase.s, purchase.timestamp > ) last > join (select * from events) action > on last.s = action.s and last.last_stage_timestamp = action.timestamp > ) list; > 1 21 20 Bob 1234 > 1 31 30 Bob 1234 > 3 51 50 Jeff 1234 > {code} > The setup to test this is: > {code} > create table purchase_history (s string, product string, price double, > timestamp int); > insert into purchase_history values ('1', 'Belt', 20.00, 21); > insert into purchase_history values ('1', 'Socks', 3.50, 31); > insert into purchase_history values ('3', 'Belt', 20.00, 51); > insert into purchase_history values ('4', 'Shirt', 15.50, 59); > create table cart_history (s string, cart_id int, timestamp int); > insert into cart_history values ('1', 1, 10); > insert into cart_history values ('1', 2, 20); > insert into cart_history values ('1', 3, 30); > insert into cart_history values ('1', 4, 40); > insert into cart_history values ('3', 5, 50); > insert into cart_history values ('4', 6, 60); > create table events (s string, st2 string, n int, timestamp int); > insert into events values ('1', 'Bob', 1234, 20); > insert into events values ('1', 'Bob', 1234, 30); > insert into events values ('1', 'Bob', 1234, 25); > insert into events values ('2', 'Sam', 1234, 30); > insert into events values ('3', 'Jeff', 1234, 50); > insert into events values ('4', 'Ted', 1234, 60); > {code} > I realize select * and select s are not all that interesting in this context > but what lead us to this issue was select count(distinct s) was not returning > results. The above queries are the simplified queries that produce the issue. > I will note that if I convert the inner join to a table and select from that > the issue does not appear. > Update: Found that turning off hive.optimize.remove.identity.project fixes > this issue. This optimization was introduced in > https://issues.apache.org/jira/browse/HIVE-8435 -- This message was sent by Atlassian JIRA (v6.3.4#6332)