Harish Butani created HIVE-5873: ----------------------------------- Summary: SubQuery: In subquery Count Bug Key: HIVE-5873 URL: https://issues.apache.org/jira/browse/HIVE-5873 Project: Hive Issue Type: Bug Components: Query Processor Reporter: Harish Butani
This is from the Optimization of Nested SQl Queries Revisited paper: http://dl.acm.org/citation.cfm?id=38723 Consider Part table having: {noformat} PNum OrderOnHand -------- ------------------ 3 6 10 1 8 0 {noformat} Supply table having: {noformat} PNum Qty 3 4 3 2 10 1 {noformat} The query: {noformat} select pnum from parts p where orderOnHand in (select count(*) from supply s where s.pnum = p.pnum ) {noformat} should return the row with PNum=8. But a transformation to a semi-join would eliminate this row, as there are no rows in supply table with PNum=8. AS shown in the paper the soln is to transform to: {noformat} select pnum from parts p semijoin (select p1.pnum, count(*) as c from (select distinct pnum from parts) p1 join supply s where s.pnum = p1.pnum ) sq on p.pnum = sq.pnum and p.orderOnHand = sq.c {noformat} The additional distinct query within the SubQuery is to handle duplicates in the outer query on the joining columns. -- This message was sent by Atlassian JIRA (v6.1#6144)