Hi Cheng, It would really help if you could simplify your example so it's easy to reproduce by one of us.
Also, can you share the value of hive.optimize.ppd property by doing hive> set hive.optimize.ppd; Does inverting the property (making it false, if it's true) change anything? On Thu, Nov 15, 2012 at 8:18 AM, Chen Song <chen.song...@gmail.com> wrote: > Hi Folks > > We are getting inconsistent output when running some semantically same Hive > queries. We are using CDH3u3 with Hive 0.7.1. > > The query I am running performs a map-side join of two subqueries (s1 and > s2)--s1 is a multi-table join and s2 is a union of a table against itself. > If we add in mapside join to s1 (which is necessary for performance) the > query produces different (and incorrect) results. > > Basically, s1 returns one row and s2 returns 2 rows. By joining them, it > should give us two rows. See below for output and queries (I abridged the > queries a bit for readability). > > NOTE: I searched online and found anther thread reporting a similar issue as > what I have seen, > http://mail-archives.apache.org/mod_mbox/hive-user/201207.mbox/%3CCADejTpz5rSf-Sxi9HNoCTRHVazOk=F+bu6XPL=k7tbknqj9...@mail.gmail.com%3E. > Unfortunately, this thread never got answered. It looks like a bug in Hive. > Greatly appreciate if anyone give thoughts on this issue. > > > Q1 output (incorrect, missing one row): > 2012-08-15 18:56:38 2258929231696355094 1 10 2258929231696355094 2012-08-15 > 18:56:38 1315 0 US 728 90 0 97525 1 0 NULL 1 0 NULL > > Q2 output (correct): > 2012-08-15 18:56:38 2258929231696355094 1 10 2258929231696355094 2012-08-15 > 18:56:38 1315 0 US 728 90 0 97525 1 0 NULL 10 NULL 0 > 2012-08-15 18:56:38 2258929231696355094 1 10 2258929231696355094 2012-08-15 > 18:56:38 1315 0 US 728 90 0 97525 1 0 NULL 10 NULL 1 > > Q1 and Q2 only differs in the way that s1 in Q1 has a mapside join. > > Q1 (incorrect results, highlighted portion is the only difference from Q2): > > > SELECT /*+MAPJOIN(table) */ * > FROM > ( > SELECT /*+ MAPJOIN(tableA,tableB) */ > id, > ... > FROM main_table ctg > JOIN sample smp > ON ctg.id = smp.id > LEFT OUTER JOIN tableA > ON smp.publisher_id = tableA.id > LEFT OUTER JOIN tableB > ON smp.inventory_source_id = tableB.id > WHERE ctg.date_time >= '2012-08-15 00:00:00' and > ctg.date_time <= '2012-08-15 23:59:59' and > ctg.dd = '2012-08-15' and > smp.date_time >= '2012-08-15 00:00:00' and > smp.date_time <= '2012-08-15 23:59:59' and > smp.dd = '2012-08-15' > )sub > JOIN > ( > select > id, > parent_category_id, > cast(0 as tinyint) as is_parent > from table_category > where is_system = 1 > union all > select > id, > parent_category_id, > cast(1 as tinyint) as is_parent > from table_category > where is_system = 1 > )table > ON table.id = sub.id > > Q2 (correct results): > > SELECT /*+MAPJOIN(table) */ * > FROM > ( > SELECT > id, > ... > FROM main_table ctg > JOIN sample smp > ON ctg.id = smp.id > LEFT OUTER JOIN tableA > ON smp.publisher_id = tableA.id > LEFT OUTER JOIN tableB > ON smp.inventory_source_id = tableB.id > WHERE ctg.date_time >= '2012-08-15 00:00:00' and > ctg.date_time <= '2012-08-15 23:59:59' and > ctg.dd = '2012-08-15' and > smp.date_time >= '2012-08-15 00:00:00' and > smp.date_time <= '2012-08-15 23:59:59' and > smp.dd = '2012-08-15' > )sub > JOIN > ( > select > id, > parent_category_id, > cast(0 as tinyint) as is_parent > from table_category > where is_system = 1 > union all > select > id, > parent_category_id, > cast(1 as tinyint) as is_parent > from table_category > where is_system = 1 > )table > ON table.id = sub.id > > -- > Chen Song > >