Recently we tried to upgrade our hive from 0.9 to 0.10, but found some of our hive queries almost 7 times slow. One of such query consists multiple table outer join on the same key. By looking into the query, we found the query plans generate by hive 0.9 and hive 0.10 are different. Here is the example:
testcase: use default; create table test_join ( `key` string, `value` string ); explain select sum(a.value) val from default.test_join a left outer join default.test_join b on a.key = b.key left outer join default.test_join c on a.key = c.key left outer join default.test_join d on a.key = d.key left outer join default.test_join e on a.key = e.key left outer join default.test_join f on a.key = f.key left outer join default.test_join g on a.key = g.key the explain of hive 0.9: STAGE DEPENDENCIES: Stage-1 is a root stage Stage-2 depends on stages: Stage-1 Stage-0 is a root stage ... Reduce Operator Tree: Join Operator condition map: Left Outer Join0 to 1 Left Outer Join0 to 2 Left Outer Join0 to 3 Left Outer Join0 to 4 Left Outer Join0 to 5 Left Outer Join0 to 6 condition expressions: 0 {VALUE._col1} 1 2 3 4 5 6 ...... while the explain of hive 0.10: STAGE DEPENDENCIES: Stage-6 is a root stage Stage-1 depends on stages: Stage-6 Stage-2 depends on stages: Stage-1 Stage-0 is a root stage ... Reduce Operator Tree: Join Operator condition map: Left Outer Join0 to 1 Left Outer Join0 to 2 condition expressions: 0 {VALUE._col0} {VALUE._col1} 1 2 ... Reduce Operator Tree: Join Operator condition map: Left Outer Join0 to 1 Left Outer Join0 to 2 Left Outer Join0 to 3 Left Outer Join0 to 4 condition expressions: 0 {VALUE._col9} 1 2 3 4 .... It seems like hive 0.9 use only one stage/job to process all outer joins but hive 0.10 split them into two stage. When running such kind of query on hive0.10 in production, in the second stage of outer join process, some reducer stucks. I can't find any param to change the query plain , can anyone give me some hint? Thanks!