Recently we tried to upgrade our hive from 0.9 to 0.10, but found some of
our hive queries almost 7 times slow.  One of such query consists multiple
table outer join on the same key. By looking into the query,  we found the
query plans generate by hive 0.9 and hive 0.10 are different.  Here is the
example:

testcase:

use default;
create table test_join (
  `key` string,
  `value` string
);

explain select
    sum(a.value) val
from default.test_join a
left outer join default.test_join b on a.key = b.key
left outer join default.test_join c on a.key = c.key
left outer join default.test_join d on a.key = d.key
left outer join default.test_join e on a.key = e.key
left outer join default.test_join f on a.key = f.key
left outer join default.test_join g on a.key = g.key


the explain of hive 0.9:

 STAGE DEPENDENCIES:

    Stage-1 is a root stage

    Stage-2 depends on stages: Stage-1

    Stage-0 is a root stage

...

Reduce Operator Tree:

          Join Operator

            condition map:

                 Left Outer Join0 to 1

                 Left Outer Join0 to 2

                 Left Outer Join0 to 3

                 Left Outer Join0 to 4

                 Left Outer Join0 to 5

                 Left Outer Join0 to 6

            condition expressions:

              0 {VALUE._col1}

              1

              2

              3

              4

              5

              6

......


while the explain of hive 0.10:

  STAGE DEPENDENCIES:

    Stage-6 is a root stage

    Stage-1 depends on stages: Stage-6

    Stage-2 depends on stages: Stage-1

    Stage-0 is a root stage

...

     Reduce Operator Tree:

          Join Operator

            condition map:

                 Left Outer Join0 to 1

                 Left Outer Join0 to 2

            condition expressions:

              0 {VALUE._col0} {VALUE._col1}

              1

              2

...

     Reduce Operator Tree:

          Join Operator

            condition map:

                 Left Outer Join0 to 1

                 Left Outer Join0 to 2

                 Left Outer Join0 to 3

                 Left Outer Join0 to 4

            condition expressions:

              0 {VALUE._col9}

              1

              2

              3

              4

....


It seems like hive 0.9 use only one stage/job to process all outer joins
but hive 0.10 split them into two stage.  When running such kind of query
on hive0.10 in production,  in the second stage of outer join process, some
reducer stucks.

I can't find any param to change the query plain ,  can anyone give me some
hint?

Thanks!

Reply via email to