FULL OUTER JOIN Two Small Tables More Efficiently in Hive?

Ji Zhang Wed, 16 Oct 2013 21:53:50 -0700

Hi All,

I have two tables. One has 2,000,000 rows (150M in 6 files), and the
other has 5,000 rows (400K in 1 file). The join is (approximately) a
full outer join, since the city_id field has only 100 distinct values:


CREATE TABLE prop_total AS
SELECT *
FROM prop_1 a
JOIN prop_2 b
ON a.city_id = b.city_id

As a result, it will produce 150,000,000 rows, about 20G.

But during the execution, Hive only opens 4 mappers. It is very slow,
since in my (tiny) cluster, copying 20G with 100 mappers is usually
under 3 minutes.

I tried to set mapred.max.split.size to 15,000,000, and the mappers
increase to 6, which I want to increase to 100.

Is there a way to achieve this?

One thing I can think of is, say, split up the prop_1 table into 100
small files, and make hive open 100 mappers. But it seems.. odd.

Any suggestions will be appreciated.

Thanks in advance,
Jerry

FULL OUTER JOIN Two Small Tables More Efficiently in Hive?

Reply via email to