Hi All, I have two tables. One has 2,000,000 rows (150M in 6 files), and the other has 5,000 rows (400K in 1 file). The join is (approximately) a full outer join, since the city_id field has only 100 distinct values:
CREATE TABLE prop_total AS SELECT * FROM prop_1 a JOIN prop_2 b ON a.city_id = b.city_id As a result, it will produce 150,000,000 rows, about 20G. But during the execution, Hive only opens 4 mappers. It is very slow, since in my (tiny) cluster, copying 20G with 100 mappers is usually under 3 minutes. I tried to set mapred.max.split.size to 15,000,000, and the mappers increase to 6, which I want to increase to 100. Is there a way to achieve this? One thing I can think of is, say, split up the prop_1 table into 100 small files, and make hive open 100 mappers. But it seems.. odd. Any suggestions will be appreciated. Thanks in advance, Jerry