I'm inserting from an unpartitioned table containing 6 hours of data
into a table partitioned by hour.

The source table is 400M rows and 500GB, so it needs a lot of
reducers working on the data - Hive chose 544, which sounds good.

But 538 of those reducers did nothing, and the other 6 have been
working on all the data for over an hour - presumably one reducer per
destination hour partition.

I see from running explain on the query:
Map-reduce partition columns: _col54 (type: int), _col55 (type: int),
_col56 (type: int), _col57 (type: int)

which are the partition columns of the destination table (year,
month, day, hour).
That's an unnecessary centralization of work; I don't need each
partition to be written by only one reducer.  Each destination
partition should instead contain a bunch of output files from various
reducers.  If I wrote my own M/R job I would use MultipleOutputs and
partition on epoch or something.

So I hacked around it, and added another partition column to the
destination table after the hour column - a random number up to 200.
Now all the reducers are sharing the work.
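
In case the shape helps, the hack looks roughly like this (just a
sketch - the table and column names are made up, the extra part
column is the only point):

CREATE TABLE events_by_hour (payload STRING)
PARTITIONED BY (year INT, month INT, day INT, hour INT, part INT);

-- the dynamic-partition insert gives each row a random bucket 0-199,
-- so every hour fans out across up to 200 reducers instead of one
INSERT OVERWRITE TABLE events_by_hour
PARTITION (year, month, day, hour, part)
SELECT payload, year, month, day, hour,
       CAST(rand() * 200 AS INT) AS part
FROM events_raw;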

*Is there any other way I can get Hive to distribute the work to all
reducers without hacking the table DDL with random columns?*
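
For example, I'd be happy if something at the query level did it,
along the lines of this sketch (made-up names again, and here the
destination keeps only the original four partition columns - I
haven't verified that Hive honors the DISTRIBUTE BY rather than its
own clustering on the partition columns):

INSERT OVERWRITE TABLE events_by_hour
PARTITION (year, month, day, hour)
SELECT payload, year, month, day, hour
FROM events_raw
-- spread each hour's rows over up to 200 reducers
DISTRIBUTE BY year, month, day, hour, CAST(rand() * 200 AS INT);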

I'm on Hive 0.13 with Beeline and HiveServer2, and I start the query
off with these settings:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

Thanks
