I'm inserting from an unpartitioned table holding 6 hours of data into a table partitioned by hour. The source table is 400M rows and 500GB, so it needs a lot of reducers working on the data. Hive chose 544, which sounds good, but 538 of those reducers did nothing while the other 6 have been working on all the data for over an hour. Running `EXPLAIN` on the query shows:

```
Map-reduce partition columns: _col54 (type: int), _col55 (type: int), _col56 (type: int), _col57 (type: int)
```

which are the partition columns of the destination table (year, month, day, hour). That's an unnecessary centralization of work: I don't need each partition to be written by only one reducer. Each destination partition should instead contain a bunch of output files from various reducers. If I wrote my own M/R job, I would use MultipleOutputs and partition on epoch or something.

So I hacked around it and added another partition column to the destination table after the hour column: a random number up to 200 (minimal sketch at the end of the post). Now all the reducers are sharing the work.

*Is there any other way I can get Hive to distribute the work to all reducers without hacking the table DDL with random columns?*

I'm on Hive 0.13 with Beeline and HiveServer2, and I start the query off with these settings:

```
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
```

Thanks
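For reference, here's roughly what the hack looks like. This is a minimal sketch with made-up table and column names (`events_dest`, `events_src`, `ts`, etc.), not my actual DDL:

```sql
-- Hypothetical destination table: the extra `bucket` partition column is the hack.
-- It spreads each hour's rows across up to 200 reducers instead of funnelling
-- every row of a given hour through a single reducer.
CREATE TABLE events_dest (
  id      BIGINT,
  payload STRING
)
PARTITIONED BY (year INT, month INT, day INT, hour INT, bucket INT);

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Dynamic-partition insert: the partition columns come last in the SELECT,
-- and `bucket` is a random shard in [0, 200).
INSERT OVERWRITE TABLE events_dest
PARTITION (year, month, day, hour, bucket)
SELECT
  id,
  payload,
  year(ts)  AS year,
  month(ts) AS month,
  day(ts)   AS day,
  hour(ts)  AS hour,
  CAST(rand() * 200 AS INT) AS bucket
FROM events_src;
```

The obvious downside is that each hour now explodes into up to 200 extra partition directories, which is why I'd rather find a cleaner way.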
The source table is 400M rows and 500GB so it's needs a lot of reducers working on the data - Hive chose 544 which sounds good. But 538 reducers did nothing and the other 6 are working for over an hour with all the data. I see from running explain on the query: Map-reduce partition columns: _col54 (type: int), _col55 (type: int), _col56 (type: int), _col57 (type: int) which the partition columns of the destination table (year, month, day, hour). That's an unnecessary centralization of work, I don't need each partition to be written by only one reducer. Each destination partition should instead include a bunch of output files from various Reducers. If I wrote my own M/R job I would use MultipleOutputs and partition on epoch or something. So I hacked it, and added another column to the destination partition after the hour column- a random number up to 200. Now all the reducers are sharing the work. *Is there any other way I can get Hive to distribute the work to all reducers without hacking the table DDL with random columns?* I'm on Hive 0.13 with Beeline and HiveServer2 and start the query off with the settings: set hive.exec.dynamic.partition=true; set hive.exec.dynamic.partition.mode=nonstrict; Thanks