Hi all, I have a query doing an INSERT OVERWRITE like this:
WITH tbl1 AS (
    SELECT col0, col1, local_date, local_hour FROM tbl1 WHERE ....
),
tbl2 AS (
    SELECT col0, col1, local_date, local_hour FROM tbl2 WHERE ....
)
INSERT OVERWRITE TABLE tbl3 PARTITION (local_date, local_hour)
SELECT * FROM tbl1
UNION DISTINCT
SELECT * FROM tbl2;

Each output partition contains ~15 GB of compressed Parquet data. The input tables are Avro. I'm running Hive 2.3.2 using Tez on EMR in AWS.

My problem is that the final reduce phase spins up x reducers, where x is, say, 60. If my query has only 3 partitions to write, all reducers finish very quickly except the 3 that actually write the files. This makes the whole process very slow, and it also creates one very large part file in each output partition. I've been trying to control the size of the output files and to force multiple reducers to write to the same output partition.

Settings I've been using:

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions.pernode=2000;
set hive.exec.max.dynamic.partitions=20000;
set mapred.reduce.tasks=9;

Any ideas?

Thanks,
Patrick
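P.S. To make "multiple reducers per output partition" concrete, here is a rough sketch of the kind of rewrite I had in mind: salting the shuffle key with DISTRIBUTE BY so that each (local_date, local_hour) partition is spread over several reducers. The salt factor of 20 is just an illustrative number I picked, and I haven't verified that this behaves as expected on Tez:

INSERT OVERWRITE TABLE tbl3 PARTITION (local_date, local_hour)
SELECT col0, col1, local_date, local_hour
FROM (
    SELECT col0, col1, local_date, local_hour FROM tbl1 WHERE ....
    UNION DISTINCT
    SELECT col0, col1, local_date, local_hour FROM tbl2 WHERE ....
) unioned
-- spread each target partition across up to 20 reducers, so each reducer
-- writes its own smaller part file into that partition
DISTRIBUTE BY local_date, local_hour, FLOOR(RAND() * 20);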