Hi all,

I have a query doing an INSERT OVERWRITE like this:

WITH tbl1 AS (
  SELECT col0, col1, local_date, local_hour
  FROM tbl1
  WHERE ....
),
tbl2 AS (
  SELECT col0, col1, local_date, local_hour
  FROM tbl2
  WHERE ....
)
INSERT OVERWRITE TABLE tbl3
PARTITION (local_date, local_hour)
  SELECT * FROM tbl1
  UNION DISTINCT
  SELECT * FROM tbl2

Each partition contains ~15GB of compressed Parquet data. The input tables
are Avro.
I'm running Hive 2.3.2 with Tez on EMR in AWS.
My problem is that the final reduce phase spins up with x reducers, where x
is, say, 60. If the query has only 3 partitions to write, all of the
reducers finish very quickly except the 3 that actually write the files.
This makes the whole job very slow and also leaves one very large part
file in each output partition.
I've been trying to control the size of the output files and to force
multiple reducers to write into the same output partition.
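
For example, the only way I've come up with to spread one partition's rows
over several reducers is a salted DISTRIBUTE BY, roughly like the sketch
below (keeping the same WITH clause as above; the bucket count of 7 is an
arbitrary number I picked, and I'm aware rand() isn't deterministic across
task retries, so this is just the idea):

INSERT OVERWRITE TABLE tbl3
PARTITION (local_date, local_hour)
SELECT u.col0, u.col1, u.local_date, u.local_hour
FROM (
  SELECT * FROM tbl1
  UNION DISTINCT
  SELECT * FROM tbl2
) u
-- salt the key so ~7 reducers share each (local_date, local_hour) partition
DISTRIBUTE BY u.local_date, u.local_hour, CAST(rand() * 7 AS INT);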

Settings I've been using:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions.pernode=2000;
set hive.exec.max.dynamic.partitions=20000;
set mapred.reduce.tasks=9;
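
I also looked at sizing the reducers directly, e.g. (the 256MB value is
just a guess on my part):

set hive.exec.reducers.bytes.per.reducer=256000000;

but as far as I can tell that only changes how many reducers spin up, not
which reducer each partition's rows end up on, so it doesn't fix the skew
by itself.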

Any ideas?

Thanks,
 Patrick
