Hi Hive team,

I have a Hive query that is translated into 2000+ map tasks and 1009 reduce
tasks. The reduce tasks are configured to start after all map tasks have
completed. In the reduce phase, 1008 of the reduce tasks complete within
5 minutes, but the last one takes more than 14 hours.

I would expect the reduce tasks to complete at roughly the same time once
data skew is handled. I have set the following parameters to deal with data
skew, but they didn't help:
set hive.optimize.skewjoin=true;
set hive.skewjoin.key=100000000;

Any idea what other parameters I need to set, or how else to reduce the run
time of the reduce phase?

The query is as follows:

WITH uaf AS
(
       SELECT user_id
       FROM   db1.table1
       WHERE  ds = '2018-11-25'
       AND    is_valid
       AND    days_since_last_visit = 0)
SELECT *
FROM   db2.table2 c
WHERE  c.user_id IN
       (
              SELECT user_id
              FROM   uaf)
AND Substr(datehour, 1, 8) = '20181125'
LIMIT 10
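
To narrow down whether a single key is responsible for the slow reducer, a per-key row count along the following lines might help. This is only a sketch: it assumes the skew is on user_id in db2.table2 (for example one very frequent or NULL value), which has not been confirmed, and the alias row_cnt and the LIMIT of 20 are arbitrary choices.

```sql
-- Count rows per user_id for the same day as the original query.
-- Assumption: the slow reducer is caused by one heavily repeated
-- (or NULL) user_id in db2.table2.
SELECT   c.user_id,
         COUNT(*) AS row_cnt
FROM     db2.table2 c
WHERE    Substr(c.datehour, 1, 8) = '20181125'
GROUP BY c.user_id
ORDER BY row_cnt DESC
LIMIT    20;
```

If one user_id shows a row count far above the rest, that key is a likely candidate for the reducer that runs for 14 hours.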

Da
