Hi Gopal,

Thanks for your input! In my case I'm using MapReduce, not Tez. I figured I'd better be more specific and give you more details.
For this job there are 298 maps and 74 reduces. All the maps completed quickly, within 1 minute, and 73 of the reduces completed in about 2 minutes. Now there is only 1 reduce task still running (seemingly forever). Here's a screenshot of the job details: https://ibb.co/eBDj6R

I noticed one interesting thing: MAP_OUTPUT_RECORDS and REDUCE_INPUT_RECORDS don't match for the whole job (99,073,863 vs. 98,105,913). Here's a screenshot of the counters for the dangling reduce task: https://ibb.co/dHgyY6 The ratio of REDUCE_INPUT_RECORDS to REDUCE_INPUT_GROUPS is 1. What does that mean?

For comparison, here's a screenshot of the counters for a different reduce task, one that completed within 1 minute. It also has a ratio of 1: https://ibb.co/mzoHRR

I also compared the task logs. Below, No. 1 is for the dangling reduce task, and Nos. 2 and 3 are for two completed reduce tasks.

1. https://ibb.co/caKVfm
2. https://ibb.co/earJ0m
3. https://ibb.co/edQiY6

I can't tell what the running reduce task is doing. Are there any other logs that might be helpful?

Regards,
Daniel

On Thu, Oct 19, 2017 at 9:45 PM, Gopal Vijayaraghavan <gop...@apache.org> wrote:

> > I didn't see data skew for that reducer. It has similar amount of
> > REDUCE_INPUT_RECORDS as other reducers.
> …
> > org.apache.hadoop.hive.ql.exec.CommonJoinOperator: table 0 has 8000
> > rows for join key [4092813312923569]
>
> The ratio of REDUCE_INPUT_RECORDS and REDUCE_INPUT_GROUPS is what is
> relevant.
>
> The row containers being spilled to disk means that at least 1 key in the
> join has > 10000 values.
>
> If you have Tez, this comes up when you run the SkewAnalyzer.
>
> https://github.com/apache/tez/blob/master/tez-tools/analyzers/job-analyzer/src/main/java/org/apache/tez/analyzer/plugins/SkewAnalyzer.java#L41
>
> Cheers,
> Gopal
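P.S. Since the spilled row containers point at a hot join key, I figure one direct way to confirm it is to count rows per join key on each join input. A minimal HiveQL sketch (my_table and join_col are placeholders, not the actual table and join column from my query):

    -- Count rows per join key and surface the heaviest keys.
    -- my_table and join_col are placeholder names, not the real ones.
    SELECT join_col, COUNT(*) AS cnt
    FROM my_table
    GROUP BY join_col
    ORDER BY cnt DESC
    LIMIT 20;

If one key comes back with far more rows than the rest (say, past the 10000-value spill threshold mentioned above), that would line up with the CommonJoinOperator messages in the task log.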