Moreover, the IP 10.224.174.71 belongs to a different node, not the one
executing the reduce task. Why would that happen?

On Fri, Oct 20, 2017 at 3:37 PM, Daniel Bruce <dbse...@gmail.com> wrote:

> OK, more updates. Today I was running the query with Yarn and also turned
> on DEBUG logging. Here's what I found from the task log for the dangling
> task.
>
> I noticed that after the RowContainer was created by Hive, there are
> a lot of IPC/RPC-related log lines (still printing), about 3 seconds apart. My
> guess is that Hive is reading the rows back from disk one row at a time? One
> row every 3 seconds? No wonder my job never ends...
>
> Could anyone confirm this? Appreciate it!
>
> 2017-10-20 13:43:23,144 INFO [main] 
> org.apache.hadoop.hive.ql.exec.CommonJoinOperator:
> table 0 has 1000 rows for join key [999]
> 2017-10-20 13:43:23,146 INFO [main] 
> org.apache.hadoop.hive.ql.exec.CommonJoinOperator:
> table 0 has 2000 rows for join key [999]
> 2017-10-20 13:43:23,153 INFO [main] 
> org.apache.hadoop.hive.ql.exec.CommonJoinOperator:
> table 0 has 4000 rows for join key [999]
> 2017-10-20 13:43:23,178 INFO [main] 
> org.apache.hadoop.hive.ql.exec.CommonJoinOperator:
> table 0 has 8000 rows for join key [999]
> 2017-10-20 13:43:23,205 INFO [main] 
> org.apache.hadoop.hive.ql.exec.CommonJoinOperator:
> table 0 has 16000 rows for join key [999]
> 2017-10-20 13:43:23,222 INFO [main] 
> org.apache.hadoop.hive.ql.exec.persistence.RowContainer:
> RowContainer created temp file /tmp/hadoop-cloud1/nm-local-
> dir/usercache/dbruce/appcache/application_1507669571623_
> 4416/container_e26_1507669571623_4416_01_000418/tmp/hive-
> rowcontainer5891981157220982604/RowContainer8985737720168249326.tmp
> 2017-10-20 13:43:23,354 INFO [main] 
> org.apache.hadoop.hive.ql.exec.CommonJoinOperator:
> table 0 has 32000 rows for join key [999]
> 2017-10-20 13:43:23,427 INFO [main] 
> org.apache.hadoop.hive.ql.exec.CommonJoinOperator:
> table 0 has 64000 rows for join key [999]
> 2017-10-20 13:43:23,597 INFO [main] 
> org.apache.hadoop.hive.ql.exec.CommonJoinOperator:
> table 0 has 128000 rows for join key [999]
> 2017-10-20 13:43:23,817 DEBUG [main] org.apache.hadoop.util.Progress:
> Illegal progress value found, progress is larger than 1. Progress will be
> changed to 1
> 2017-10-20 13:43:23,817 DEBUG [main] org.apache.hadoop.util.Progress:
> Illegal progress value found, progress is larger than 1. Progress will be
> changed to 1
> ...
> ...
> 2017-10-20 13:43:25,239 DEBUG [main] org.apache.hadoop.util.Progress:
> Illegal progress value found, progress is larger than 1. Progress will be
> changed to 1
> 2017-10-20 13:43:25,239 DEBUG [main] org.apache.hadoop.util.Progress:
> Illegal progress value found, progress is larger than 1. Progress will be
> changed to 1
> 2017-10-20 13:43:25,248 DEBUG [main] org.apache.hadoop.io.nativeio.NativeIO:
> Got UserName dbruce for ID 1573865840 from the native implementation
> 2017-10-20 13:43:25,249 DEBUG [main] org.apache.hadoop.io.nativeio.NativeIO:
> Got GroupName db_finan for ID 10100 from the native implementation
> 2017-10-20 13:43:25,249 DEBUG [main] org.apache.hadoop.mapred.FileInputFormat:
> Time taken to get FileStatuses: 7
> 2017-10-20 13:43:25,249 INFO [main] org.apache.hadoop.mapred.FileInputFormat:
> Total input paths to process : 1
> 2017-10-20 13:43:25,255 DEBUG [main] org.apache.hadoop.mapred.FileInputFormat:
> Total # of splits generated by getSplits: 1, TimeTaken: 14
> 2017-10-20 13:43:27,095 DEBUG [communication thread]
> org.apache.hadoop.yarn.util.ProcfsBasedProcessTree: [ 7382 7463 ]
> 2017-10-20 13:43:27,096 DEBUG [IPC Parameter Sending Thread #0]
> org.apache.hadoop.ipc.Client: IPC Client (13419636) connection to /
> 10.224.174.71:33647 from job_1507669571623_4416 sending #7
> 2017-10-20 13:43:27,097 DEBUG [IPC Client (13419636) connection to /
> 10.224.174.71:33647 from job_1507669571623_4416]
> org.apache.hadoop.ipc.Client: IPC Client (13419636) connection to /
> 10.224.174.71:33647 from job_1507669571623_4416 got value #7
> 2017-10-20 13:43:27,097 DEBUG [communication thread]
> org.apache.hadoop.ipc.RPC: Call: statusUpdate 2
> 2017-10-20 13:43:30,161 DEBUG [communication thread]
> org.apache.hadoop.yarn.util.ProcfsBasedProcessTree: [ 7382 7463 ]
> 2017-10-20 13:43:30,161 DEBUG [IPC Parameter Sending Thread #0]
> org.apache.hadoop.ipc.Client: IPC Client (13419636) connection to /
> 10.224.174.71:33647 from job_1507669571623_4416 sending #8
> 2017-10-20 13:43:30,162 DEBUG [IPC Client (13419636) connection to /
> 10.224.174.71:33647 from job_1507669571623_4416]
> org.apache.hadoop.ipc.Client: IPC Client (13419636) connection to /
> 10.224.174.71:33647 from job_1507669571623_4416 got value #8
> 2017-10-20 13:43:30,162 DEBUG [communication thread]
> org.apache.hadoop.ipc.RPC: Call: statusUpdate 1
> 2017-10-20 13:43:33,163 DEBUG [IPC Parameter Sending Thread #0]
> org.apache.hadoop.ipc.Client: IPC Client (13419636) connection to /
> 10.224.174.71:33647 from job_1507669571623_4416 sending #9
> 2017-10-20 13:43:33,163 DEBUG [IPC Client (13419636) connection to /
> 10.224.174.71:33647 from job_1507669571623_4416]
> org.apache.hadoop.ipc.Client: IPC Client (13419636) connection to /
> 10.224.174.71:33647 from job_1507669571623_4416 got value #9
> 2017-10-20 13:43:33,163 DEBUG [communication thread]
> org.apache.hadoop.ipc.RPC: Call: ping 1
> 2017-10-20 13:43:36,164 DEBUG [IPC Parameter Sending Thread #0]
> org.apache.hadoop.ipc.Client: IPC Client (13419636) connection to /
> 10.224.174.71:33647 from job_1507669571623_4416 sending #10
> 2017-10-20 13:43:36,165 DEBUG [IPC Client (13419636) connection to /
> 10.224.174.71:33647 from job_1507669571623_4416]
> org.apache.hadoop.ipc.Client: IPC Client (13419636) connection to /
> 10.224.174.71:33647 from job_1507669571623_4416 got value #10
> 2017-10-20 13:43:36,165 DEBUG [communication thread]
> org.apache.hadoop.ipc.RPC: Call: ping 1
> ...
> ...
>
>
> On Thu, Oct 19, 2017 at 11:35 PM, Daniel Bruce <dbse...@gmail.com> wrote:
>
>> Hi Gopal,
>>
>> Thanks for your input! In my case I'm using MapReduce, not Tez. I figured
>> I'd better be more specific and give you more details.
>>
>> For this job there are 298 maps and 74 reduces. All the maps completed
>> quickly, within 1 minute, and 73 of the reduces completed in about 2 minutes.
>>
>> Now there is only 1 reduce task running (forever). Here's a screenshot
>> for the job details: https://ibb.co/eBDj6R
>>
>> I noticed something interesting: MAP_OUTPUT_RECORDS and
>> REDUCE_INPUT_RECORDS don't match for the whole job (99,073,863 vs.
>> 98,105,913).
>>
>> Here's a screenshot for the counters of the dangling reduce task:
>> https://ibb.co/dHgyY6
>>
>> The ratio of REDUCE_INPUT_RECORDS to REDUCE_INPUT_GROUPS is 1. What
>> does that mean?
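For what it's worth, that counter ratio can be read as the average number of values per join key: total reduce input records divided by distinct keys (groups). A small, purely illustrative sketch (not Hive code) of why evenly spread keys keep the ratio near 1 while a hot key drives it up:

```python
# Illustrative only (not Hive code): REDUCE_INPUT_RECORDS /
# REDUCE_INPUT_GROUPS is the average number of values per key,
# so a single hot key inflates the ratio while evenly spread
# keys keep it near 1.
from collections import Counter

def records_to_groups_ratio(keys):
    counts = Counter(keys)          # one entry per distinct key (group)
    records = sum(counts.values())  # total values (input records)
    return records / len(counts)

even = [f"k{i}" for i in range(1000)]   # every key unique
skewed = even + ["hot"] * 9000          # one key carries 9000 extra rows

print(records_to_groups_ratio(even))    # 1.0
print(records_to_groups_ratio(skewed))  # ~10
```

One caveat with this reading: because it's an average, a per-task ratio near 1 can still hide a single huge group when the same task also processes very many singleton keys.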
>>
>> For comparison, here's a screenshot for the counters of a different
>> reduce task which completed within 1 minute. It also has the ratio of 1:
>> https://ibb.co/mzoHRR
>>
>> Another comparison I did is of the task logs. Below, No. 1 is for the
>> dangling reduce task, and Nos. 2 and 3 are for two completed reduce tasks.
>> 1. https://ibb.co/caKVfm
>> 2. https://ibb.co/earJ0m
>> 3. https://ibb.co/edQiY6
>>
>> I don't understand what the running reduce task is doing. Are there any
>> other logs that could be helpful?
>>
>> Regards,
>> Daniel
>>
>> On Thu, Oct 19, 2017 at 9:45 PM, Gopal Vijayaraghavan <gop...@apache.org>
>> wrote:
>>
>>> > . I didn't see data skew for that reducer. It has similar amount of
>>> REDUCE_INPUT_RECORDS as other reducers.
>>> …
>>> > org.apache.hadoop.hive.ql.exec.CommonJoinOperator: table 0 has 8000
>>> rows for join key [4092813312923569]
>>>
>>>
>>> The ratio of REDUCE_INPUT_RECORDS to REDUCE_INPUT_GROUPS is what is
>>> relevant.
>>>
>>>
>>>
>>> The row containers being spilled to disk means that at least 1 key in
>>> the join has > 10000 values.
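To make the spill behaviour concrete, here is a minimal sketch (my own illustration, not Hive's actual RowContainer) of buffering one join key's rows in memory and appending further rows to a temp file once a threshold is crossed:

```python
# Minimal sketch (NOT Hive's RowContainer): buffer rows for one join
# key in memory; once the count exceeds a threshold, append further
# rows to a temp file on disk, mirroring the "RowContainer created
# temp file" log line in spirit.
import os
import tempfile

class SpillingRowBuffer:
    def __init__(self, threshold=10000):
        self.threshold = threshold
        self.in_memory = []
        self.spill_file = None
        self.spilled = 0

    def add(self, row):
        if len(self.in_memory) < self.threshold:
            self.in_memory.append(row)
        else:
            if self.spill_file is None:
                # first spill: create the temp file lazily
                fd, self.spill_file = tempfile.mkstemp(prefix="RowContainer")
                os.close(fd)
            with open(self.spill_file, "a") as f:
                f.write(repr(row) + "\n")
            self.spilled += 1

    def total(self):
        return len(self.in_memory) + self.spilled
```

Once a hot key tips into this mode, every further row for it costs a disk write (and reading the group back costs disk reads), which is why one skewed key can make a reducer vastly slower than its siblings.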
>>>
>>> If you have Tez, this comes up when you run the SkewAnalyzer.
>>>
>>> https://github.com/apache/tez/blob/master/tez-tools/analyzers/job-analyzer/src/main/java/org/apache/tez/analyzer/plugins/SkewAnalyzer.java#L41
>>>
>>>
>>>
>>> Cheers,
>>>
>>> Gopal
>>>
>>
>>
>
