OK, more updates. Today I ran the query on YARN with DEBUG logging turned
on. Here's what I found in the task log for the dangling task.

I noticed that after Hive creates the RowContainer, there are a lot of
IPC/RPC-related logs (still printing as I write this), about 3 seconds
apart. My guess is that Hive is reading the rows back from disk one row at
a time? One row every 3 seconds? No wonder my job never ends...

Could anyone confirm this? Appreciate it!

2017-10-20 13:43:23,144 INFO [main]
org.apache.hadoop.hive.ql.exec.CommonJoinOperator: table 0 has 1000 rows
for join key [999]
2017-10-20 13:43:23,146 INFO [main]
org.apache.hadoop.hive.ql.exec.CommonJoinOperator: table 0 has 2000 rows
for join key [999]
2017-10-20 13:43:23,153 INFO [main]
org.apache.hadoop.hive.ql.exec.CommonJoinOperator: table 0 has 4000 rows
for join key [999]
2017-10-20 13:43:23,178 INFO [main]
org.apache.hadoop.hive.ql.exec.CommonJoinOperator: table 0 has 8000 rows
for join key [999]
2017-10-20 13:43:23,205 INFO [main]
org.apache.hadoop.hive.ql.exec.CommonJoinOperator: table 0 has 16000 rows
for join key [999]
2017-10-20 13:43:23,222 INFO [main]
org.apache.hadoop.hive.ql.exec.persistence.RowContainer: RowContainer
created temp file
/tmp/hadoop-cloud1/nm-local-dir/usercache/dbruce/appcache/application_1507669571623_4416/container_e26_1507669571623_4416_01_000418/tmp/hive-rowcontainer5891981157220982604/RowContainer8985737720168249326.tmp
2017-10-20 13:43:23,354 INFO [main]
org.apache.hadoop.hive.ql.exec.CommonJoinOperator: table 0 has 32000 rows
for join key [999]
2017-10-20 13:43:23,427 INFO [main]
org.apache.hadoop.hive.ql.exec.CommonJoinOperator: table 0 has 64000 rows
for join key [999]
2017-10-20 13:43:23,597 INFO [main]
org.apache.hadoop.hive.ql.exec.CommonJoinOperator: table 0 has 128000 rows
for join key [999]
2017-10-20 13:43:23,817 DEBUG [main] org.apache.hadoop.util.Progress:
Illegal progress value found, progress is larger than 1. Progress will be
changed to 1
2017-10-20 13:43:23,817 DEBUG [main] org.apache.hadoop.util.Progress:
Illegal progress value found, progress is larger than 1. Progress will be
changed to 1
...
...
2017-10-20 13:43:25,239 DEBUG [main] org.apache.hadoop.util.Progress:
Illegal progress value found, progress is larger than 1. Progress will be
changed to 1
2017-10-20 13:43:25,239 DEBUG [main] org.apache.hadoop.util.Progress:
Illegal progress value found, progress is larger than 1. Progress will be
changed to 1
2017-10-20 13:43:25,248 DEBUG [main]
org.apache.hadoop.io.nativeio.NativeIO: Got UserName dbruce for ID
1573865840 from the native implementation
2017-10-20 13:43:25,249 DEBUG [main]
org.apache.hadoop.io.nativeio.NativeIO: Got GroupName db_finan for ID 10100
from the native implementation
2017-10-20 13:43:25,249 DEBUG [main]
org.apache.hadoop.mapred.FileInputFormat: Time taken to get FileStatuses: 7
2017-10-20 13:43:25,249 INFO [main]
org.apache.hadoop.mapred.FileInputFormat: Total input paths to process : 1
2017-10-20 13:43:25,255 DEBUG [main]
org.apache.hadoop.mapred.FileInputFormat: Total # of splits generated by
getSplits: 1, TimeTaken: 14
2017-10-20 13:43:27,095 DEBUG [communication thread]
org.apache.hadoop.yarn.util.ProcfsBasedProcessTree: [ 7382 7463 ]
2017-10-20 13:43:27,096 DEBUG [IPC Parameter Sending Thread #0]
org.apache.hadoop.ipc.Client: IPC Client (13419636) connection to /
10.224.174.71:33647 from job_1507669571623_4416 sending #7
2017-10-20 13:43:27,097 DEBUG [IPC Client (13419636) connection to /
10.224.174.71:33647 from job_1507669571623_4416]
org.apache.hadoop.ipc.Client: IPC Client (13419636) connection to /
10.224.174.71:33647 from job_1507669571623_4416 got value #7
2017-10-20 13:43:27,097 DEBUG [communication thread]
org.apache.hadoop.ipc.RPC: Call: statusUpdate 2
2017-10-20 13:43:30,161 DEBUG [communication thread]
org.apache.hadoop.yarn.util.ProcfsBasedProcessTree: [ 7382 7463 ]
2017-10-20 13:43:30,161 DEBUG [IPC Parameter Sending Thread #0]
org.apache.hadoop.ipc.Client: IPC Client (13419636) connection to /
10.224.174.71:33647 from job_1507669571623_4416 sending #8
2017-10-20 13:43:30,162 DEBUG [IPC Client (13419636) connection to /
10.224.174.71:33647 from job_1507669571623_4416]
org.apache.hadoop.ipc.Client: IPC Client (13419636) connection to /
10.224.174.71:33647 from job_1507669571623_4416 got value #8
2017-10-20 13:43:30,162 DEBUG [communication thread]
org.apache.hadoop.ipc.RPC: Call: statusUpdate 1
2017-10-20 13:43:33,163 DEBUG [IPC Parameter Sending Thread #0]
org.apache.hadoop.ipc.Client: IPC Client (13419636) connection to /
10.224.174.71:33647 from job_1507669571623_4416 sending #9
2017-10-20 13:43:33,163 DEBUG [IPC Client (13419636) connection to /
10.224.174.71:33647 from job_1507669571623_4416]
org.apache.hadoop.ipc.Client: IPC Client (13419636) connection to /
10.224.174.71:33647 from job_1507669571623_4416 got value #9
2017-10-20 13:43:33,163 DEBUG [communication thread]
org.apache.hadoop.ipc.RPC: Call: ping 1
2017-10-20 13:43:36,164 DEBUG [IPC Parameter Sending Thread #0]
org.apache.hadoop.ipc.Client: IPC Client (13419636) connection to /
10.224.174.71:33647 from job_1507669571623_4416 sending #10
2017-10-20 13:43:36,165 DEBUG [IPC Client (13419636) connection to /
10.224.174.71:33647 from job_1507669571623_4416]
org.apache.hadoop.ipc.Client: IPC Client (13419636) connection to /
10.224.174.71:33647 from job_1507669571623_4416 got value #10
2017-10-20 13:43:36,165 DEBUG [communication thread]
org.apache.hadoop.ipc.RPC: Call: ping 1
...
...
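As a sanity check on the cadence, the timestamps on the RPC "Call:" lines can be diffed with a throwaway script like this (the sample lines are just copied from the excerpt above; nothing Hive-specific is assumed):

```python
import re
from datetime import datetime

# A few RPC "Call:" lines copied verbatim from the task log excerpt above.
log = """\
2017-10-20 13:43:27,097 DEBUG [communication thread] org.apache.hadoop.ipc.RPC: Call: statusUpdate 2
2017-10-20 13:43:30,162 DEBUG [communication thread] org.apache.hadoop.ipc.RPC: Call: statusUpdate 1
2017-10-20 13:43:33,163 DEBUG [communication thread] org.apache.hadoop.ipc.RPC: Call: ping 1
2017-10-20 13:43:36,165 DEBUG [communication thread] org.apache.hadoop.ipc.RPC: Call: ping 1
"""

# Pull the timestamp off each line that contains an RPC call.
ts_pat = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}).*Call:")
stamps = [datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S,%f")
          for line in log.splitlines() if (m := ts_pat.match(line))]

# Seconds between consecutive RPC calls.
gaps = [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]
print(gaps)
```

This only confirms that the statusUpdate/ping calls come about 3 seconds apart; it doesn't show what the task is doing in between, which is the part I'd like someone to confirm.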


On Thu, Oct 19, 2017 at 11:35 PM, Daniel Bruce <dbse...@gmail.com> wrote:

> Hi Gopal,
>
> Thanks for your input! In my case I'm using MapReduce, not Tez. Let me be
> more specific and give you more details.
>
> For this job there are 298 maps and 74 reduces. All the maps completed
> very fast, within 1 minute, and 73 of the 74 reduces completed in about 2
> minutes.
>
> Now there is only 1 reduce task running (forever). Here's a screenshot for
> the job details: https://ibb.co/eBDj6R
>
> I noticed something interesting: MAP_OUTPUT_RECORDS and
> REDUCE_INPUT_RECORDS don't match for the whole job (99,073,863 vs.
> 98,105,913).
>
> Here's a screenshot for the counters of the dangling reduce task:
> https://ibb.co/dHgyY6
>
> The ratio of REDUCE_INPUT_RECORDS to REDUCE_INPUT_GROUPS is 1. What does
> that mean?
>
> For comparison, here's a screenshot for the counters of a different reduce
> task which completed within 1 minute. It also has the ratio of 1:
> https://ibb.co/mzoHRR
>
> Another comparison I did is of the task logs. Below, No. 1 is for the
> dangling reduce task, and Nos. 2 and 3 are for two completed reduce tasks.
> 1. https://ibb.co/caKVfm
> 2. https://ibb.co/earJ0m
> 3. https://ibb.co/edQiY6
>
> I don't understand what the running reduce task is doing. Are there any
> other logs that could be helpful?
>
> Regards,
> Daniel
>
> On Thu, Oct 19, 2017 at 9:45 PM, Gopal Vijayaraghavan <gop...@apache.org>
> wrote:
>
>> > . I didn't see data skew for that reducer. It has similar amount of
>> REDUCE_INPUT_RECORDS as other reducers.
>> …
>> > org.apache.hadoop.hive.ql.exec.CommonJoinOperator: table 0 has 8000
>> rows for join key [4092813312923569]
>>
>>
>> The ratio of REDUCE_INPUT_RECORDS to REDUCE_INPUT_GROUPS is what is
>> relevant.
>>
>>
>>
>> The row containers being spilled to disk means that at least 1 key in the
>> join has > 10000 values.
>>
>> If you have Tez, this comes up when you run the SkewAnalyzer.
>>
>> https://github.com/apache/tez/blob/master/tez-tools/analyzers/job-analyzer/src/main/java/org/apache/tez/analyzer/plugins/SkewAnalyzer.java#L41
>>
>>
>>
>> Cheers,
>>
>> Gopal
>>
>
>
