Moreover, the IP 10.224.174.71 belongs to a different node, not the one executing the reduce task. Why is that?
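A quick way to confirm which join key is exploding — as the `table 0 has N rows for join key [999]` log lines below suggest — is to count rows per key on each join input. A minimal HiveQL sketch, assuming placeholder table name `t1` and join column `k` (both are illustrative, not from the actual query):

```sql
-- Placeholders: t1 is one join input, k is its join column.
-- A key whose count dwarfs the rest is the one a single reducer
-- ends up processing (and spilling to a RowContainer) on its own.
SELECT k, COUNT(*) AS cnt
FROM t1
GROUP BY k
ORDER BY cnt DESC
LIMIT 20;
```

Running the same query against the other join side shows whether the hot key is heavy on both inputs; if so, the join output for that key is roughly the product of the two counts, which can dwarf REDUCE_INPUT_RECORDS.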
On Fri, Oct 20, 2017 at 3:37 PM, Daniel Bruce <dbse...@gmail.com> wrote:

> OK, more updates. Today I was running the query with Yarn and also turned
> on DEBUG logging. Here's what I found in the task log for the dangling
> task.
>
> I noticed that after the RowContainer has been created by Hive, there are
> a lot of IPC/RPC-related logs (still printing), every 3 seconds apart. My
> guess is Hive is reading back the rows one row at a time from disk? One
> row every 3 seconds? No wonder my job never ends..
>
> Could anyone confirm this? Appreciate it!
>
> 2017-10-20 13:43:23,144 INFO [main] org.apache.hadoop.hive.ql.exec.CommonJoinOperator: table 0 has 1000 rows for join key [999]
> 2017-10-20 13:43:23,146 INFO [main] org.apache.hadoop.hive.ql.exec.CommonJoinOperator: table 0 has 2000 rows for join key [999]
> 2017-10-20 13:43:23,153 INFO [main] org.apache.hadoop.hive.ql.exec.CommonJoinOperator: table 0 has 4000 rows for join key [999]
> 2017-10-20 13:43:23,178 INFO [main] org.apache.hadoop.hive.ql.exec.CommonJoinOperator: table 0 has 8000 rows for join key [999]
> 2017-10-20 13:43:23,205 INFO [main] org.apache.hadoop.hive.ql.exec.CommonJoinOperator: table 0 has 16000 rows for join key [999]
> 2017-10-20 13:43:23,222 INFO [main] org.apache.hadoop.hive.ql.exec.persistence.RowContainer: RowContainer created temp file /tmp/hadoop-cloud1/nm-local-dir/usercache/dbruce/appcache/application_1507669571623_4416/container_e26_1507669571623_4416_01_000418/tmp/hive-rowcontainer5891981157220982604/RowContainer8985737720168249326.tmp
> 2017-10-20 13:43:23,354 INFO [main] org.apache.hadoop.hive.ql.exec.CommonJoinOperator: table 0 has 32000 rows for join key [999]
> 2017-10-20 13:43:23,427 INFO [main] org.apache.hadoop.hive.ql.exec.CommonJoinOperator: table 0 has 64000 rows for join key [999]
> 2017-10-20 13:43:23,597 INFO [main] org.apache.hadoop.hive.ql.exec.CommonJoinOperator: table 0 has 128000 rows for join key [999]
> 2017-10-20 13:43:23,817 DEBUG [main] org.apache.hadoop.util.Progress: Illegal progress value found, progress is larger than 1. Progress will be changed to 1
> 2017-10-20 13:43:23,817 DEBUG [main] org.apache.hadoop.util.Progress: Illegal progress value found, progress is larger than 1. Progress will be changed to 1
> ...
> ...
> 2017-10-20 13:43:25,239 DEBUG [main] org.apache.hadoop.util.Progress: Illegal progress value found, progress is larger than 1. Progress will be changed to 1
> 2017-10-20 13:43:25,239 DEBUG [main] org.apache.hadoop.util.Progress: Illegal progress value found, progress is larger than 1. Progress will be changed to 1
> 2017-10-20 13:43:25,248 DEBUG [main] org.apache.hadoop.io.nativeio.NativeIO: Got UserName dbruce for ID 1573865840 from the native implementation
> 2017-10-20 13:43:25,249 DEBUG [main] org.apache.hadoop.io.nativeio.NativeIO: Got GroupName db_finan for ID 10100 from the native implementation
> 2017-10-20 13:43:25,249 DEBUG [main] org.apache.hadoop.mapred.FileInputFormat: Time taken to get FileStatuses: 7
> 2017-10-20 13:43:25,249 INFO [main] org.apache.hadoop.mapred.FileInputFormat: Total input paths to process : 1
> 2017-10-20 13:43:25,255 DEBUG [main] org.apache.hadoop.mapred.FileInputFormat: Total # of splits generated by getSplits: 1, TimeTaken: 14
> 2017-10-20 13:43:27,095 DEBUG [communication thread] org.apache.hadoop.yarn.util.ProcfsBasedProcessTree: [ 7382 7463 ]
> 2017-10-20 13:43:27,096 DEBUG [IPC Parameter Sending Thread #0] org.apache.hadoop.ipc.Client: IPC Client (13419636) connection to /10.224.174.71:33647 from job_1507669571623_4416 sending #7
> 2017-10-20 13:43:27,097 DEBUG [IPC Client (13419636) connection to /10.224.174.71:33647 from job_1507669571623_4416] org.apache.hadoop.ipc.Client: IPC Client (13419636) connection to /10.224.174.71:33647 from job_1507669571623_4416 got value #7
> 2017-10-20 13:43:27,097 DEBUG [communication thread] org.apache.hadoop.ipc.RPC: Call: statusUpdate 2
> 2017-10-20 13:43:30,161 DEBUG [communication thread] org.apache.hadoop.yarn.util.ProcfsBasedProcessTree: [ 7382 7463 ]
> 2017-10-20 13:43:30,161 DEBUG [IPC Parameter Sending Thread #0] org.apache.hadoop.ipc.Client: IPC Client (13419636) connection to /10.224.174.71:33647 from job_1507669571623_4416 sending #8
> 2017-10-20 13:43:30,162 DEBUG [IPC Client (13419636) connection to /10.224.174.71:33647 from job_1507669571623_4416] org.apache.hadoop.ipc.Client: IPC Client (13419636) connection to /10.224.174.71:33647 from job_1507669571623_4416 got value #8
> 2017-10-20 13:43:30,162 DEBUG [communication thread] org.apache.hadoop.ipc.RPC: Call: statusUpdate 1
> 2017-10-20 13:43:33,163 DEBUG [IPC Parameter Sending Thread #0] org.apache.hadoop.ipc.Client: IPC Client (13419636) connection to /10.224.174.71:33647 from job_1507669571623_4416 sending #9
> 2017-10-20 13:43:33,163 DEBUG [IPC Client (13419636) connection to /10.224.174.71:33647 from job_1507669571623_4416] org.apache.hadoop.ipc.Client: IPC Client (13419636) connection to /10.224.174.71:33647 from job_1507669571623_4416 got value #9
> 2017-10-20 13:43:33,163 DEBUG [communication thread] org.apache.hadoop.ipc.RPC: Call: ping 1
> 2017-10-20 13:43:36,164 DEBUG [IPC Parameter Sending Thread #0] org.apache.hadoop.ipc.Client: IPC Client (13419636) connection to /10.224.174.71:33647 from job_1507669571623_4416 sending #10
> 2017-10-20 13:43:36,165 DEBUG [IPC Client (13419636) connection to /10.224.174.71:33647 from job_1507669571623_4416] org.apache.hadoop.ipc.Client: IPC Client (13419636) connection to /10.224.174.71:33647 from job_1507669571623_4416 got value #10
> 2017-10-20 13:43:36,165 DEBUG [communication thread] org.apache.hadoop.ipc.RPC: Call: ping 1
> ...
> ...
>
> On Thu, Oct 19, 2017 at 11:35 PM, Daniel Bruce <dbse...@gmail.com> wrote:
>
>> Hi Gopal,
>>
>> Thanks for your input! In my case I'm using MapReduce, not Tez. I figured
>> I'd better be more specific so as to provide you more details.
>>
>> For this job there are 298 maps and 74 reduces. All the maps completed
>> really fast, within 1 minute, and 73 of the reduces completed in about
>> 2 minutes.
>>
>> Now there is only 1 reduce task running (forever). Here's a screenshot
>> of the job details: https://ibb.co/eBDj6R
>>
>> I noticed one interesting thing: MAP_OUTPUT_RECORDS and
>> REDUCE_INPUT_RECORDS don't match for the whole job (99,073,863 vs.
>> 98,105,913).
>>
>> Here's a screenshot of the counters for the dangling reduce task:
>> https://ibb.co/dHgyY6
>>
>> The ratio of REDUCE_INPUT_RECORDS to REDUCE_INPUT_GROUPS is 1. What
>> does that mean?
>>
>> For comparison, here's a screenshot of the counters for a different
>> reduce task which completed within 1 minute. It also has a ratio of 1:
>> https://ibb.co/mzoHRR
>>
>> Another comparison I did is on the task logs. Below, No. 1 is for the
>> dangling reduce task, and Nos. 2 and 3 are for two completed reduce tasks.
>> 1. https://ibb.co/caKVfm
>> 2. https://ibb.co/earJ0m
>> 3. https://ibb.co/edQiY6
>>
>> I don't understand what the running reduce task is doing. Any other
>> logs that could be helpful?
>>
>> Regards,
>> Daniel
>>
>> On Thu, Oct 19, 2017 at 9:45 PM, Gopal Vijayaraghavan <gop...@apache.org> wrote:
>>
>>> > I didn't see data skew for that reducer. It has a similar amount of
>>> > REDUCE_INPUT_RECORDS as other reducers.
>>> …
>>> > org.apache.hadoop.hive.ql.exec.CommonJoinOperator: table 0 has 8000
>>> > rows for join key [4092813312923569]
>>>
>>> The ratio of REDUCE_INPUT_RECORDS to REDUCE_INPUT_GROUPS is what is
>>> relevant.
>>>
>>> The row containers being spilled to disk means that at least 1 key in
>>> the join has > 10000 values.
>>>
>>> If you have Tez, this comes up when you run the SkewAnalyzer.
>>>
>>> https://github.com/apache/tez/blob/master/tez-tools/analyzers/job-analyzer/src/main/java/org/apache/tez/analyzer/plugins/SkewAnalyzer.java#L41
>>>
>>> Cheers,
>>> Gopal
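For the MapReduce engine (where Tez's SkewAnalyzer is not available), Hive has a runtime skew-join optimization that redirects oversized keys into a follow-up map-join job instead of funneling them through a single reducer. A hedged sketch of the relevant settings — the property names come from HiveConf, but defaults and exact behavior vary by Hive version, so verify against your release before relying on them:

```sql
-- Enable runtime skew-join handling for common (reduce-side) joins.
SET hive.optimize.skewjoin=true;
-- Rows per join key above which the key is treated as skewed and
-- deferred to a follow-up map-join job (default 100000).
SET hive.skewjoin.key=100000;
-- Parallelism controls for that follow-up map-join job.
SET hive.skewjoin.mapjoin.map.tasks=10000;
SET hive.skewjoin.mapjoin.min.split=33554432;
```

With these enabled, a hot key like [999] that crosses the threshold is written aside and joined by many mappers in parallel, rather than one reducer grinding through a RowContainer spilled to disk.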