Hi, Mua:
Your log shows:

2013-04-12 14:00:00,777 WARN org.apache.hadoop.hdfs.DFSClient: Error
Recovery for block blk_-199210310173610155_28360 bad datanode[0]
10.6.25.33:49197
2013-04-12 14:00:00,866 WARN org.apache.hadoop.hdfs.DFSClient: Error
Recovery for block blk_-199210310173610155_28360 in pipeline
10.6.25.33:49197, 10.6.25.141:39369, 10.6.25.31:54563: bad datanode
10.6.25.33:49197

Can you check your datanode 10.6.25.33:49197? You can either
log in to that node and check whether the DataNode daemon is running,
or open the namenode web UI at <namenode_IP>:50070/dfshealth.jsp
or <namenode_IP>:50070/dfsclusterhealth.jsp (both show how many
DataNodes are live and how many are dead).
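If you want to script that check, here is a rough sketch (purely illustrative) of pulling the live/dead counts out of the dfshealth.jsp page text. The exact wording of the page varies by Hadoop version, so the "Live Nodes" / "Dead Nodes" patterns below are an assumption; adjust them to whatever your page actually shows:

```python
import re

def dn_counts(page_text):
    """Pull (live, dead) DataNode counts out of dfshealth.jsp-style text.

    Assumes the page contains phrases like 'Live Nodes : 8' and
    'Dead Nodes : 1'; returns None for any count it cannot find.
    """
    live = re.search(r"Live\s+Nodes\s*:?\s*(\d+)", page_text, re.IGNORECASE)
    dead = re.search(r"Dead\s+Nodes\s*:?\s*(\d+)", page_text, re.IGNORECASE)
    return (int(live.group(1)) if live else None,
            int(dead.group(1)) if dead else None)

if __name__ == "__main__":
    # Real use would first fetch http://<namenode_IP>:50070/dfshealth.jsp
    # (e.g. with urllib); here a canned sample stands in for the page.
    sample = "Cluster Summary ... Live Nodes : 8 ... Dead Nodes : 1"
    print(dn_counts(sample))  # -> (8, 1)
```

If the dead count is nonzero, log in to the node the namenode lists as dead and check the DataNode daemon and its logs.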


You can bump your log level by opening /etc/pig/conf/pig.properties and
setting debug=DEBUG (sorry for the confusion; I didn't mean the log4j
debug level here). With that change, your console output will include
lines like:
2013-04-15 10:46:04,069 [main] DEBUG xxxxxxxxx
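For reference, the edit would look like this in pig.properties (the path and surrounding contents vary by install; the comment line is mine):

```properties
# /etc/pig/conf/pig.properties
# Controls Pig's own console verbosity (not the Hadoop log4j level)
debug=DEBUG
```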

Johnny


On Sun, Apr 14, 2013 at 6:13 AM, Mua Ban <[email protected]> wrote:

> Hi Johnny,
>
> Thank you very much for your email. I am very new here. Please tell me
> where to check the health of the data node (which log file should I look
> at?), and how to set the logging level of log4j to DEBUG.
>
> Thanks,
> -Mua
>
>
> On Fri, Apr 12, 2013 at 5:01 PM, Johnny Zhang <[email protected]>
> wrote:
>
> > Seems like an HDFS issue; as you said, you cannot retrieve certain
> > blocks from certain DNs. Can you check the health of all DNs? And
> > probably also bump the log4j level to DEBUG.
> >
> > Johnny
> >
> >
> > On Fri, Apr 12, 2013 at 12:06 PM, Mua Ban <[email protected]> wrote:
> >
> > > Thank you very much Cheolsoo,
> > >
> > > I am running the script once more right now and I see 7 failed reducers
> > at
> > > the moment on the job tracker GUI. I browse these failed reducers and I
> > > found the task logs. From these 7 failed reducers, some have type 1
> task
> > > log, the rest have type 2 task log as I show below.
> > >
> > > They seem related to some connection issue among nodes in the cluster.
> Do
> > > you know any parameters I should configure to figure out the actual
> > > problem?
> > >
> > > Thank you,
> > > -Mua
> > >
> > > ---------------------------------------
> > > *Type 1 task log*
> > >
> > > 2013-04-12 13:42:24,960 INFO org.apache.hadoop.mapred.ReduceTask:
> > > attempt_201304081613_0049_r_000009_0 Scheduled 5 outputs (0 slow hosts
> > and0
> > > dup hosts)
> > > 2013-04-12 13:42:25,259 INFO org.apache.hadoop.mapred.ReduceTask:
> > > attempt_201304081613_0049_r_000009_0 Scheduled 1 outputs (0 slow hosts
> > and0
> > > dup hosts)
> > > 2013-04-12 13:42:25,271 INFO org.apache.hadoop.mapred.ReduceTask:
> > > Initiating in-memory merge with 610 segments...
> > > 2013-04-12 13:42:25,273 INFO org.apache.hadoop.mapred.Merger: Merging
> 610
> > > sorted segments
> > > 2013-04-12 13:42:25,275 INFO org.apache.hadoop.mapred.Merger: Down to
> the
> > > last merge-pass, with 610 segments left of total size: 96922927 bytes
> > > 2013-04-12 13:42:27,348 INFO org.apache.hadoop.mapred.ReduceTask:
> > > attempt_201304081613_0049_r_000009_0 Merge of the 610 files in-memory
> > > complete. Local file is
> > >
> > >
> >
> /hdfs/sp/filesystem/mapred/local/taskTracker/vul/jobcache/job_201304081613_0049/attempt_201304081613_0049_r_000009_0/output/map_6.out
> > > of size 96921713
> > > 2013-04-12 13:42:27,349 INFO org.apache.hadoop.mapred.ReduceTask:
> > > attempt_201304081613_0049_r_000009_0 Thread waiting: Thread for merging
> > > on-disk files
> > > 2013-04-12 13:42:30,263 INFO org.apache.hadoop.mapred.ReduceTask:
> > > attempt_201304081613_0049_r_000009_0 Scheduled 1 outputs (0 slow hosts
> > and0
> > > dup hosts)
> > > 2013-04-12 13:42:35,267 INFO org.apache.hadoop.mapred.ReduceTask:
> > > attempt_201304081613_0049_r_000009_0 Scheduled 2 outputs (0 slow hosts
> > and0
> > > dup hosts)
> > > 2013-04-12 13:42:38,145 INFO org.apache.hadoop.mapred.ReduceTask:
> > Ignoring
> > > obsolete output of KILLED map-task:
> > 'attempt_201304081613_0049_m_000584_0'
> > > 2013-04-12 13:42:44,150 INFO org.apache.hadoop.mapred.ReduceTask:
> > Ignoring
> > > obsolete output of KILLED map-task:
> > 'attempt_201304081613_0049_m_000557_0'
> > > 2013-04-12 13:42:55,283 INFO org.apache.hadoop.mapred.ReduceTask:
> > > attempt_201304081613_0049_r_000009_0 Scheduled 1 outputs (0 slow hosts
> > and0
> > > dup hosts)
> > > 2013-04-12 13:43:05,164 INFO org.apache.hadoop.mapred.ReduceTask:
> > Ignoring
> > > obsolete output of KILLED map-task:
> > 'attempt_201304081613_0049_m_000604_0'
> > > 2013-04-12 13:43:06,036 INFO org.apache.hadoop.mapred.ReduceTask:
> > > attempt_201304081613_0049_r_000009_0 Scheduled 1 outputs (0 slow hosts
> > and0
> > > dup hosts)
> > > 2013-04-12 13:43:11,169 INFO org.apache.hadoop.mapred.ReduceTask:
> > Ignoring
> > > obsolete output of KILLED map-task:
> > 'attempt_201304081613_0049_m_000597_1'
> > > 2013-04-12 13:43:21,040 INFO org.apache.hadoop.mapred.ReduceTask:
> > > attempt_201304081613_0049_r_000009_0 Need another 5 map output(s)
> where 0
> > > is already in progress
> > > 2013-04-12 13:43:21,040 INFO org.apache.hadoop.mapred.ReduceTask:
> > > attempt_201304081613_0049_r_000009_0 Scheduled 0 outputs (0 slow hosts
> > and0
> > > dup hosts)
> > > 2013-04-12 13:44:21,042 INFO org.apache.hadoop.mapred.ReduceTask:
> > > attempt_201304081613_0049_r_000009_0 Need another 5 map output(s)
> where 0
> > > is already in progress
> > > 2013-04-12 13:44:21,043 INFO org.apache.hadoop.mapred.ReduceTask:
> > > attempt_201304081613_0049_r_000009_0 Scheduled 1 outputs (0 slow hosts
> > and0
> > > dup hosts)
> > > 2013-04-12 13:44:29,222 INFO org.apache.hadoop.mapred.ReduceTask:
> > Ignoring
> > > obsolete output of KILLED map-task:
> > 'attempt_201304081613_0049_m_000576_0'
> > > 2013-04-12 13:45:21,333 INFO org.apache.hadoop.mapred.ReduceTask:
> > > attempt_201304081613_0049_r_000009_0 Need another 4 map output(s)
> where 0
> > > is already in progress
> > > 2013-04-12 13:45:21,333 INFO org.apache.hadoop.mapred.ReduceTask:
> > > attempt_201304081613_0049_r_000009_0 Scheduled 0 outputs (0 slow hosts
> > and0
> > > dup hosts)
> > > 2013-04-12 13:46:01,334 INFO org.apache.hadoop.mapred.ReduceTask:
> > > attempt_201304081613_0049_r_000009_0 Scheduled 1 outputs (0 slow hosts
> > and0
> > > dup hosts)
> > > 2013-04-12 13:46:06,341 INFO org.apache.hadoop.mapred.ReduceTask:
> > > attempt_201304081613_0049_r_000009_0 Scheduled 1 outputs (0 slow hosts
> > and0
> > > dup hosts)
> > > 2013-04-12 13:46:21,350 INFO org.apache.hadoop.mapred.ReduceTask:
> > > attempt_201304081613_0049_r_000009_0 Need another 2 map output(s)
> where 0
> > > is already in progress
> > > 2013-04-12 13:46:21,350 INFO org.apache.hadoop.mapred.ReduceTask:
> > > attempt_201304081613_0049_r_000009_0 Scheduled 0 outputs (0 slow hosts
> > and0
> > > dup hosts)
> > > 2013-04-12 13:46:41,301 INFO org.apache.hadoop.mapred.ReduceTask:
> > Ignoring
> > > obsolete output of KILLED map-task:
> > 'attempt_201304081613_0049_m_000616_1'
> > > 2013-04-12 13:46:41,351 INFO org.apache.hadoop.mapred.ReduceTask:
> > > attempt_201304081613_0049_r_000009_0 Scheduled 2 outputs (0 slow hosts
> > and0
> > > dup hosts)
> > > 2013-04-12 13:46:42,301 INFO org.apache.hadoop.mapred.ReduceTask:
> > > GetMapEventsThread exiting
> > > 2013-04-12 13:46:42,302 INFO org.apache.hadoop.mapred.ReduceTask:
> > > getMapsEventsThread joined.
> > > 2013-04-12 13:46:42,302 INFO org.apache.hadoop.mapred.ReduceTask:
> Closed
> > > ram manager
> > > 2013-04-12 13:46:42,302 INFO org.apache.hadoop.mapred.ReduceTask:
> > > Interleaved on-disk merge complete: 1 files left.
> > > 2013-04-12 13:46:42,302 INFO org.apache.hadoop.mapred.ReduceTask:
> > In-memory
> > > merge complete: 11 files left.
> > > 2013-04-12 13:46:42,303 INFO org.apache.hadoop.mapred.Merger: Merging
> 11
> > > sorted segments
> > > 2013-04-12 13:46:42,303 INFO org.apache.hadoop.mapred.Merger: Down to
> the
> > > last merge-pass, with 11 segments left of total size: 3152550 bytes
> > > 2013-04-12 13:46:42,393 INFO org.apache.hadoop.mapred.ReduceTask:
> Merged
> > 11
> > > segments, 3152550 bytes to disk to satisfy reduce memory limit
> > > 2013-04-12 13:46:42,394 INFO org.apache.hadoop.mapred.ReduceTask:
> > Merging 2
> > > files, 100074247 bytes from disk
> > > 2013-04-12 13:46:42,395 INFO org.apache.hadoop.mapred.ReduceTask:
> > Merging 0
> > > segments, 0 bytes from memory into reduce
> > > 2013-04-12 13:46:42,395 INFO org.apache.hadoop.mapred.Merger: Merging 2
> > > sorted segments
> > > 2013-04-12 13:46:42,398 INFO org.apache.hadoop.mapred.Merger: Down to
> the
> > > last merge-pass, with 2 segments left of total size: 100074239 bytes
> > > 2013-04-12 13:57:45,872 WARN org.apache.hadoop.hdfs.DFSClient:
> > > DFSOutputStream ResponseProcessor exception  for block
> > > blk_-199210310173610155_28360java.net.SocketTimeoutException: 69000
> > millis
> > > timeout while waiting for channel to be ready for read. ch :
> > > java.nio.channels.SocketChannel[connected local=/10.6.25.33:47987
> remote=/
> > > 10.6.25.33:49197]
> > > at
> > >
> > >
> >
> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
> > > at
> > org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
> > > at
> > org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
> > > at java.io.DataInputStream.readFully(DataInputStream.java:189)
> > > at java.io.DataInputStream.readLong(DataInputStream.java:410)
> > > at
> > >
> > >
> >
> org.apache.hadoop.hdfs.protocol.DataTransferProtocol$PipelineAck.readFields(DataTransferProtocol.java:124)
> > > at
> > >
> > >
> >
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2967)
> > >
> > > 2013-04-12 14:00:00,777 WARN org.apache.hadoop.hdfs.DFSClient: Error
> > > Recovery for block blk_-199210310173610155_28360 bad datanode[0]
> > > 10.6.25.33:49197
> > > 2013-04-12 14:00:00,866 WARN org.apache.hadoop.hdfs.DFSClient: Error
> > > Recovery for block blk_-199210310173610155_28360 in pipeline
> > > 10.6.25.33:49197, 10.6.25.141:39369, 10.6.25.31:54563: bad datanode
> > > 10.6.25.33:49197
> > > 2013-04-12 14:04:55,904 INFO org.apache.hadoop.hdfs.DFSClient:
> Exception
> > in
> > > createBlockOutputStream 10.6.25.33:49197 java.io.IOException: Bad
> > connect
> > > ack with firstBadLink as 10.6.25.32:53741
> > > 2013-04-12 14:04:55,904 INFO org.apache.hadoop.hdfs.DFSClient:
> Abandoning
> > > block blk_-8936348770928346278_28365
> > > 2013-04-12 14:04:55,907 INFO org.apache.hadoop.hdfs.DFSClient:
> Excluding
> > > datanode 10.6.25.32:53741
> > > 2013-04-12 14:06:07,789 WARN org.apache.hadoop.hdfs.DFSClient:
> > > DFSOutputStream ResponseProcessor exception  for block
> > > blk_8322203915584568195_28367java.io.IOException: Bad response 1 for
> > block
> > > blk_8322203915584568195_28367 from datanode 10.6.25.31:54563
> > > at
> > >
> > >
> >
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2977)
> > >
> > > 2013-04-12 14:06:25,735 WARN org.apache.hadoop.hdfs.DFSClient: Error
> > > Recovery for block blk_8322203915584568195_28367 bad datanode[2]
> > > 10.6.25.31:54563
> > > 2013-04-12 14:06:25,735 WARN org.apache.hadoop.hdfs.DFSClient: Error
> > > Recovery for block blk_8322203915584568195_28367 in pipeline
> > > 10.6.25.33:49197, 10.6.25.145:48897, 10.6.25.31:54563: bad datanode
> > > 10.6.25.31:54563
> > > 2013-04-12 14:06:45,112 WARN org.apache.hadoop.mapred.Task: Parent
> died.
> > >  Exiting attempt_201304081613_0049_r_000009_0
> > >
> > >
> > > ----------------------------------------------------------
> > >
> > > *Type 2 task log*
> > >
> > >
> > > 14_1'
> > > 2013-04-11 15:24:33,168 INFO org.apache.hadoop.mapred.ReduceTask:
> > Ignoring
> > > obsolete output of KILLED map-task:
> > 'attempt_201304081613_0046_m_000609_1'
> > > 2013-04-11 15:24:39,172 INFO org.apache.hadoop.mapred.ReduceTask:
> > Ignoring
> > > obsolete output of KILLED map-task:
> > 'attempt_201304081613_0046_m_000599_0'
> > > 2013-04-11 15:25:01,179 INFO org.apache.hadoop.mapred.ReduceTask:
> > > attempt_201304081613_0046_r_000010_0 Need another 5 map output(s)
> where 0
> > > is already in progress
> > > 2013-04-11 15:25:01,179 INFO org.apache.hadoop.mapred.ReduceTask:
> > > attempt_201304081613_0046_r_000010_0 Scheduled 1 outputs (0 slow hosts
> > and0
> > > dup hosts)
> > > 2013-04-11 15:25:24,203 INFO org.apache.hadoop.mapred.ReduceTask:
> > Ignoring
> > > obsolete output of KILLED map-task:
> > 'attempt_201304081613_0046_m_000577_0'
> > > 2013-04-11 15:25:51,529 INFO org.apache.hadoop.mapred.ReduceTask:
> > > attempt_201304081613_0046_r_000010_0 Scheduled 1 outputs (0 slow hosts
> > and0
> > > dup hosts)
> > > 2013-04-11 15:26:00,227 INFO org.apache.hadoop.mapred.ReduceTask:
> > Ignoring
> > > obsolete output of KILLED map-task:
> > 'attempt_201304081613_0046_m_000556_1'
> > > 2013-04-11 15:26:01,558 INFO org.apache.hadoop.mapred.ReduceTask:
> > > attempt_201304081613_0046_r_000010_0 Need another 3 map output(s)
> where 0
> > > is already in progress
> > > 2013-04-11 15:26:01,558 INFO org.apache.hadoop.mapred.ReduceTask:
> > > attempt_201304081613_0046_r_000010_0 Scheduled 2 outputs (0 slow hosts
> > and0
> > > dup hosts)
> > > 2013-04-11 15:26:06,235 INFO org.apache.hadoop.mapred.ReduceTask:
> > Ignoring
> > > obsolete output of KILLED map-task:
> > 'attempt_201304081613_0046_m_000576_0'
> > > 2013-04-11 15:26:06,235 INFO org.apache.hadoop.mapred.ReduceTask:
> > Ignoring
> > > obsolete output of KILLED map-task:
> > 'attempt_201304081613_0046_m_000560_1'
> > > 2013-04-11 15:26:06,603 INFO org.apache.hadoop.mapred.ReduceTask:
> > > attempt_201304081613_0046_r_000010_0 Scheduled 1 outputs (0 slow hosts
> > and0
> > > dup hosts)
> > > 2013-04-11 15:26:07,236 INFO org.apache.hadoop.mapred.ReduceTask:
> > > GetMapEventsThread exiting
> > > 2013-04-11 15:26:07,236 INFO org.apache.hadoop.mapred.ReduceTask:
> > > getMapsEventsThread joined.
> > > 2013-04-11 15:26:07,236 INFO org.apache.hadoop.mapred.ReduceTask:
> Closed
> > > ram manager
> > > 2013-04-11 15:26:07,236 INFO org.apache.hadoop.mapred.ReduceTask:
> > > Interleaved on-disk merge complete: 1 files left.
> > > 2013-04-11 15:26:07,236 INFO org.apache.hadoop.mapred.ReduceTask:
> > In-memory
> > > merge complete: 109 files left.
> > > 2013-04-11 15:26:07,238 INFO org.apache.hadoop.mapred.Merger: Merging
> 109
> > > sorted segments
> > > 2013-04-11 15:26:07,238 INFO org.apache.hadoop.mapred.Merger: Down to
> the
> > > last merge-pass, with 109 segments left of total size: 23323822 bytes
> > > 2013-04-11 15:26:07,528 INFO org.apache.hadoop.mapred.ReduceTask:
> Merged
> > > 109 segments, 23323822 bytes to disk to satisfy reduce memory limit
> > > 2013-04-11 15:26:07,528 INFO org.apache.hadoop.mapred.ReduceTask:
> > Merging 2
> > > files, 120503030 bytes from disk
> > > 2013-04-11 15:26:07,529 INFO org.apache.hadoop.mapred.ReduceTask:
> > Merging 0
> > > segments, 0 bytes from memory into reduce
> > > 2013-04-11 15:26:07,529 INFO org.apache.hadoop.mapred.Merger: Merging 2
> > > sorted segments
> > > 2013-04-11 15:26:07,531 INFO org.apache.hadoop.mapred.Merger: Down to
> the
> > > last merge-pass, with 2 segments left of total size: 120503022 bytes
> > > 2013-04-11 15:28:34,121 INFO org.apache.hadoop.hdfs.DFSClient:
> Exception
> > in
> > > createBlockOutputStream 10.6.25.144:40971 java.io.IOException: Bad
> > connect
> > > ack with firstBadLink as 10.6.25.32:53741
> > > 2013-04-11 15:28:34,121 INFO org.apache.hadoop.hdfs.DFSClient:
> Abandoning
> > > block blk_423162934996020555_28131
> > > 2013-04-11 15:28:34,123 INFO org.apache.hadoop.hdfs.DFSClient:
> Excluding
> > > datanode 10.6.25.32:53741
> > > 2013-04-11 15:58:06,150 INFO org.apache.hadoop.hdfs.DFSClient:
> Exception
> > in
> > > createBlockOutputStream 10.6.25.144:40971 java.io.IOException: Bad
> > connect
> > > ack with firstBadLink as 10.6.25.31:54563
> > > 2013-04-11 15:58:06,150 INFO org.apache.hadoop.hdfs.DFSClient:
> Abandoning
> > > block blk_-7269648817395135125_28148
> > > 2013-04-11 15:58:06,152 INFO org.apache.hadoop.hdfs.DFSClient:
> Excluding
> > > datanode 10.6.25.31:54563
> > > 2013-04-11 16:06:10,175 WARN org.apache.hadoop.mapred.Task: Parent
> died.
> > >  Exiting attempt_201304081613_0046_r_000010_0
> > >
> > >
> > >
> > > On Fri, Apr 12, 2013 at 2:25 PM, Cheolsoo Park <[email protected]>
> > > wrote:
> > >
> > > > Hi Mua,
> > > >
> > > > I guess you misunderstood me. The pig_****.log file is not a task
> log.
> > > >
> > > > You should look for task logs on data nodes where your task tracker
> > ran.
> > > > Here is some explanation regarding various log files in Hadoop and
> > where
> > > to
> > > > find them:
> > > >
> > > >
> > > >
> > >
> >
> http://blog.cloudera.com/blog/2009/09/apache-hadoop-log-files-where-to-find-them-in-cdh-and-what-info-they-contain/
> > > >
> > > > Thanks,
> > > > Cheolsoo
> > > >
> > > >
> > > > On Fri, Apr 12, 2013 at 10:27 AM, Mua Ban <[email protected]>
> > wrote:
> > > >
> > > > > Thank you very much for your reply.
> > > > >
> > > > > Below is the stack log in the pig_****.log file
> > > > >
> > > > > Can you please give me some suggestions?
> > > > >
> > > > > -Mua
> > > > > ------------------
> > > > > Backend error message
> > > > > ---------------------
> > > > > Task attempt_201304081613_0048_r_000001_0 failed to report status
> for
> > > 601
> > > > > seconds. Killing!
> > > > >
> > > > > Pig Stack Trace
> > > > > ---------------
> > > > > ERROR 2997: Unable to recreate exception from backed error: Task
> > > > > attempt_201304081613_0048_r_000001_0 failed to report status for
> 601
> > > > > seconds. Killing!
> > > > >
> > > > > org.apache.pig.backend.executionengine.ExecException: ERROR 2997:
> > > Unable
> > > > to
> > > > > recreate exception from backed error: Task
> > > > > attempt_201304081613_0048_r_000001_0 failed to report status for
> 601
> > > > > seconds. Killing!
> > > > >         at
> > > > >
> > > > >
> > > >
> > >
> >
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher.getErrorMessages(Launcher.java:217)
> > > > >         at
> > > > >
> > > > >
> > > >
> > >
> >
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher.getStats(Launcher.java:152)
> > > > >         at
> > > > >
> > > > >
> > > >
> > >
> >
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:383)
> > > > >         at org.apache.pig.PigServer.launchPlan(PigServer.java:1270)
> > > > >         at
> > > > >
> > >
> org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1255)
> > > > >         at org.apache.pig.PigServer.execute(PigServer.java:1245)
> > > > >         at
> org.apache.pig.PigServer.executeBatch(PigServer.java:362)
> > > > >         at
> > > > >
> > >
> org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:132)
> > > > >         at
> > > > >
> > > > >
> > > >
> > >
> >
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:193)
> > > > >         at
> > > > >
> > > > >
> > > >
> > >
> >
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:165)
> > > > >         at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:84)
> > > > >         at org.apache.pig.Main.run(Main.java:555)
> > > > >         at org.apache.pig.Main.main(Main.java:111)
> > > > >         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native
> > Method)
> > > > >         at
> > > > >
> > > > >
> > > >
> > >
> >
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:60)
> > > > >         at
> > > > >
> > > > >
> > > >
> > >
> >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:37)
> > > > >         at java.lang.reflect.Method.invoke(Method.java:611)
> > > > >         at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
> > > > >
> > > > >
> > > >
> > >
> >
> ================================================================================
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On Fri, Apr 12, 2013 at 11:29 AM, Cheolsoo Park <
> > [email protected]
> > > > > >wrote:
> > > > >
> > > > > > Did you look at task logs to see why those tasks failed? Since
> > it's a
> > > > > > back-end error, the console output doesn't tell you much. Task
> logs
> > > > > should
> > > > > > have a stack trace that shows why it failed, and you can go from
> > > there.
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Fri, Apr 12, 2013 at 8:18 AM, Mua Ban <[email protected]>
> > > wrote:
> > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > I am very new to PIG/Hadoop, I just started writing my first
> PIG
> > > > > script a
> > > > > > > couple days ago. I ran into this problem.
> > > > > > >
> > > > > > > My cluster has 9 nodes. I have to join two data sets big and
> > small,
> > > > > each
> > > > > > is
> > > > > > > collected for 4 weeks. I first take two subsets of my data set
> > > (which
> > > > > is
> > > > > > > for the first week of data), let's call them B1 and S1 for big
> > and
> > > > > small
> > > > > > > data sets of the first week. The entire data sets of 4 weeks is
> > B4
> > > > and
> > > > > > S4.
> > > > > > >
> > > > > > > I ran my script on my cluster to join B1 and S1 and everything
> is
> > > > > fine. I
> > > > > > > got my joined data. However, when I ran my script to join B4
> and
> > > S4,
> > > > > the
> > > > > > > script failed. B4 is 39GB, S4 is 300MB. B4 is skewed, some id
> > > appears
> > > > > > more
> > > > > > > frequently than others. I tried both 'using skewed' and 'using
> > > > > > replicated'
> > > > > > > modes for the join operation (by appending them to the end of
> the
> > > > below
> > > > > > > join clause), they both fail.
> > > > > > >
> > > > > > > Here is my script and i think it is very simple:
> > > > > > >
> > > > > > > big = load 'bigdir/' using PigStorage(',') as (id:chararray,
> > > > > > > data:chararray);
> > > > > > > small = load 'smalldir/' using PigStorage(',') as
> > > > > > > (t1:double,t2:double,data:chararray,id:chararray);
> > > > > > > J = JOIN big by id LEFT OUTER, small by id;
> > > > > > > store J into 'outputdir' using PigStorage(',');
> > > > > > >
> > > > > > > On the web ui of the tracker, I see that the job has 40
> reducers
> > (I
> > > > > guess
> > > > > > > since the total data is about 40GB, and each 1GB will need one
> > > > reducer
> > > > > by
> > > > > > > default of PIG and hadoop setting, so this is normal). If I use
> > > > > 'parallel
> > > > > > > 80' in the join operation above, then I see 80 reducers, and
> the
> > > join
> > > > > > > operation still failed.
> > > > > > >
> > > > > > > I checked file  mapred-default.xml and found this:
> > > > > > > <name>mapred.reduce.tasks</name>
> > > > > > >   <value>1</value>
> > > > > > >
> > > > > > > If I set the value of parallel in join operation, it should
> > > overwrite
> > > > > > this,
> > > > > > > right?
> > > > > > >
> > > > > > >
> > > > > > > On the tracker GUI, I see that for different runs, the number
> of
> > > > > > completed
> > > > > > > reducers changes from 4 to 10 (out of 40 total reducers). The
> > > tracker
> > > > > GUI
> > > > > > > shows the reason for the failed reducers: "Task
> > > > > > > attempt_201304081613_0046_r_000006_0 failed to report status
> for
> > > 600
> > > > > > > seconds. Killing!"
> > > > > > >
> > > > > > > *Could you please help?*
> > > > > > > Thank you very much,
> > > > > > > -Mua
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> --------------------------------------------------------------------------------------------------------------
> > > > > > > Here is the error report from the console screen where I ran
> this
> > > > > script:
> > > > > > >
> > > > > > > job_201304081613_0032   616     0       230     12      32
>  0
> > >   0
> > > > > > > 0       big     MAP_ONLY
> > > > > > > job_201304081613_0033   705     1       21      6       6
> > > 234 2
> > > > > > > 34      234             SAMPLER
> > > > > > >
> > > > > > > Failed Jobs:
> > > > > > > JobId   Alias   Feature Message Outputs
> > > > > > > job_201304081613_0034   small   SKEWED_JOIN     Message: Job
> > > failed!
> > > > > > > Error - # of failed Reduce Tasks exceeded allowed limit.
> > > FailedCount:
> > > > > 1.
> > > > > > > LastFailedTask: task_201304081613_0034_r_000012
> > > > > > >
> > > > > > > Input(s):
> > > > > > > Successfully read 364285458 records (39528533645 bytes) from:
> > > > > > > "hdfs://d0521b01:24990/user/abc/big/"
> > > > > > > Failed to read data from
> "hdfs://d0521b01:24990/user/abc/small/"
> > > > > > >
> > > > > > > Output(s):
> > > > > > >
> > > > > > > Counters:
> > > > > > > Total records written : 0
> > > > > > > Total bytes written : 0
> > > > > > > Spillable Memory Manager spill count : 0
> > > > > > > Total bags proactively spilled: 0
> > > > > > > Total records proactively spilled: 0
> > > > > > >
> > > > > > > Job DAG:
> > > > > > > job_201304081613_0032   ->      job_201304081613_0033,
> > > > > > > job_201304081613_0033   ->      job_201304081613_0034,
> > > > > > > job_201304081613_0034   ->      null,
> > > > > > > null
> > > > > > >
> > > > > > >
> > > > > > > 2013-04-10 20:11:23,815 [main] WARN
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>  
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> > > > > > > - Encountered Warning
> > > > > > > REDUCER_COUNT_LOW 1 time(s).
> > > > > > > 2013-04-10 20:11:23,815 [main] INFO
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>  
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> > > > > > > - Some jobs have faile
> > > > > > > d! Stop running all dependent jobs
> > > > > > > 2013-04-10 20:11:23,815 [main] ERROR
> > > > > > org.apache.pig.tools.grunt.GruntParser
> > > > > > > - ERROR 2997: Encountered IOException. java.io.IOException: Er
> > > > > > > ror Recovery for block blk_312487981794332936_26563 failed
> >  because
> > > > > > > recovery from primary datanode 10.6.25.31:54563 failed 6
> times.
> > > >  Pipel
> > > > > > > ine was 10.6.25.31:54563. Aborting...
> > > > > > > Details at logfile:
> > > > > /homes/abc/pig-flatten/scripts/pig_1365627648226.log
> > > > > > > 2013-04-10 20:11:23,818 [main] ERROR
> > > > > > org.apache.pig.tools.grunt.GruntParser
> > > > > > > - ERROR 2244: Job failed, hadoop does not return any error mes
> > > > > > > sage
> > > > > > > Details at logfile:
> > > > > /homes/abc/pig-flatten/scripts/pig_1365627648226.log
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
