Hi Johnny,

Thank you very much for your email. I am very new here. Please tell me where to check the health of the data nodes (which log file should I look at?) and how to set the log4j logging level to DEBUG.
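(For anyone following along: on a Hadoop 1.x-era cluster, datanode health can be checked with `hadoop dfsadmin -report`, which lists live/dead datanodes, and in each node's `hadoop-*-datanode-*.log` under the Hadoop logs directory. The log4j level is set in `conf/log4j.properties`; a sketch of what Johnny may mean — the logger name below is assumed from the DataNode class's package, and the daemons must be restarted for it to take effect:)

```properties
# Assumed example, not from this thread: raise DataNode logging to DEBUG
# in conf/log4j.properties (logger name taken from the Java package).
log4j.logger.org.apache.hadoop.hdfs.server.datanode=DEBUG

# Alternatively, the Hadoop start scripts honor the root-logger setting:
# hadoop.root.logger=DEBUG,console
```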
Thanks,
-Mua

On Fri, Apr 12, 2013 at 5:01 PM, Johnny Zhang <[email protected]> wrote:

> Seems like an HDFS issue: as you said, certain blocks cannot be retrieved from certain DNs. Can you check the health of all the DNs? And probably also bump the log4j level to DEBUG.
>
> Johnny
>
> On Fri, Apr 12, 2013 at 12:06 PM, Mua Ban <[email protected]> wrote:
>
> > Thank you very much Cheolsoo,
> >
> > I am running the script once more right now, and I see 7 failed reducers at the moment on the job tracker GUI. I browsed these failed reducers and found the task logs. Of these 7 failed reducers, some have the type 1 task log and the rest have the type 2 task log, as shown below.
> >
> > They seem related to some connection issue among nodes in the cluster. Do you know any parameters I should configure to figure out the actual problem?
> >
> > Thank you,
> > -Mua
> >
> > ---------------------------------------
> > *Type 1 task log*
> >
> > 2013-04-12 13:42:24,960 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201304081613_0049_r_000009_0 Scheduled 5 outputs (0 slow hosts and 0 dup hosts)
> > 2013-04-12 13:42:25,259 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201304081613_0049_r_000009_0 Scheduled 1 outputs (0 slow hosts and 0 dup hosts)
> > 2013-04-12 13:42:25,271 INFO org.apache.hadoop.mapred.ReduceTask: Initiating in-memory merge with 610 segments...
> > 2013-04-12 13:42:25,273 INFO org.apache.hadoop.mapred.Merger: Merging 610 sorted segments
> > 2013-04-12 13:42:25,275 INFO org.apache.hadoop.mapred.Merger: Down to the last merge-pass, with 610 segments left of total size: 96922927 bytes
> > 2013-04-12 13:42:27,348 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201304081613_0049_r_000009_0 Merge of the 610 files in-memory complete. Local file is /hdfs/sp/filesystem/mapred/local/taskTracker/vul/jobcache/job_201304081613_0049/attempt_201304081613_0049_r_000009_0/output/map_6.out of size 96921713
> > 2013-04-12 13:42:27,349 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201304081613_0049_r_000009_0 Thread waiting: Thread for merging on-disk files
> > 2013-04-12 13:42:30,263 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201304081613_0049_r_000009_0 Scheduled 1 outputs (0 slow hosts and 0 dup hosts)
> > 2013-04-12 13:42:35,267 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201304081613_0049_r_000009_0 Scheduled 2 outputs (0 slow hosts and 0 dup hosts)
> > 2013-04-12 13:42:38,145 INFO org.apache.hadoop.mapred.ReduceTask: Ignoring obsolete output of KILLED map-task: 'attempt_201304081613_0049_m_000584_0'
> > 2013-04-12 13:42:44,150 INFO org.apache.hadoop.mapred.ReduceTask: Ignoring obsolete output of KILLED map-task: 'attempt_201304081613_0049_m_000557_0'
> > 2013-04-12 13:42:55,283 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201304081613_0049_r_000009_0 Scheduled 1 outputs (0 slow hosts and 0 dup hosts)
> > 2013-04-12 13:43:05,164 INFO org.apache.hadoop.mapred.ReduceTask: Ignoring obsolete output of KILLED map-task: 'attempt_201304081613_0049_m_000604_0'
> > 2013-04-12 13:43:06,036 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201304081613_0049_r_000009_0 Scheduled 1 outputs (0 slow hosts and 0 dup hosts)
> > 2013-04-12 13:43:11,169 INFO org.apache.hadoop.mapred.ReduceTask: Ignoring obsolete output of KILLED map-task: 'attempt_201304081613_0049_m_000597_1'
> > 2013-04-12 13:43:21,040 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201304081613_0049_r_000009_0 Need another 5 map output(s) where 0 is already in progress
> > 2013-04-12 13:43:21,040 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201304081613_0049_r_000009_0 Scheduled 0 outputs (0 slow hosts and 0 dup hosts)
> > 2013-04-12 13:44:21,042 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201304081613_0049_r_000009_0 Need another 5 map output(s) where 0 is already in progress
> > 2013-04-12 13:44:21,043 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201304081613_0049_r_000009_0 Scheduled 1 outputs (0 slow hosts and 0 dup hosts)
> > 2013-04-12 13:44:29,222 INFO org.apache.hadoop.mapred.ReduceTask: Ignoring obsolete output of KILLED map-task: 'attempt_201304081613_0049_m_000576_0'
> > 2013-04-12 13:45:21,333 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201304081613_0049_r_000009_0 Need another 4 map output(s) where 0 is already in progress
> > 2013-04-12 13:45:21,333 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201304081613_0049_r_000009_0 Scheduled 0 outputs (0 slow hosts and 0 dup hosts)
> > 2013-04-12 13:46:01,334 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201304081613_0049_r_000009_0 Scheduled 1 outputs (0 slow hosts and 0 dup hosts)
> > 2013-04-12 13:46:06,341 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201304081613_0049_r_000009_0 Scheduled 1 outputs (0 slow hosts and 0 dup hosts)
> > 2013-04-12 13:46:21,350 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201304081613_0049_r_000009_0 Need another 2 map output(s) where 0 is already in progress
> > 2013-04-12 13:46:21,350 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201304081613_0049_r_000009_0 Scheduled 0 outputs (0 slow hosts and 0 dup hosts)
> > 2013-04-12 13:46:41,301 INFO org.apache.hadoop.mapred.ReduceTask: Ignoring obsolete output of KILLED map-task: 'attempt_201304081613_0049_m_000616_1'
> > 2013-04-12 13:46:41,351 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201304081613_0049_r_000009_0 Scheduled 2 outputs (0 slow hosts and 0 dup hosts)
> > 2013-04-12 13:46:42,301 INFO org.apache.hadoop.mapred.ReduceTask: GetMapEventsThread exiting
> > 2013-04-12 13:46:42,302 INFO org.apache.hadoop.mapred.ReduceTask: getMapsEventsThread joined.
> > 2013-04-12 13:46:42,302 INFO org.apache.hadoop.mapred.ReduceTask: Closed ram manager
> > 2013-04-12 13:46:42,302 INFO org.apache.hadoop.mapred.ReduceTask: Interleaved on-disk merge complete: 1 files left.
> > 2013-04-12 13:46:42,302 INFO org.apache.hadoop.mapred.ReduceTask: In-memory merge complete: 11 files left.
> > 2013-04-12 13:46:42,303 INFO org.apache.hadoop.mapred.Merger: Merging 11 sorted segments
> > 2013-04-12 13:46:42,303 INFO org.apache.hadoop.mapred.Merger: Down to the last merge-pass, with 11 segments left of total size: 3152550 bytes
> > 2013-04-12 13:46:42,393 INFO org.apache.hadoop.mapred.ReduceTask: Merged 11 segments, 3152550 bytes to disk to satisfy reduce memory limit
> > 2013-04-12 13:46:42,394 INFO org.apache.hadoop.mapred.ReduceTask: Merging 2 files, 100074247 bytes from disk
> > 2013-04-12 13:46:42,395 INFO org.apache.hadoop.mapred.ReduceTask: Merging 0 segments, 0 bytes from memory into reduce
> > 2013-04-12 13:46:42,395 INFO org.apache.hadoop.mapred.Merger: Merging 2 sorted segments
> > 2013-04-12 13:46:42,398 INFO org.apache.hadoop.mapred.Merger: Down to the last merge-pass, with 2 segments left of total size: 100074239 bytes
> > 2013-04-12 13:57:45,872 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block blk_-199210310173610155_28360 java.net.SocketTimeoutException: 69000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.6.25.33:47987 remote=/10.6.25.33:49197]
> >     at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
> >     at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
> >     at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
> >     at java.io.DataInputStream.readFully(DataInputStream.java:189)
> >     at java.io.DataInputStream.readLong(DataInputStream.java:410)
> >     at org.apache.hadoop.hdfs.protocol.DataTransferProtocol$PipelineAck.readFields(DataTransferProtocol.java:124)
> >     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2967)
> >
> > 2013-04-12 14:00:00,777 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_-199210310173610155_28360 bad datanode[0] 10.6.25.33:49197
> > 2013-04-12 14:00:00,866 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_-199210310173610155_28360 in pipeline 10.6.25.33:49197, 10.6.25.141:39369, 10.6.25.31:54563: bad datanode 10.6.25.33:49197
> > 2013-04-12 14:04:55,904 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream 10.6.25.33:49197 java.io.IOException: Bad connect ack with firstBadLink as 10.6.25.32:53741
> > 2013-04-12 14:04:55,904 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-8936348770928346278_28365
> > 2013-04-12 14:04:55,907 INFO org.apache.hadoop.hdfs.DFSClient: Excluding datanode 10.6.25.32:53741
> > 2013-04-12 14:06:07,789 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block blk_8322203915584568195_28367 java.io.IOException: Bad response 1 for block blk_8322203915584568195_28367 from datanode 10.6.25.31:54563
> >     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2977)
> >
> > 2013-04-12 14:06:25,735 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_8322203915584568195_28367 bad datanode[2] 10.6.25.31:54563
> > 2013-04-12 14:06:25,735 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_8322203915584568195_28367 in pipeline 10.6.25.33:49197, 10.6.25.145:48897, 10.6.25.31:54563: bad datanode 10.6.25.31:54563
> > 2013-04-12 14:06:45,112 WARN org.apache.hadoop.mapred.Task: Parent died. Exiting attempt_201304081613_0049_r_000009_0
> >
> > ----------------------------------------------------------
> >
> > *Type 2 task log*
> >
> > 14_1'
> > 2013-04-11 15:24:33,168 INFO org.apache.hadoop.mapred.ReduceTask: Ignoring obsolete output of KILLED map-task: 'attempt_201304081613_0046_m_000609_1'
> > 2013-04-11 15:24:39,172 INFO org.apache.hadoop.mapred.ReduceTask: Ignoring obsolete output of KILLED map-task: 'attempt_201304081613_0046_m_000599_0'
> > 2013-04-11 15:25:01,179 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201304081613_0046_r_000010_0 Need another 5 map output(s) where 0 is already in progress
> > 2013-04-11 15:25:01,179 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201304081613_0046_r_000010_0 Scheduled 1 outputs (0 slow hosts and 0 dup hosts)
> > 2013-04-11 15:25:24,203 INFO org.apache.hadoop.mapred.ReduceTask: Ignoring obsolete output of KILLED map-task: 'attempt_201304081613_0046_m_000577_0'
> > 2013-04-11 15:25:51,529 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201304081613_0046_r_000010_0 Scheduled 1 outputs (0 slow hosts and 0 dup hosts)
> > 2013-04-11 15:26:00,227 INFO org.apache.hadoop.mapred.ReduceTask: Ignoring obsolete output of KILLED map-task: 'attempt_201304081613_0046_m_000556_1'
> > 2013-04-11 15:26:01,558 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201304081613_0046_r_000010_0 Need another 3 map output(s) where 0 is already in progress
> > 2013-04-11 15:26:01,558 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201304081613_0046_r_000010_0 Scheduled 2 outputs (0 slow hosts and 0 dup hosts)
> > 2013-04-11 15:26:06,235 INFO org.apache.hadoop.mapred.ReduceTask: Ignoring obsolete output of KILLED map-task: 'attempt_201304081613_0046_m_000576_0'
> > 2013-04-11 15:26:06,235 INFO org.apache.hadoop.mapred.ReduceTask: Ignoring obsolete output of KILLED map-task: 'attempt_201304081613_0046_m_000560_1'
> > 2013-04-11 15:26:06,603 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201304081613_0046_r_000010_0 Scheduled 1 outputs (0 slow hosts and 0 dup hosts)
> > 2013-04-11 15:26:07,236 INFO org.apache.hadoop.mapred.ReduceTask: GetMapEventsThread exiting
> > 2013-04-11 15:26:07,236 INFO org.apache.hadoop.mapred.ReduceTask: getMapsEventsThread joined.
> > 2013-04-11 15:26:07,236 INFO org.apache.hadoop.mapred.ReduceTask: Closed ram manager
> > 2013-04-11 15:26:07,236 INFO org.apache.hadoop.mapred.ReduceTask: Interleaved on-disk merge complete: 1 files left.
> > 2013-04-11 15:26:07,236 INFO org.apache.hadoop.mapred.ReduceTask: In-memory merge complete: 109 files left.
> > 2013-04-11 15:26:07,238 INFO org.apache.hadoop.mapred.Merger: Merging 109 sorted segments
> > 2013-04-11 15:26:07,238 INFO org.apache.hadoop.mapred.Merger: Down to the last merge-pass, with 109 segments left of total size: 23323822 bytes
> > 2013-04-11 15:26:07,528 INFO org.apache.hadoop.mapred.ReduceTask: Merged 109 segments, 23323822 bytes to disk to satisfy reduce memory limit
> > 2013-04-11 15:26:07,528 INFO org.apache.hadoop.mapred.ReduceTask: Merging 2 files, 120503030 bytes from disk
> > 2013-04-11 15:26:07,529 INFO org.apache.hadoop.mapred.ReduceTask: Merging 0 segments, 0 bytes from memory into reduce
> > 2013-04-11 15:26:07,529 INFO org.apache.hadoop.mapred.Merger: Merging 2 sorted segments
> > 2013-04-11 15:26:07,531 INFO org.apache.hadoop.mapred.Merger: Down to the last merge-pass, with 2 segments left of total size: 120503022 bytes
> > 2013-04-11 15:28:34,121 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream 10.6.25.144:40971 java.io.IOException: Bad connect ack with firstBadLink as 10.6.25.32:53741
> > 2013-04-11 15:28:34,121 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_423162934996020555_28131
> > 2013-04-11 15:28:34,123 INFO org.apache.hadoop.hdfs.DFSClient: Excluding datanode 10.6.25.32:53741
> > 2013-04-11 15:58:06,150 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream 10.6.25.144:40971 java.io.IOException: Bad connect ack with firstBadLink as 10.6.25.31:54563
> > 2013-04-11 15:58:06,150 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-7269648817395135125_28148
> > 2013-04-11 15:58:06,152 INFO org.apache.hadoop.hdfs.DFSClient: Excluding datanode 10.6.25.31:54563
> > 2013-04-11 16:06:10,175 WARN org.apache.hadoop.mapred.Task: Parent died. Exiting attempt_201304081613_0046_r_000010_0
> >
> > On Fri, Apr 12, 2013 at 2:25 PM, Cheolsoo Park <[email protected]> wrote:
> >
> > > Hi Mua,
> > >
> > > I guess you misunderstood me. The pig_****.log file is not a task log.
> > >
> > > You should look for task logs on the data nodes where your task trackers ran. Here is some explanation of the various log files in Hadoop and where to find them:
> > >
> > > http://blog.cloudera.com/blog/2009/09/apache-hadoop-log-files-where-to-find-them-in-cdh-and-what-info-they-contain/
> > >
> > > Thanks,
> > > Cheolsoo
> > >
> > > On Fri, Apr 12, 2013 at 10:27 AM, Mua Ban <[email protected]> wrote:
> > >
> > > > Thank you very much for your reply.
> > > >
> > > > Below is the stack trace from the pig_****.log file.
> > > >
> > > > Can you please give me some suggestions?
> > > >
> > > > -Mua
> > > > ------------------
> > > > Backend error message
> > > > ---------------------
> > > > Task attempt_201304081613_0048_r_000001_0 failed to report status for 601 seconds. Killing!
> > > >
> > > > Pig Stack Trace
> > > > ---------------
> > > > ERROR 2997: Unable to recreate exception from backed error: Task attempt_201304081613_0048_r_000001_0 failed to report status for 601 seconds. Killing!
> > > >
> > > > org.apache.pig.backend.executionengine.ExecException: ERROR 2997: Unable to recreate exception from backed error: Task attempt_201304081613_0048_r_000001_0 failed to report status for 601 seconds. Killing!
> > > >     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher.getErrorMessages(Launcher.java:217)
> > > >     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher.getStats(Launcher.java:152)
> > > >     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:383)
> > > >     at org.apache.pig.PigServer.launchPlan(PigServer.java:1270)
> > > >     at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1255)
> > > >     at org.apache.pig.PigServer.execute(PigServer.java:1245)
> > > >     at org.apache.pig.PigServer.executeBatch(PigServer.java:362)
> > > >     at org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:132)
> > > >     at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:193)
> > > >     at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:165)
> > > >     at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:84)
> > > >     at org.apache.pig.Main.run(Main.java:555)
> > > >     at org.apache.pig.Main.main(Main.java:111)
> > > >     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > > >     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:60)
> > > >     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:37)
> > > >     at java.lang.reflect.Method.invoke(Method.java:611)
> > > >     at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
> > > >
> > > > ================================================================================
> > > >
> > > > On Fri, Apr 12, 2013 at 11:29 AM, Cheolsoo Park <[email protected]> wrote:
> > > >
> > > > > Did you look at the task logs to see why those tasks failed? Since it's a back-end error, the console output doesn't tell you much. The task logs should have a stack trace that shows why it failed, and you can go from there.
> > > > >
> > > > > On Fri, Apr 12, 2013 at 8:18 AM, Mua Ban <[email protected]> wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > I am very new to PIG/Hadoop; I just started writing my first PIG script a couple of days ago, and I ran into this problem.
> > > > > >
> > > > > > My cluster has 9 nodes. I have to join two data sets, big and small, each collected over 4 weeks. I first took two subsets of my data (the first week of data); let's call them B1 and S1 for the big and small data sets of the first week. The entire 4-week data sets are B4 and S4.
> > > > > >
> > > > > > I ran my script on my cluster to join B1 and S1 and everything was fine; I got my joined data. However, when I ran my script to join B4 and S4, the script failed. B4 is 39GB, S4 is 300MB. B4 is skewed: some ids appear more frequently than others. I tried both 'using skewed' and 'using replicated' modes for the join operation (by appending them to the end of the join clause below); they both fail.
> > > > > >
> > > > > > Here is my script, and I think it is very simple:
> > > > > >
> > > > > > big = load 'bigdir/' using PigStorage(',') as (id:chararray, data:chararray);
> > > > > > small = load 'smalldir/' using PigStorage(',') as (t1:double, t2:double, data:chararray, id:chararray);
> > > > > > J = JOIN big by id LEFT OUTER, small by id;
> > > > > > store J into 'outputdir' using PigStorage(',');
> > > > > >
> > > > > > On the web UI of the tracker, I see that the job has 40 reducers (I guess this is normal, since the total data is about 40GB and each 1GB needs one reducer under the default Pig and Hadoop settings). If I use 'parallel 80' in the join operation above, then I see 80 reducers, and the join operation still fails.
> > > > > >
> > > > > > I checked the file mapred-default.xml and found this:
> > > > > > <name>mapred.reduce.tasks</name>
> > > > > > <value>1</value>
> > > > > >
> > > > > > If I set the value of parallel in the join operation, it should override this, right?
> > > > > >
> > > > > > On the tracker GUI, I see that for different runs, the number of completed reducers varies from 4 to 10 (out of 40 total reducers). The tracker GUI shows the reason for the failed reducers: "Task attempt_201304081613_0046_r_000006_0 failed to report status for 600 seconds. Killing!"
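(For reference: the 'using skewed' variant the poster describes, with an explicit reducer count, would look like this in Pig Latin. Paths and schemas are taken from the quoted script; the PARALLEL value is illustrative. PARALLEL on an operator does override the mapred.reduce.tasks default for that job's reduce phase, so the poster's understanding is correct.)

```pig
big   = LOAD 'bigdir/'   USING PigStorage(',') AS (id:chararray, data:chararray);
small = LOAD 'smalldir/' USING PigStorage(',') AS (t1:double, t2:double, data:chararray, id:chararray);

-- 'skewed' samples the join key distribution and spreads hot keys
-- across reducers; PARALLEL 80 requests 80 reduce tasks for this job.
J = JOIN big BY id LEFT OUTER, small BY id USING 'skewed' PARALLEL 80;

STORE J INTO 'outputdir' USING PigStorage(',');
```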
> > > > > > *Could you please help?*
> > > > > > Thank you very much,
> > > > > > -Mua
> > > > > >
> > > > > > --------------------------------------------------------------------------------------------------------------
> > > > > > Here is the error report from the console screen where I ran this script:
> > > > > >
> > > > > > job_201304081613_0032  616  0  230  12  32  0  0  0  big  MAP_ONLY
> > > > > > job_201304081613_0033  705  1  21  6  6  234  2  34  234  SAMPLER
> > > > > >
> > > > > > Failed Jobs:
> > > > > > JobId  Alias  Feature  Message  Outputs
> > > > > > job_201304081613_0034  small  SKEWED_JOIN  Message: Job failed! Error - # of failed Reduce Tasks exceeded allowed limit. FailedCount: 1. LastFailedTask: task_201304081613_0034_r_000012
> > > > > >
> > > > > > Input(s):
> > > > > > Successfully read 364285458 records (39528533645 bytes) from: "hdfs://d0521b01:24990/user/abc/big/"
> > > > > > Failed to read data from "hdfs://d0521b01:24990/user/abc/small/"
> > > > > >
> > > > > > Output(s):
> > > > > >
> > > > > > Counters:
> > > > > > Total records written : 0
> > > > > > Total bytes written : 0
> > > > > > Spillable Memory Manager spill count : 0
> > > > > > Total bags proactively spilled: 0
> > > > > > Total records proactively spilled: 0
> > > > > >
> > > > > > Job DAG:
> > > > > > job_201304081613_0032 -> job_201304081613_0033,
> > > > > > job_201304081613_0033 -> job_201304081613_0034,
> > > > > > job_201304081613_0034 -> null,
> > > > > > null
> > > > > >
> > > > > > 2013-04-10 20:11:23,815 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Encountered Warning REDUCER_COUNT_LOW 1 time(s).
> > > > > > 2013-04-10 20:11:23,815 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Some jobs have failed! Stop running all dependent jobs
> > > > > > 2013-04-10 20:11:23,815 [main] ERROR org.apache.pig.tools.grunt.GruntParser - ERROR 2997: Encountered IOException. java.io.IOException: Error Recovery for block blk_312487981794332936_26563 failed because recovery from primary datanode 10.6.25.31:54563 failed 6 times. Pipeline was 10.6.25.31:54563. Aborting...
> > > > > > Details at logfile: /homes/abc/pig-flatten/scripts/pig_1365627648226.log
> > > > > > 2013-04-10 20:11:23,818 [main] ERROR org.apache.pig.tools.grunt.GruntParser - ERROR 2244: Job failed, hadoop does not return any error message
> > > > > > Details at logfile: /homes/abc/pig-flatten/scripts/pig_1365627648226.log
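(A note on the repeated "failed to report status for 600 seconds. Killing!" messages throughout the thread: that 600-second limit is the mapred.task.timeout property, in milliseconds. Raising it only buys time and does not fix the underlying HDFS pipeline failures, but as a diagnostic step one could set it in mapred-site.xml; the value below is illustrative, not a recommendation from this thread:)

```xml
<!-- Illustrative only: allow tasks more time without a progress report
     before the TaskTracker kills them (default 600000 ms = 600 s). -->
<property>
  <name>mapred.task.timeout</name>
  <value>1200000</value>
</property>
```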
