Did you look at the task logs to see why those tasks failed? Since it's a back-end error, the console output doesn't tell you much; the task logs should have a stack trace that shows why each attempt failed, and you can go from there.
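
If the logs only show attempts being killed after the 600-second timeout you quote below, the reducers may simply be slow or overloaded rather than crashing. In that case the timeout and the default reducer count can be raised from inside the script. A rough sketch only (the values are just examples, and mapred.task.timeout assumes an MR1 cluster like yours):

    -- give long-running tasks 30 minutes before the tracker kills them
    set mapred.task.timeout '1800000';
    -- use 80 reducers for every reduce-side operator,
    -- instead of adding PARALLEL to each one
    set default_parallel 80;

Either PARALLEL on the join or default_parallel takes precedence over the mapred.reduce.tasks value you found in mapred-default.xml for that job.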
On Fri, Apr 12, 2013 at 8:18 AM, Mua Ban <[email protected]> wrote:
> Hi,
>
> I am very new to Pig/Hadoop; I just started writing my first Pig script a
> couple of days ago and ran into this problem.
>
> My cluster has 9 nodes. I have to join two data sets, big and small, each
> collected over 4 weeks. I first take two subsets of my data (the first week
> of data); let's call them B1 and S1 for the big and small data sets of the
> first week. The full 4-week data sets are B4 and S4.
>
> I ran my script on my cluster to join B1 and S1 and everything is fine; I
> got my joined data. However, when I ran my script to join B4 and S4, the
> script failed. B4 is 39GB, S4 is 300MB. B4 is skewed: some ids appear more
> frequently than others. I tried both 'using skewed' and 'using replicated'
> modes for the join (by appending them to the end of the join clause below);
> they both fail.
>
> Here is my script, and I think it is very simple:
>
> big = load 'bigdir/' using PigStorage(',') as (id:chararray, data:chararray);
> small = load 'smalldir/' using PigStorage(',') as
>     (t1:double, t2:double, data:chararray, id:chararray);
> J = JOIN big by id LEFT OUTER, small by id;
> store J into 'outputdir' using PigStorage(',');
>
> On the tracker's web UI, I see that the job has 40 reducers (I guess since
> the total data is about 40GB and each 1GB needs one reducer under the
> default Pig and Hadoop settings, this is normal). If I use 'parallel 80' in
> the join above, then I see 80 reducers, and the join still fails.
>
> I checked the file mapred-default.xml and found this:
> <name>mapred.reduce.tasks</name>
> <value>1</value>
>
> If I set the value of parallel in the join, it should override this, right?
>
> On the tracker GUI, I see that for different runs the number of completed
> reducers varies from 4 to 10 (out of 40 total reducers). The tracker GUI
> shows the reason for the failed reducers: "Task
> attempt_201304081613_0046_r_000006_0 failed to report status for 600
> seconds. Killing!"
>
> Could you please help?
> Thank you very much,
> -Mua
>
> --------------------------------------------------------------------------------------------------------------
> Here is the error report from the console screen where I ran this script:
>
> job_201304081613_0032  616  0  230  12  32  0    0   0        big  MAP_ONLY
> job_201304081613_0033  705  1  21   6   6   234  2   34  234       SAMPLER
>
> Failed Jobs:
> JobId  Alias  Feature  Message  Outputs
> job_201304081613_0034  small  SKEWED_JOIN  Message: Job failed!
> Error - # of failed Reduce Tasks exceeded allowed limit. FailedCount: 1.
> LastFailedTask: task_201304081613_0034_r_000012
>
> Input(s):
> Successfully read 364285458 records (39528533645 bytes) from:
> "hdfs://d0521b01:24990/user/abc/big/"
> Failed to read data from "hdfs://d0521b01:24990/user/abc/small/"
>
> Output(s):
>
> Counters:
> Total records written : 0
> Total bytes written : 0
> Spillable Memory Manager spill count : 0
> Total bags proactively spilled: 0
> Total records proactively spilled: 0
>
> Job DAG:
> job_201304081613_0032 -> job_201304081613_0033,
> job_201304081613_0033 -> job_201304081613_0034,
> job_201304081613_0034 -> null,
> null
>
> 2013-04-10 20:11:23,815 [main] WARN
>   org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
>   - Encountered Warning REDUCER_COUNT_LOW 1 time(s).
> 2013-04-10 20:11:23,815 [main] INFO
>   org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
>   - Some jobs have failed! Stop running all dependent jobs
> 2013-04-10 20:11:23,815 [main] ERROR org.apache.pig.tools.grunt.GruntParser
>   - ERROR 2997: Encountered IOException. java.io.IOException: Error Recovery
>   for block blk_312487981794332936_26563 failed because recovery from primary
>   datanode 10.6.25.31:54563 failed 6 times. Pipeline was 10.6.25.31:54563.
>   Aborting...
> Details at logfile: /homes/abc/pig-flatten/scripts/pig_1365627648226.log
> 2013-04-10 20:11:23,818 [main] ERROR org.apache.pig.tools.grunt.GruntParser
>   - ERROR 2244: Job failed, hadoop does not return any error message
> Details at logfile: /homes/abc/pig-flatten/scripts/pig_1365627648226.log
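
Written out in full, the two join variants you describe appending to the join clause would look roughly like this (just a sketch reusing your aliases, not tested against your data):

    -- skewed join with an explicit reducer count
    J = JOIN big by id LEFT OUTER, small by id USING 'skewed' PARALLEL 80;

    -- fragment-replicate join: every relation after the first
    -- ('small' here) is held in memory on each task
    J = JOIN big by id LEFT OUTER, small by id USING 'replicated';

With 'replicated', the 300MB small relation plus Java object overhead has to fit in each task's heap, which is also worth checking in those task logs (an OutOfMemoryError there would point to this).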
