Sean Curtis wrote:
in failed/killed task attempts, i see the following:


attempt_201012141048_0023_m_000000_0   task_201012141048_0023_m_000000   172.24.10.91   FAILED
    Too many fetch-failures
attempt_201012141048_0023_m_000000_1   task_201012141048_0023_m_000000   172.24.10.91   FAILED
    Too many fetch-failures
attempt_201012141048_0023_m_000001_0   task_201012141048_0023_m_000001   172.24.10.91   FAILED
    Too many fetch-failures
attempt_201012141048_0023_m_000001_1   task_201012141048_0023_m_000001   172.24.10.91   FAILED
    Too many fetch-failures
attempt_201012141048_0023_r_000000_0   task_201012141048_0023_r_000000   172.24.10.91   FAILED
    Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.


The value you have in your hadoop-site.xml file for hadoop.tmp.dir will get you into trouble:
/tmp/hadoop/tmp/dir/hadoop-${user.name}

as many systems periodically remove files from /tmp that are older than some interval. Intermediate map outputs live under hadoop.tmp.dir by default (mapred.local.dir defaults to ${hadoop.tmp.dir}/mapred/local), so that cleanup can delete map outputs out from under a running job and produce exactly these fetch failures.
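
Pointing it somewhere persistent avoids that. A minimal sketch for hadoop-site.xml; /var/hadoop/tmp is just an example path, any persistent location the Hadoop user can write to will do:

<property>
  <name>hadoop.tmp.dir</name>
  <!-- example path; use any directory outside /tmp that is writable by the Hadoop user -->
  <value>/var/hadoop/tmp/hadoop-${user.name}</value>
</property>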

Are there any apparently relevant messages in the task tracker logs?
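(On 0.20 those typically live under ${HADOOP_HOME}/logs/, with per-attempt stdout/stderr/syslog under logs/userlogs/<attempt_id>/, unless HADOOP_LOG_DIR points elsewhere.)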

With two nodes and a small number of reducers, the tasktracker.http.threads change is unlikely to be part of the issue.

In general, the shuffle phase simply transfers the sorted map outputs to the reducers and merge-sorts the results.

The errors tend to fall into two types:

1. Failed or blocked transfers.
2. Merge sort failures.

Failed or blocked transfers tend to be due either to too many simultaneous requests to a task tracker, which is governed by tasktracker.http.threads (raising it increases the number of requests that may be serviced; see the sketch below), or to firewall issues that block the actual transfer.
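
A minimal sketch for the task tracker's hadoop-site.xml; 80 is just an illustrative value (the 0.20 default is 40), and the tasktracker must be restarted for the change to take effect:

<property>
  <name>tasktracker.http.threads</name>
  <!-- illustrative value; the default is 40 -->
  <value>80</value>
</property>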

The merge sort failures tend to be either out-of-memory or out-of-disk-space issues.
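
For the disk case, check free space under hadoop.tmp.dir (and mapred.local.dir, if you set it) on each node. For the memory case, the reduce-side merge runs in the child JVM, so raising its heap may help; a sketch, with an illustrative value (the 0.20 default is -Xmx200m):

<property>
  <name>mapred.child.java.opts</name>
  <!-- illustrative heap size; the default is -Xmx200m -->
  <value>-Xmx512m</value>
</property>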

There are a few JIRAs open for shuffle errors in other cases:

http://issues.apache.org/jira/browse/HADOOP-3604
http://issues.apache.org/jira/browse/HADOOP-3155 <- a likely cause; fixed in the Cloudera 0.18.3 distribution
http://issues.apache.org/jira/browse/HADOOP-4115
http://issues.apache.org/jira/browse/HADOOP-3130
http://issues.apache.org/jira/browse/HADOOP-2095


attempt_201012141048_0023_r_000000_1   task_201012141048_0023_r_000000   172.24.10.91   FAILED
    Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
attempt_201012141048_0023_r_000000_2   task_201012141048_0023_r_000000   172.24.10.91   FAILED
    Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
attempt_201012141048_0023_r_000000_3   task_201012141048_0023_r_000000   172.24.10.91   FAILED
    Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.



On Dec 20, 2010, at 11:01 PM, Adarsh Sharma wrote:

Sean Curtis wrote:
Just running a simple select count(1) from a table (using MovieLens as an example) doesn't seem to work for me. Anyone know why this doesn't work? I'm using Hive trunk:

hive> select avg(rating) from movierating where movieid=43;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
 set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
 set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
 set mapred.reduce.tasks=<number>
Starting Job = job_201012141048_0023, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201012141048_0023
Kill Command = /Users/Sean/dev/hadoop-0.20.2+737/bin/../bin/hadoop job -Dmapred.job.tracker=localhost:8021 -kill job_201012141048_0023
2010-12-20 15:15:03,295 Stage-1 map = 0%,  reduce = 0%
2010-12-20 15:15:09,420 Stage-1 map = 50%,  reduce = 0%
... eventually fails after a couple of minutes with:

2010-12-20 17:33:01,113 Stage-1 map = 100%,  reduce = 0%
2010-12-20 17:33:32,182 Stage-1 map = 100%,  reduce = 100%
Ended Job = job_201012141048_0023 with errors
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask
hive>

It almost seems like the reduce task never starts. Any help would be appreciated.

sean
To find the root cause of the problem, go to the JobTracker web UI (IP:50030) and check the Job Tracker History link at the bottom for this job ID.


Best Regards

Adarsh Sharma

