Sean Curtis wrote:
In failed/killed task attempts, I see the following:
attempt_201012141048_0023_m_000000_0  (task_201012141048_0023_m_000000, 172.24.10.91)  FAILED: Too many fetch-failures
attempt_201012141048_0023_m_000000_1  (task_201012141048_0023_m_000000, 172.24.10.91)  FAILED: Too many fetch-failures
attempt_201012141048_0023_m_000001_0  (task_201012141048_0023_m_000001, 172.24.10.91)  FAILED: Too many fetch-failures
attempt_201012141048_0023_m_000001_1  (task_201012141048_0023_m_000001, 172.24.10.91)  FAILED: Too many fetch-failures
attempt_201012141048_0023_r_000000_0  (task_201012141048_0023_r_000000, 172.24.10.91)  FAILED: Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
(Each attempt links to its task logs as Last 4KB / Last 8KB / All, e.g.
http://172.24.10.91:50060/tasklog?attemptid=attempt_201012141048_0023_m_000000_0&start=-4097)
The value you have in your hadoop-site.xml for hadoop.tmp.dir will get you into trouble:
/tmp/hadoop/tmp/dir/hadoop-${user.name}
Many systems remove items from /tmp that are older than some time interval, so intermediate data kept there can disappear out from under a running job.
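A minimal sketch of a safer setting in hadoop-site.xml, assuming a durable local directory such as /var/lib/hadoop exists on every node (that path is only an assumption; any local directory outside /tmp that the OS does not clean automatically will do):

  <property>
    <name>hadoop.tmp.dir</name>
    <!-- assumed path; pick any local directory that is not purged by the OS -->
    <value>/var/lib/hadoop/tmp/hadoop-${user.name}</value>
  </property>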
Are there any apparently relevant messages in the task tracker logs?
With two nodes and a small number of reducers, the tasktracker.http.threads change is unlikely to be part of the issue.
In general, the shuffle phase is simply transferring the sorted map outputs to the reducers and merge-sorting the results.
The errors tend to fall into two types: failed or blocked transfers, and merge-sort failures.
Failed or blocked transfers tend to be caused either by too many simultaneous requests to a task tracker, which is governed by tasktracker.http.threads (it sets how many requests can be serviced at once; see the sketch below), or by firewall rules that block the actual transfer.
Merge-sort failures tend to be out-of-memory or out-of-disk-space issues.
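If you do end up needing more serving threads on a larger cluster, a minimal sketch for hadoop-site.xml (the value 400 is only illustrative; the 0.20 default is 40):

  <property>
    <name>tasktracker.http.threads</name>
    <!-- illustrative value; raise only if map-output serving is the bottleneck -->
    <value>400</value>
  </property>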
There are a few JIRAs open for shuffle errors in other cases:
http://issues.apache.org/jira/browse/HADOOP-3604
http://issues.apache.org/jira/browse/HADOOP-3155 <- likely cause, fixed in the Cloudera 0.18.3 distribution
http://issues.apache.org/jira/browse/HADOOP-4115
http://issues.apache.org/jira/browse/HADOOP-3130
http://issues.apache.org/jira/browse/HADOOP-2095
attempt_201012141048_0023_r_000000_1  (task_201012141048_0023_r_000000, 172.24.10.91)  FAILED: Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
attempt_201012141048_0023_r_000000_2  (task_201012141048_0023_r_000000, 172.24.10.91)  FAILED: Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
attempt_201012141048_0023_r_000000_3  (task_201012141048_0023_r_000000, 172.24.10.91)  FAILED: Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
On Dec 20, 2010, at 11:01 PM, Adarsh Sharma wrote:
Sean Curtis wrote:
Just running a simple select count(1) from a table (using MovieLens as an example) doesn't seem to work for me. Anyone know why this doesn't work? I'm using Hive trunk:
hive> select avg(rating) from movierating where movieid=43;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
Starting Job = job_201012141048_0023, Tracking URL =
http://localhost:50030/jobdetails.jsp?jobid=job_201012141048_0023
Kill Command = /Users/Sean/dev/hadoop-0.20.2+737/bin/../bin/hadoop
job -Dmapred.job.tracker=localhost:8021 -kill job_201012141048_0023
2010-12-20 15:15:03,295 Stage-1 map = 0%, reduce = 0%
2010-12-20 15:15:09,420 Stage-1 map = 50%, reduce = 0%
... eventually fails after a couple of minutes with:
2010-12-20 17:33:01,113 Stage-1 map = 100%, reduce = 0%
2010-12-20 17:33:32,182 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201012141048_0023 with errors
FAILED: Execution Error, return code 2 from
org.apache.hadoop.hive.ql.exec.MapRedTask
hive>
It almost seems like the reduce task never starts. Any help would be appreciated.
Sean
To find the root cause of the problem, go to the JobTracker web UI (IP:50030) and check the Job Tracker History at the bottom for this job ID.
Best Regards
Adarsh Sharma