I've noticed that the TaskTracker unpacks all jars into
${hadoop.tmp.dir}/mapred/local/taskTracker.
We use a lot of external libraries, which are deployed via the "-libjars"
option; the total number of files after unpacking is about 20 thousand.
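For reference, we submit jobs roughly like this (class and jar names here are
placeholders, and -libjars is only picked up because our driver goes through
ToolRunner/GenericOptionsParser):

  hadoop jar ourjob.jar com.example.OurDriver \
      -libjars lib/first.jar,lib/second.jar \
      /input /output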
After running a number of jobs, tasks start getting killed on timeout
("Task attempt_200901281518_0011_m_000173_2 failed to report status for 601
seconds. Killing!"). All killed tasks are in the "initializing" state. I
looked through the TaskTracker logs and found messages like this one:
Thread 20926 (Thread-10368):
State: BLOCKED
Blocked count: 3611
Waited count: 24
Blocked on java.lang.ref.Reference$Lock@e48ed6
Blocked by 20882 (Thread-10341)
Stack:
java.lang.StringCoding$StringEncoder.encode(StringCoding.java:232)
java.lang.StringCoding.encode(StringCoding.java:272)
java.lang.String.getBytes(String.java:947)
java.io.UnixFileSystem.getBooleanAttributes0(Native Method)
java.io.UnixFileSystem.getBooleanAttributes(UnixFileSystem.java:228)
java.io.File.isDirectory(File.java:754)
org.apache.hadoop.fs.FileUtil.getDU(FileUtil.java:427)
org.apache.hadoop.fs.FileUtil.getDU(FileUtil.java:433)
org.apache.hadoop.fs.FileUtil.getDU(FileUtil.java:433)
org.apache.hadoop.fs.FileUtil.getDU(FileUtil.java:433)
org.apache.hadoop.fs.FileUtil.getDU(FileUtil.java:433)
org.apache.hadoop.fs.FileUtil.getDU(FileUtil.java:433)
org.apache.hadoop.fs.FileUtil.getDU(FileUtil.java:433)
org.apache.hadoop.fs.FileUtil.getDU(FileUtil.java:433)
org.apache.hadoop.fs.FileUtil.getDU(FileUtil.java:433)
org.apache.hadoop.fs.FileUtil.getDU(FileUtil.java:433)
org.apache.hadoop.fs.FileUtil.getDU(FileUtil.java:433)
org.apache.hadoop.fs.FileUtil.getDU(FileUtil.java:433)
org.apache.hadoop.fs.FileUtil.getDU(FileUtil.java:433)
org.apache.hadoop.fs.FileUtil.getDU(FileUtil.java:433)
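The repeated frames are getDU's recursive directory walk. If I read the 0.19
source right, FileUtil.getDU(File) is essentially the following (my
paraphrase, treat it as a sketch rather than a verbatim copy):

  import java.io.File;

  public static long getDU(File dir) {
      if (!dir.exists()) {
          return 0;
      }
      if (!dir.isDirectory()) {
          return dir.length();
      }
      // One isDirectory()/length() stat per entry, recursing into
      // subdirectories: these are the frames repeated in the dump above.
      long size = dir.length();
      for (File child : dir.listFiles()) {
          size += getDU(child);
      }
      return size;
  }

With about 20 thousand unpacked files that is about 20 thousand stat calls
per scan, which is why the threads above sit blocked in it.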
This is exactly the situation described in HADOOP-4780.
As I understand it, the patch adds code that stores a map of directories
along with their DU values, thus reducing the number of calls to getDU().
That should help, but deleting 20000 files still takes far too long: I
manually deleted the archive after 10 jobs had run, and it took over 30
minutes on XFS. That is three times the default task timeout!
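In other words, something along these lines (the names are mine, just to
illustrate the idea, not the actual patch code):

  import java.io.File;
  import java.util.HashMap;
  import java.util.Map;
  import org.apache.hadoop.fs.FileUtil;

  // Illustration only: remember each directory's disk usage so the
  // recursive walk runs only on a cache miss.
  private final Map<File, Long> duCache = new HashMap<File, Long>();

  private long getCachedDU(File dir) {
      Long cached = duCache.get(dir);
      if (cached == null) {
          cached = FileUtil.getDU(dir);  // full recursive scan
          duCache.put(dir, cached);
      }
      return cached;
  }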
Is there a way to prevent the jars from being unpacked? Or at least to avoid
keeping the archive around? Or is there some other, better way to solve this
problem?
Hadoop version: 0.19.0.
--
Andrew Gudkov
PGP key id: CB9F07D8 (cryptonomicon.mit.edu)
Jabber: [email protected]