Hi all,

I run into task failures when I run several jobs on my 10-node cluster. Before a job fails, I start seeing warnings of the following type:

    WARN mapred.JobClient: Error reading task outputhttp://<machine.domainname>:50060/tasklog?plaintext=true&taskid=attempt_201001221644_0001_r_000001_2&filter=stdout
    INFO mapred.JobClient: Task Id : attempt_201001221644_0001_r_000001_2, Status : FAILED
    java.io.IOException: Task process exit with nonzero status of 1.
        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:418)

After searching the mailing list, I found some information suggesting that this could be due to DNS name resolution failures. It was suggested that one should add the IP addresses to the /etc/hosts file (e.g. 127.0.0.1 machine.domainname) so that the jobtracker and the tasktrackers can locate each other. I did that, but the problem still occurs if I run too many jobs.

I believe I am running into DNS resolution quotas somewhere, because my cluster does not have a local DNS server and contacts the university servers for name resolution. When the problem occurs, restarting the cluster does not help; the last time, it went away after 24 hours (I am assuming the admins replenish the quotas daily).

My questions are:

1) Why does Hadoop look for http://<machine.domainname>:port... instead of http://ipaddress:port, even when I provide IP addresses in /etc/hosts as well as in the conf/slaves file?

2) Has anyone faced similar problems? How did you resolve them?

I understand that the problem is not directly related to Hadoop itself, but to the way Linux does DNS name resolution (and the way things are set up on my end). As far as I can tell, my Hadoop jobs generate enough DNS queries to exhaust the allowed query quota over time. How do I reduce the number of these queries in Hadoop?

Thanks,
Abhishek
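
P.S. For reference, here is roughly what I added to /etc/hosts on each node, following the mailing list suggestion. The IP addresses and hostnames below are placeholders, not my real ones:

    # placeholder entries, one per node, so hostnames resolve without DNS
    127.0.0.1       localhost
    192.168.0.1     master.domainname     master
    192.168.0.2     slave1.domainname     slave1
    192.168.0.3     slave2.domainname     slave2

My conf/slaves file lists the same addresses, one per line (192.168.0.2, 192.168.0.3, and so on, again placeholders).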