I am trying to learn Hadoop, and many questions come to mind as I go, so I will be asking a few here from time to time until I feel completely comfortable with it. Here are some questions now:
1. Is it true that Hadoop should be installed at the same location on all the Linux machines? As I understand it, installing it at the same path on every node is only necessary if I am going to use bin/start-dfs.sh and bin/start-mapred.sh to start the datanodes and tasktrackers on all the slaves; otherwise it is not required. Is that correct? (The first sketch below shows what I mean.)

2. Say a slave goes down (due to a network problem or a power cut) while a word count job is running. When it comes back up, what do I need to do? Is running bin/hadoop-daemon.sh start datanode and bin/hadoop-daemon.sh start tasktracker enough for recovery? Do I have to delete any /tmp/hadoop-hadoop directories before starting? Is it guaranteed that, on startup, any corrupt files in the tmp directory will be discarded and everything will be restored to normal? (The second sketch below shows the commands I have in mind.)

3. Say I have 1 master and 4 slaves, and I start a datanode on two of the slaves and a tasktracker on the other two. I put files into HDFS, which means the files are stored on the first two datanodes. Then I run a word count job, which means it runs on the two tasktrackers. How do those two tasktrackers now get the files to do the word counting? The documentation I was reading says that tasks run on the nodes that hold the data, but in this setup the datanodes and the tasktrackers are separate. So how will the word count job do its work? (The last sketch below shows how I would check where the blocks actually live.)
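For question 1, here is roughly the setup I have in mind; /usr/local/hadoop is just a hypothetical install path, and slave1 through slave4 are made-up hostnames:

    # On the master, with Hadoop unpacked at the same path on every node
    cd /usr/local/hadoop

    # conf/slaves lists one slave hostname per line; the start scripts
    # ssh into each of them and run the same bin/ scripts remotely,
    # which (as I understand it) is why the common path matters
    cat conf/slaves
    # slave1
    # slave2
    # slave3
    # slave4

    bin/start-dfs.sh     # NameNode here, a DataNode on each slave
    bin/start-mapred.sh  # JobTracker here, a TaskTracker on each slave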
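For question 2, this is the recovery sequence I am currently assuming is enough; the dfsadmin and fsck calls at the end are just how I would verify the result:

    # On the slave that came back up:
    bin/hadoop-daemon.sh start datanode
    bin/hadoop-daemon.sh start tasktracker

    # Back on the master, check that HDFS sees the node again and
    # that no blocks are missing or corrupt:
    bin/hadoop dfsadmin -report
    bin/hadoop fsck /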
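For question 3, I assume I can at least see where the blocks of my input ended up by using fsck; /user/hadoop/input.txt is a hypothetical file:

    # Print which datanodes hold each block of the input file,
    # i.e. the placement the scheduler would normally use for data locality
    bin/hadoop fsck /user/hadoop/input.txt -files -blocks -locations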
