The difference in the command is where the shell script comes from.  If you use
~/mapper.sh then each task will look for the script in your home directory on the
node it runs on.  On a small cluster with your home directory mounted on all of
the boxes that is not a big deal, but on a large cluster NFS-mounting the
directory on every box can cause a lot of issues.  On a large cluster you should
use the distributed cache to ship the script instead (you are already sending it
through the distributed cache by using the -file option), and then point -mapper
at ./mapper.sh.
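
For reference, the only difference between the two invocations is the -mapper
argument (these are just the forms from the commands earlier in this thread):

-mapper ~/mapper.sh   # runs whatever happens to be at that path on each node
-mapper ./mapper.sh   # runs the copy that -file shipped into the task's working directory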

I am not completely sure why it would be timing out.  Are all of the mappers
timing out, or is it just a single mapper?  One thing you can do is run your
streaming job with cat (the identity mapper) in place of mapper.sh, and then use
the job's output as input to mapper.sh running on your local box.

./hadoop jar ../contrib/streaming/hadoop-0.20.2-streaming.jar -file ~/mapper.sh 
-mapper cat -input ../foo.txt -output output
./hadoop fs -cat output/part-00000 | ~/mapper.sh

#or pick a different part file that corresponds to the mapper task that is 
timing out.
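
One more thought, in case the crawler is just slow rather than truly hung:
streaming treats stderr lines of the form reporter:status:<message> as status
updates, which count as progress, so the task should not get killed while it
keeps reporting.  A rough sketch based on your script (untested):

while read line; do
  # heartbeat: emit a status line every 60 seconds so the framework sees
  # progress even while wget is blocked on a slow site
  ( while true; do echo "reporter:status:fetching $line" >&2; sleep 60; done ) &
  beat=$!
  result="`wget -O - --timeout=500 http://$line 2>&1`"
  kill $beat 2>/dev/null
  echo "$result"
done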

--Bobby Evans

On 10/7/11 1:43 AM, "Aishwarya Venkataraman" <avenk...@cs.ucsd.edu> wrote:

Robert,

My mapper job fails. I am basically trying to run a crawler on Hadoop, and
Hadoop kills the crawler (mapper) if it has not heard from it for a certain
timeout period. But I already have a timeout set in my mapper (500 seconds),
which is less than Hadoop's timeout (900 seconds). The mapper just stalls
for some reason. My mapper code is as follows:

while read line; do
  result="`wget -O - --timeout=500 http://$line 2>&1`"
  echo $result
done
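
One thing I am not sure about: wget retries failed downloads several times by
default, and --timeout applies to each attempt, so perhaps a single bad host is
holding the mapper well past 900 seconds. I was thinking of capping the retries,
something like this (not tried yet):

while read line; do
  result="`wget -O - --tries=1 --timeout=500 http://$line 2>&1`"
  echo $result
done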

Any idea why my mapper is getting stalled?

I don't see the difference between the command you have given and the one I
ran. I am not running in local mode. Is there some way I can get the
intermediate mapper outputs? I would like to see which site the mapper is
stalling on.

Thanks,
Aishwarya

On Thu, Oct 6, 2011 at 1:41 PM, Robert Evans <ev...@yahoo-inc.com> wrote:

> Aishwarya,
>
> Are you running in local mode?  If not you probably want to run
>
> hadoop jar ../contrib/streaming/hadoop-0.20.2-streaming.jar -file
> ~/mapper.sh -mapper ./mapper.sh -input ../foo.txt -output output
>
> You may also want to run hadoop fs -ls output/* to see what files were
> produced.  If your mappers failed for some reason then there will be no
> files in the output directory. And you may want to look at the stderr logs
> for your processes through the web UI.
>
> --Bobby Evans
>
> On 10/6/11 3:30 PM, "Aishwarya Venkataraman" <avenk...@cs.ucsd.edu> wrote:
>
> I ran the following (I am using IdentityReducer):
>
> ./hadoop jar ../contrib/streaming/hadoop-0.20.2-streaming.jar -file
> ~/mapper.sh -mapper ~/mapper.sh -input ../foo.txt -output output
>
> When I do
> ./hadoop dfs -cat output/* I do not see any output on screen. Is this how I
> view the output of the mapper?
>
> Thanks,
> Aishwarya
>
> On Thu, Oct 6, 2011 at 12:37 PM, Robert Evans <ev...@yahoo-inc.com> wrote:
>
> > A streaming job's stderr is logged for the task, but its stdout is what gets
> > sent to the reducer.  The simplest way to see the mapper output is to turn off
> > the reducers and then look at the job's output in HDFS.
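> >
> > For example, something like this (untested, and the jar path and file names
> > are just placeholders) runs with zero reducers so the mapper output is written
> > straight to HDFS where you can cat it:
> >
> > hadoop jar ../contrib/streaming/hadoop-0.20.2-streaming.jar -D mapred.reduce.tasks=0
> > -file ~/mapper.sh -mapper ./mapper.sh -input ../foo.txt -output output
> > hadoop fs -cat output/part-*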
> >
> > --Bobby Evans
> >
> > On 10/6/11 1:16 PM, "Aishwarya Venkataraman" <avenk...@cs.ucsd.edu>
> wrote:
> >
> > Hello,
> >
> > I want to view the mapper output for a given Hadoop streaming job (one that
> > runs a shell script). However, I am not able to find it in any of the log
> > files. Where should I look for this?
> >
> > Thanks,
> > Aishwarya
> >
> >
>
>
> --
> Thanks,
> Aishwarya Venkataraman
> avenk...@cs.ucsd.edu
> Graduate Student | Department of Computer Science
> University of California, San Diego
>
>


--
Thanks,
Aishwarya Venkataraman
avenk...@cs.ucsd.edu
Graduate Student | Department of Computer Science
University of California, San Diego
