Complete newbie map/reduce question here. I come from a Perl background and am using Hadoop Streaming to prototype/test a process that loads and cleans up ad-server log lines from multiple input files into one large file on HDFS, which can then be used as the source of a Hive table.

I have a Perl map script that reads an input line from stdin, does the needed cleanup/manipulation, and writes the result back to stdout. I don't really need a reduce step: I don't care what order the lines are written in, and there is no summary data to produce. When I run the job with -reducer NONE I get valid output, but it is spread across multiple part-xxxxx files rather than one big file. So I wrote a trivial 'reduce' script that reads from stdin, simply splits off the key, and writes the value back to stdout.
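For reference, the identity reducer is nothing more than this (a minimal sketch of what I mean, assuming the usual tab-separated key/value lines that Streaming feeds the reducer; the file name is just what I happen to call it):

```perl
#!/usr/bin/perl
# reduce_parse_log.pl -- trivial "identity" reducer sketch.
# Streaming hands the reducer lines of the form "key\tvalue"; we drop
# the key and emit only the value, so the output matches the mapper's
# cleaned-up log lines.
use strict;
use warnings;

while ( my $line = <STDIN> ) {
    chomp $line;
    # Split on the first tab only, since the value may itself contain tabs.
    my ( $key, $value ) = split /\t/, $line, 2;
    print "$value\n" if defined $value;
}
```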

I am executing the code as follows:

./hadoop jar ../contrib/streaming/hadoop-0.19.1-streaming.jar -mapper "/usr/bin/perl /home/hadoop/scripts/map_parse_log_r2.pl" -reducer "/usr/bin/perl /home/hadoop/scripts/reduce_parse_log.pl" -input /logs/*.log -output test9

The code I have works when given a small set of input files. However, I get the following error when attempting to run the code on a large set of input files:

hadoop-hadoop-jobtracker-testdw0b00.log.2009-06-09:2009-06-09 15:43:00,905 WARN org.apache.hadoop.mapred.JobInProgress: No room for reduce task. Node tracker_testdw0b00:localhost.localdomain/127.0.0.1:53245 has 2004049920 bytes free; but we expect reduce input to take 22138478392

I assume this is because all the map output is being buffered prior to running the reduce step? If so, what can I change to stop the buffering? I just need the map output to go directly into one large file.

Thanks,
Scott