Complete newbie map/reduce question here. I come from a Perl background and am using Hadoop Streaming to prototype/test a process that loads and cleans up ad-server log lines from multiple input files into one large file on HDFS, which can then be used as the source of a Hive table.

I have a Perl map script that reads an input line from stdin, does the needed cleanup/manipulation, and writes the result back to stdout. I don't really need a reduce step: I don't care what order the lines are written in, and there is no summary data to produce. When I run the job with -reducer NONE I get valid output, but it is spread across multiple part-xxxxx files rather than one big file. So I wrote a trivial 'reduce' script that reads from stdin, simply splits off the key, and writes the value back to stdout.
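For reference, the identity reducer is nothing more than this (a minimal sketch of what I mean, assuming the usual tab-separated key/value lines that Streaming feeds the reducer; the file name is just what I happen to call it):

```perl
#!/usr/bin/perl
# reduce_parse_log.pl -- trivial "identity" reducer sketch.
# Streaming hands the reducer lines of the form "key\tvalue"; we drop
# the key and emit only the value, so the output matches the mapper's
# cleaned-up log lines.
use strict;
use warnings;

while ( my $line = <STDIN> ) {
    chomp $line;
    # Split on the first tab only, since the value may itself contain tabs.
    my ( $key, $value ) = split /\t/, $line, 2;
    print "$value\n" if defined $value;
}
```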

I am executing the code as follows:

./hadoop jar ../contrib/streaming/hadoop-0.19.1-streaming.jar -mapper "/usr/bin/perl /home/hadoop/scripts/map_parse_log_r2.pl" -reducer "/usr/bin/perl /home/hadoop/scripts/reduce_parse_log.pl" -input /logs/*.log -output test9

The code I have works when given a small set of input files. However, I get the following error when attempting to run the code on a large set of input files:

hadoop-hadoop-jobtracker-testdw0b00.log.2009-06-09:2009-06-09 15:43:00,905 WARN org.apache.hadoop.mapred.JobInProgress: No room for reduce task. Node tracker_testdw0b00:localhost.localdomain/127.0.0.1:53245 has 2004049920 bytes free; but we expect reduce input to take 22138478392

I assume this is because all the map output is being buffered prior to running the reduce step? If so, what can I change to stop the buffering? I just need the map output to go directly into one large file.

Thanks,
Scott