On Dec 13, 2008, at 6:32 PM, Stuart White wrote:
> First question: would Hadoop be an appropriate tool for something
> like this?
Very.
> What is the best way to model this type of work in Hadoop?
As a map-only job with number of reduces = 0.
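A minimal sketch of the map-only setup, assuming the old `org.apache.hadoop.mapred` API of this era (`UpperCaseJob` is a hypothetical driver class, not from the original thread):

```java
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class UpperCaseJob {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(UpperCaseJob.class);
    // Zero reduces makes it a map-only job: map output is written
    // directly by the OutputFormat, with no sort or shuffle.
    conf.setNumReduceTasks(0);
    JobClient.runJob(conf);
  }
}
```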
> I'm thinking my mappers will accept a Long key that represents the
> byte offset into the input file, and a Text value that represents the
> line in the file.
Sure, just use TextInputFormat. You'll want to set the minimum split
size (mapred.min.split.size) to a large number so that you get exactly
one map per input file.
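As a configuration sketch, setting the minimum split size to a very large value keeps each file in a single split (assuming the old `JobConf`-based API):

```java
// Make the minimum split size huge so FileInputFormat never
// splits a file into more than one map task.
conf.setLong("mapred.min.split.size", Long.MAX_VALUE);
conf.setInputFormat(org.apache.hadoop.mapred.TextInputFormat.class);
```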
> I *could* simply uppercase the text lines and write them to an output
> file directly in the mapper (and not use any reducers). So, there's a
> question: is it considered bad practice to write output files directly
> from mappers?
You could do it directly, but I would suggest that using the
TextOutputFormat is easier.
Your map should just do:
collect(null, upperCaseLine);
Assuming that number of reduces is 0, the output of the map goes
straight to the OutputCollector.
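Put together, the mapper might look like the sketch below (old `mapred` API; the class name `UpperCaseMapper` is assumed). Passing a NullWritable key means TextOutputFormat writes the value alone, with no key and no tab separator:

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class UpperCaseMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, NullWritable, Text> {

  private final Text upper = new Text();

  public void map(LongWritable offset, Text line,
                  OutputCollector<NullWritable, Text> output,
                  Reporter reporter) throws IOException {
    upper.set(line.toString().toUpperCase());
    // With zero reduces, this goes straight to the output file.
    output.collect(NullWritable.get(), upper);
  }
}
```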
> Assuming it's advisable in this example to write a file directly in
> the mapper - how should the mapper create a unique output partition
> file name? Is there a way for a mapper to know its index in the total
> # of mappers?
Get mapred.task.partition from the configuration.
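For example, inside the mapper's configure() method you could read the partition number and build a unique name from it (the `part-%05d` pattern here just mirrors Hadoop's usual output naming; it is an illustration, not required):

```java
// JobConf is passed to configure() in the old mapred API.
int partition = conf.getInt("mapred.task.partition", -1);
String fileName = String.format("part-%05d", partition);
```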
> Assuming it's inadvisable to write a file directly in the mapper - I
> can output the records to the reducers using the same key and using
> the uppercased data as the value. Then, in my reducer, should I write
> a file? Or should I collect() the records in the reducers and let
> hadoop write the output?
See above, but with no reduces the data is not sorted. If you pass a
null or NullWritable key to the TextOutputFormat, it will not add the tab.
-- Owen