On Dec 13, 2008, at 6:32 PM, Stuart White wrote:
> First question: would Hadoop be an appropriate tool for something
> like this?
Very.
> What is the best way to model this type of work in Hadoop?
As a map-only job with number of reduces = 0.
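A minimal sketch of the map-only setup, assuming the old `org.apache.hadoop.mapred` API of this era (`UpperCaseJob` is a hypothetical driver class, not from the original thread):

```java
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class UpperCaseJob {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(UpperCaseJob.class);
    // Zero reduces makes it a map-only job: map output is written
    // directly by the OutputFormat, with no sort or shuffle.
    conf.setNumReduceTasks(0);
    JobClient.runJob(conf);
  }
}
```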
> I'm thinking my mappers will accept a Long key that represents the
> byte offset into the input file, and a Text value that represents the
> line in the file.
Sure, just use TextInputFormat. You'll want to set the minimum split
size (mapred.min.split.size) to a large number so that you get exactly
one map per input file.
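As a configuration sketch, setting the minimum split size to a very large value keeps each file in a single split (assuming the old `JobConf`-based API):

```java
// Make the minimum split size huge so FileInputFormat never
// splits a file into more than one map task.
conf.setLong("mapred.min.split.size", Long.MAX_VALUE);
conf.setInputFormat(org.apache.hadoop.mapred.TextInputFormat.class);
```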
> I *could* simply uppercase the text lines and write them to an output
> file directly in the mapper (and not use any reducers). So, there's a
> question: is it considered bad practice to write output files directly
> from mappers?
You could do it directly, but I would suggest that using the
TextOutputFormat is easier.
Your map should just do:
collect(null, upperCaseLine);
Assuming that number of reduces is 0, the output of the map goes
straight to the OutputCollector.
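Put together, the mapper might look like the sketch below (old `mapred` API; the class name `UpperCaseMapper` is assumed). Passing a NullWritable key means TextOutputFormat writes the value alone, with no key and no tab separator:

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class UpperCaseMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, NullWritable, Text> {

  private final Text upper = new Text();

  public void map(LongWritable offset, Text line,
                  OutputCollector<NullWritable, Text> output,
                  Reporter reporter) throws IOException {
    upper.set(line.toString().toUpperCase());
    // With zero reduces, this goes straight to the output file.
    output.collect(NullWritable.get(), upper);
  }
}
```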
> Assuming it's advisable in this example to write a file directly in
> the mapper - how should the mapper create a unique output partition
> file name? Is there a way for a mapper to know its index in the total
> # of mappers?
Get mapred.task.partition from the configuration.
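For example, inside the mapper's configure() method you could read the partition number and build a unique name from it (the `part-%05d` pattern here just mirrors Hadoop's usual output naming; it is an illustration, not required):

```java
// JobConf is passed to configure() in the old mapred API.
int partition = conf.getInt("mapred.task.partition", -1);
String fileName = String.format("part-%05d", partition);
```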
> Assuming it's inadvisable to write a file directly in the mapper - I
> can output the records to the reducers using the same key and using
> the uppercased data as the value. Then, in my reducer, should I write
> a file? Or should I collect() the records in the reducers and let
> hadoop write the output?
See above, but with no reduces the data is not sorted. If you pass a
null or NullWritable key to the TextOutputFormat, it will not add the tab.
-- Owen