(I'm quite new to hadoop and map/reduce, so some of these questions
might not make complete sense.)

I want to perform simple data transforms on large datasets, and it
seems Hadoop is an appropriate tool.  As a simple example, let's say I
want to read every line of a text file, uppercase it, and write it
out.

First question: would Hadoop be an appropriate tool for something like this?

What is the best way to model this type of work in Hadoop?

I'm thinking my mappers will accept a LongWritable key that represents
the byte offset into the input file, and a Text value that represents
the line in the file.
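Concretely, here's roughly what I have in mind for the mapper, using
the newer org.apache.hadoop.mapreduce API (the class name and the
NullWritable output key are just my guesses):

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map-only uppercasing: key = byte offset into the file, value = one line.
public class UppercaseMapper
    extends Mapper<LongWritable, Text, NullWritable, Text> {

  private final Text out = new Text();

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    out.set(line.toString().toUpperCase());
    // Emit with a NullWritable key so only the value lands in the output.
    context.write(NullWritable.get(), out);
  }
}
```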

I *could* simply uppercase the text lines and write them to an output
file directly in the mapper (and not use any reducers).  So, there's a
question: is it considered bad practice to write output files directly
from mappers?
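From what I've read, Hadoop supports map-only jobs directly: setting
the number of reduce tasks to zero makes the framework itself write
each mapper's output to its own part-m-NNNNN file, so the mapper never
has to manage files by hand. A driver sketch along those lines
(UppercaseMapper, the job name, and the paths are placeholders):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class UppercaseDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "uppercase");
    job.setJarByClass(UppercaseDriver.class);
    job.setMapperClass(UppercaseMapper.class); // hypothetical mapper class
    job.setNumReduceTasks(0);                  // map-only: no shuffle, no reducers
    job.setOutputKeyClass(NullWritable.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```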

Assuming it's advisable in this example to write a file directly in
the mapper - how should the mapper create a unique output partition
file name?  Is there a way for a mapper to know its index among the
total number of mappers?
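One thing I did find: the task context exposes a task-attempt ID whose
numeric task index could be used to build a unique name (the
part-m-NNNNN pattern below just mimics Hadoop's own naming; treat it
as illustrative):

```java
// Inside Mapper.setup(Context context):
// a TaskAttemptID looks like attempt_<jobid>_m_000003_0, and
// getTaskID().getId() extracts the numeric task index (3 here).
int taskIndex = context.getTaskAttemptID().getTaskID().getId();
String uniqueName = String.format("part-m-%05d", taskIndex);
```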

Assuming it's inadvisable to write a file directly in the mapper - I
can output the records to the reducers using the same key and using
the uppercased data as the value.  Then, in my reducer, should I write
a file?  Or should I collect() the records in the reducers and let
Hadoop write the output?
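In case it helps show what I mean, a pass-through reducer in the newer
API might look like this (with context.write() playing the role of the
old collect(); class and type choices are my assumptions):

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Forwards every already-uppercased value to the framework's output,
// dropping the byte-offset key along the way.
public class PassThroughReducer
    extends Reducer<LongWritable, Text, NullWritable, Text> {

  @Override
  protected void reduce(LongWritable key, Iterable<Text> values,
      Context context) throws IOException, InterruptedException {
    for (Text value : values) {
      context.write(NullWritable.get(), value);
    }
  }
}
```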

If I let Hadoop write the output, is there a way to prevent Hadoop
from writing the key to the output file?  I may want to perform
several transformations, one-after-another, on a set of data, and I
don't want to place a superfluous key at the front of every record for
each pass of the data.
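For example, from what I understand of TextOutputFormat, if the output
key class is NullWritable then the key and the tab separator are
skipped entirely, so each output line is just the bare value, which
would make chained passes clean:

```java
// Driver fragment: with NullWritable keys, TextOutputFormat writes
// each record as the value alone -- no key, no tab separator.
job.setOutputKeyClass(NullWritable.class);
job.setOutputValueClass(Text.class);

// And in the mapper or reducer, emit accordingly:
context.write(NullWritable.get(), value);
```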

I appreciate any feedback anyone has to offer.
