Hadoop streaming is the simplest way to do this, if your program is set up to
take stdin as its input, write its output to stdout, and each record (a "file"
in your case) is a single line of text.
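
For a concrete sense of that contract, a toy stdin-to-stdout filter in C++
might look like this (purely illustrative; substitute your real processing):

#include <iostream>
#include <string>

// Streaming hands the mapper its input split on stdin, one record per
// line, and collects whatever the mapper prints to stdout.
int main() {
    std::string line;
    while (std::getline(std::cin, line)) {
        std::cout << line << '\n';  // replace the echo with real work
    }
    return 0;
}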

You need to be able to have it work with the following shell pipeline:

hadoop fs -cat <input_file> | head -1 | ./myprocess > output.txt

And ideally, what is stored in output.txt is lines of text whose order can be
rearranged without impacting the result. (This is not a requirement unless
you want to use a reduce too, but streaming will still try to parse the
output that way.)

If not, there are tricks you can play to make it work, but they are kind of ugly.
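
Assuming the simple pipeline does work, you can submit the same binary as a
map-only streaming job. Something along these lines (the jar path is what
0.20.2 ships with, so adjust it to your install; <input_dir> and <output_dir>
are placeholders):

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-0.20.2-streaming.jar \
  -input <input_dir> \
  -output <output_dir> \
  -mapper ./myprocess \
  -file ./myprocess \
  -numReduceTasks 0

The -file option ships your binary out to the cluster nodes, and setting
-numReduceTasks 0 makes the job map-only, which sidesteps the reduce-side
ordering concern above.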

--Bobby Evans


On 8/22/11 2:57 PM, "Zhixuan Zhu" <z...@calpont.com> wrote:

Hi All,

I'm using hadoop-0.20.2 to try out some simple tasks. I asked a question
about FileInputFormat a few days ago and got some prompt replies from
this forum, which helped a lot. Thanks again! Now I have another
question. I'm trying to invoke a C++ process from my mapper for each
hdfs file in the input directory to achieve some parallel processing.
But how do I pass the file to the program? I would want to do something
like the following in my mapper:

Process lChldProc = Runtime.getRuntime().exec("myprocess -file $filepath");

How do I pass an HDFS file to an outside process like that? Is
HadoopStreaming the direction I should go?

Thanks very much in advance for any reply.

Best,
Grace
