Thanks for the quick reply. I looked at it, but still could not figure out how to use HDFS to store input data (binary) and call an executable. Please note that I cannot modify the executable.
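To make the question concrete, this is roughly what I have in mind for the map task (only a sketch, not working code; the class name, the prog.exe command line, and the /results/ output directory are placeholders of my own):

// Sketch of the map task I have in mind (steps 1-5 from my earlier mail below).
// Assumptions: the map input value is the HDFS path of one ~33MB binary input
// file, and prog.exe takes a local input path and a local output path.
import java.io.File;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ExecutableMapper extends Mapper<LongWritable, Text, Text, Text> {

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    Configuration conf = context.getConfiguration();
    FileSystem fs = FileSystem.get(conf);

    Path hdfsInput = new Path(value.toString());
    File localInput = new File("input.bin");
    File localOutput = new File("output.bin");

    // Copy the binary input from HDFS to the task's local working directory.
    fs.copyToLocalFile(hdfsInput, new Path(localInput.getAbsolutePath()));

    // Call the (unmodifiable) executable against the local copy.
    Process p = new ProcessBuilder("prog.exe",
        localInput.getAbsolutePath(), localOutput.getAbsolutePath()).start();
    if (p.waitFor() != 0) {
      throw new IOException("prog.exe failed for " + hdfsInput);
    }

    // Push the small (~9KB) result back to HDFS.
    Path hdfsOutput = new Path("/results/" + hdfsInput.getName() + ".bin");
    fs.copyFromLocalFile(new Path(localOutput.getAbsolutePath()), hdfsOutput);

    context.write(new Text(hdfsInput.toString()), new Text(hdfsOutput.toString()));
  }
}

Is something along these lines the intended way to do it, or is there a better-supported mechanism (typed bytes, streaming, etc.) for this kind of binary-in, binary-out job?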
Maybe I am asking a dumb question, but could you please explain a bit about how to handle the scenario I have described?

Thanks,
Jaliya

-----Original Message-----
From: Aaron Kimball [mailto:aa...@cloudera.com]
Sent: Thursday, August 20, 2009 3:00 PM
To: common-dev@hadoop.apache.org
Cc: core-...@hadoop.apache.org; core-u...@hadoop.apache.org; spo...@gmail.com
Subject: Re: Using Hadoop with executables and binary data

Look into "typed bytes": http://dumbotics.com/2009/02/24/hadoop-1722-and-typed-bytes/

On Thu, Aug 20, 2009 at 10:29 AM, Jaliya Ekanayake <jnekanay...@gmail.com> wrote:

> Hi Stefan,
>
> I am sorry for the late reply; somehow the response email slipped past me.
>
> Could you explain a bit about how to use Hadoop Streaming with binary data
> formats? I can see explanations for using it with text data formats, but
> not for binary files.
>
> Thank you,
> Jaliya
>
> Stefan Podkowinski
> Mon, 10 Aug 2009 01:40:05 -0700
>
> Jaliya,
>
> did you consider Hadoop Streaming for your case?
> http://wiki.apache.org/hadoop/HadoopStreaming
>
> On Wed, Jul 29, 2009 at 8:35 AM, Jaliya Ekanayake <jekan...@cs.indiana.edu> wrote:
>
> > Dear Hadoop devs,
> >
> > Please help me to figure out a way to program the following problem
> > using Hadoop.
> >
> > I have a program which I need to invoke in parallel using Hadoop. The
> > program takes an input file (binary) and produces an output file (binary):
> >
> > input.bin -> prog.exe -> output.bin
> >
> > The input data set is about 1 TB in size. Each input data file is about
> > 33 MB in size, so I have about 31,000 files. Each output binary file is
> > about 9 KB in size.
> >
> > I have implemented this program using Hadoop in the following way.
> >
> > I keep the input data in a shared parallel file system (Lustre File
> > System). Then I collect the input file names and write them to a
> > collection of files in HDFS (let's say hdfs_input_0.txt ..). Each
> > hdfs_input file contains roughly an equal number of URIs pointing to
> > the original input files.
> >
> > The map task simply takes a string value, which is a URI to an original
> > input data file, and runs the program as an external executable. The
> > output of the program is also written to the shared file system (Lustre
> > File System).
> >
> > The problem with this approach is that I am not utilizing a key benefit
> > of MapReduce: the use of local disks. Could you please suggest a way to
> > use local disks for the above problem?
> >
> > I thought of the following way, but would like to check with you whether
> > there is a better one.
> >
> > 1. Upload the original data files to HDFS.
> > 2. In the map task, read the data file as a binary object.
> > 3. Save it to the local file system.
> > 4. Call the executable.
> > 5. Push the output from the local file system to HDFS.
> >
> > Any suggestion is greatly appreciated.
> >
> > Thank you,
> > Jaliya
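P.S. In case it helps to see how I am launching things at the moment, the driver that runs the maps over the hdfs_input_*.txt file lists looks roughly like this (again only a sketch; the class names and the /hdfs_input and /job_output paths are placeholders of my own):

// Rough sketch of the job driver.
// Each line of the files under /hdfs_input is the URI of one binary input
// file, so every map() call receives one file to process; no reduce is needed.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ExecutableJob {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "run prog.exe over binary inputs");
    job.setJarByClass(ExecutableJob.class);
    job.setMapperClass(ExecutableMapper.class);  // the mapper sketched above
    job.setNumReduceTasks(0);                    // map-only job
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path("/hdfs_input"));    // hdfs_input_0.txt ..
    FileOutputFormat.setOutputPath(job, new Path("/job_output"));  // per-map status records
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}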