Hi Walter,

On Thu, May 28, 2009 at 6:52 AM, walter steffe <ste...@tiscali.it> wrote:
> Hello
> I am a new user and I would like to use Hadoop Streaming with
> SequenceFiles on both the input and the output side.
>
> - The first difficulty arises from the lack of a simple tool to generate
> a SequenceFile from a set of files in a given directory.
> I would like to have something similar to "tar -cvf file.tar foo/".
> This should also work in the opposite direction, like "tar -xvf file.tar".

There's a tool for turning tar files into sequence files here:
http://stuartsierra.com/2008/04/24/a-million-little-files
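Until something like that ships with Hadoop itself, it's not much code to
write against the Java API. Here's a rough, untested sketch (the choice of
file name as the key and raw bytes as the value is mine, and it reads each
file fully into memory, so it's only suitable for small files) that packs a
local directory into a sequence file:

import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Rough sketch: pack the regular files in a local directory into a
// SequenceFile, keyed by file name, with the raw bytes as the value.
// Usage: DirToSequenceFile <local-dir> <sequence-file>
public class DirToSequenceFile {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, new Path(args[1]), Text.class, BytesWritable.class);
    try {
      for (File file : new File(args[0]).listFiles()) {
        if (!file.isFile()) {
          continue; // skip subdirectories; no recursion in this sketch
        }
        byte[] bytes = new byte[(int) file.length()];
        DataInputStream in = new DataInputStream(new FileInputStream(file));
        try {
          in.readFully(bytes); // whole file in memory: small files only
        } finally {
          in.close();
        }
        writer.append(new Text(file.getName()), new BytesWritable(bytes));
      }
    } finally {
      writer.close();
    }
  }
}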
> - Another important feature that I would like to see is the possibility
> of feeding the mapper's stdin with the whole content of a file (extracted
> from the SequenceFile), disregarding the key.

Have a look at SequenceFileAsTextInputFormat, which will do this for you
(except that the key is still the sequence file's key).

> Using each file as a tar archive, I would like to be able to do:
>
> $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
>   -input "/user/me/inputSequenceFile" \
>   -output "/user/me/outputSequenceFile" \
>   -inputformat SequenceFile \
>   -outputformat SequenceFile \
>   -mapper myscript.sh \
>   -reducer NONE
>
> myscript.sh should work as a filter which takes its input from
> stdin and puts its output on stdout:
>
> tar -x
> "do something in the generated dir and create an outputfile"
> cat outputfile
>
> The output file should (automatically) go into the outputSequenceFile.
>
> I think that this would be a very useful scheme which fits well with
> the MapReduce requirements on one side and with Unix commands on the
> other. It should not be too difficult to implement the tools needed
> for that.

I totally agree - having more tools to better integrate sequence files
and map files with Unix tools would be very handy.

Tom

> Walter
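P.S. Going the other way - the "tar -x" direction - is about the same
amount of code. Another rough, untested sketch, which unpacks a sequence
file written by the program above back into a directory of files:

import java.io.File;
import java.io.FileOutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Rough sketch: unpack a SequenceFile of (file name, file bytes) records
// into a local directory.
// Usage: SequenceFileToDir <sequence-file> <local-dir>
public class SequenceFileToDir {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Reader reader =
        new SequenceFile.Reader(fs, new Path(args[0]), conf);
    try {
      File dir = new File(args[1]);
      dir.mkdirs();
      Text key = new Text();
      BytesWritable value = new BytesWritable();
      while (reader.next(key, value)) {
        FileOutputStream out =
            new FileOutputStream(new File(dir, key.toString()));
        try {
          // get() returns the backing array; only getSize() bytes are valid
          out.write(value.get(), 0, value.getSize());
        } finally {
          out.close();
        }
      }
    } finally {
      reader.close();
    }
  }
}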