Also, it's the easiest way to serialize/deserialize (SerDe) complex types, and you get splitting plus block compression out of the box: SequenceFiles are splittable and can be compressed by default. Look at the code; it moves really complex structures between jobs.
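The "binary SerDe plus compression" point can be illustrated without Hadoop at all. The sketch below is not the SequenceFile format itself, just a minimal stdlib analogy (class and method names are mine): typed values are written as fixed-width binary through a compressed stream and read back with no text parsing, which is the property a text format would lose.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class BinaryRoundTrip {
    // Serialize a vector as compressed binary, then read it back.
    // Doubles stay doubles: 8 bytes each, no string parsing on read.
    static double[] roundTrip(double[] vec) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(new GZIPOutputStream(buf))) {
            out.writeInt(vec.length);                // length header
            for (double d : vec) out.writeDouble(d); // typed binary payload
        }
        try (DataInputStream in = new DataInputStream(
                new GZIPInputStream(new ByteArrayInputStream(buf.toByteArray())))) {
            double[] result = new double[in.readInt()];
            for (int i = 0; i < result.length; i++) result[i] = in.readDouble();
            return result;
        }
    }

    public static void main(String[] args) throws IOException {
        double[] v = {1.0, 2.0, 3.0, 4.0};
        System.out.println(java.util.Arrays.equals(v, roundTrip(v)));
    }
}
```

A real SequenceFile adds what this toy cannot: sync markers so a compressed file can still be split across mappers.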
2014-11-10 3:06 GMT+03:00 Bertrand Dechoux <[email protected]>:

> SequenceFile is/was also the standard for binary data on Hadoop. The
> question is rather: what else would you expect? Surely not a text format?
>
> Bertrand
>
> On Fri, Nov 7, 2014 at 3:51 AM, Lee S <[email protected]> wrote:
>
> > Any other reasons, or can you give a thorough analysis?
> >
> > 2014-11-05 11:00 GMT+08:00 Ted Dunning <[email protected]>:
> >
> > > Yes, type conversion is a reason.
> > >
> > > Sent from my iPhone
> > >
> > > > On Nov 4, 2014, at 18:59, Lee S <[email protected]> wrote:
> > > >
> > > > e.g. kmeans input:
> > > > 1,2,3,4  // text file
> > > > kmeans output:
> > > > point1, point2, point3  (text file of center points)
> > > >
> > > > I just thought of one reason. The input data should be stored in
> > > > vector (dense or sparse) format, so a conversion step
> > > > needs to be done before the algorithms can work on the data. Is that right?
> > > >
> > > > 2014-11-04 23:56 GMT+08:00 Ted Dunning <[email protected]>:
> > > >
> > > >> What should the input be?
> > > >>
> > > >>> On Tue, Nov 4, 2014 at 12:28 AM, Lee S <[email protected]> wrote:
> > > >>>
> > > >>> Hi all:
> > > >>> I'm wondering why the input and output of most algorithms like
> > > >>> kmeans and naivebayes are all sequence files. One more step of conversion
> > > >>> needs to be done if we want the algorithm to work, and
> > > >>> I think that step is time consuming, because it's also a mapreduce job.
> > > >>> Is the reason to deal with small files and to compress to save disk
> > > >>> space?
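The conversion step discussed in the thread (turning text lines like `1,2,3,4` into dense vectors before kmeans can run) amounts to a simple parse per record; in Mahout it is done by a separate MapReduce job, but the per-line logic can be sketched like this (class and method names here are hypothetical, not Mahout's API):

```java
import java.util.Arrays;

public class TextToVector {
    // Parse one CSV line such as "1,2,3,4" into a dense double vector.
    // In a real Mahout conversion job this would run in a mapper and the
    // result would be written as a VectorWritable value in a SequenceFile.
    static double[] parseDense(String line) {
        String[] parts = line.split(",");
        double[] v = new double[parts.length];
        for (int i = 0; i < parts.length; i++) {
            v[i] = Double.parseDouble(parts[i].trim());
        }
        return v;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(parseDense("1,2,3,4")));
    }
}
```

Doing this once up front, and keeping the result in a binary SequenceFile, means iterative algorithms like kmeans pay the parsing cost a single time instead of on every iteration.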
