Also, fwiw, the use of codecs and the use of SequenceFiles are somewhat orthogonal. You still have to compress the SequenceFile with a codec, be it gzip, bz2 or lzo. SequenceFiles do get you splittability, which you won't get with just Gzip (until we get MAPREDUCE-491) or the hadoop-lzo InputFormats.
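To make the codec-versus-container point concrete, here is a rough, stdlib-only sketch. It is not Hadoop code and not the SequenceFile on-disk format: the class name, the `Container` type, and the offset table are all made up for illustration, with the offset table standing in for the role that sync markers play in a real SequenceFile. The idea it tries to show is that the same gzip codec stops or allows splitting depending on whether the file is one long compressed stream or a series of independently compressed blocks.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Toy illustration only: this is NOT the SequenceFile on-disk format.
class BlockCompressionSketch {

    // Hypothetical container: independently gzipped blocks laid end to end.
    static final class Container {
        final byte[] bytes;
        final int[] offsets; // offsets[i] = start of block i; last entry = total length
        Container(byte[] bytes, int[] offsets) { this.bytes = bytes; this.offsets = offsets; }
    }

    // Compress a byte array as one complete gzip stream.
    static byte[] gzip(byte[] raw) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
                gz.write(raw);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Decompress data[offset, offset + length), which must start at a gzip header.
    static String gunzip(byte[] data, int offset, int length) {
        try (GZIPInputStream gz =
                 new GZIPInputStream(new ByteArrayInputStream(data, offset, length))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = gz.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Group records into blocks and gzip each block separately, remembering where each starts.
    static Container writeBlocks(List<String> records, int recordsPerBlock) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        List<Integer> starts = new ArrayList<>();
        for (int i = 0; i < records.size(); i += recordsPerBlock) {
            starts.add(out.size());
            int end = Math.min(i + recordsPerBlock, records.size());
            String block = String.join("\n", records.subList(i, end));
            byte[] compressed = gzip(block.getBytes(StandardCharsets.UTF_8));
            out.write(compressed, 0, compressed.length);
        }
        starts.add(out.size());
        int[] offsets = new int[starts.size()];
        for (int i = 0; i < offsets.length; i++) {
            offsets[i] = starts.get(i);
        }
        return new Container(out.toByteArray(), offsets);
    }

    // A "mapper" can decode block i without touching any earlier block.
    static String readBlock(Container c, int i) {
        return gunzip(c.bytes, c.offsets[i], c.offsets[i + 1] - c.offsets[i]);
    }

    // True if a gzip reader can start decoding at the given byte offset.
    static boolean readableFrom(byte[] data, int offset) {
        try {
            gunzip(data, offset, data.length - offset);
            return true;
        } catch (UncheckedIOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList("r1", "r2", "r3", "r4", "r5");
        Container c = writeBlocks(records, 2);
        System.out.println("block 1 on its own: " + readBlock(c, 1).replace("\n", ","));

        byte[] whole = gzip(String.join("\n", records).getBytes(StandardCharsets.UTF_8));
        System.out.println("single gzip stream readable from the middle: "
            + readableFrom(whole, whole.length / 2));
    }
}
```

In the block layout every block begins with its own gzip header, so a reader handed only the second or third block can still decode it. In the single-stream layout a reader that starts anywhere but byte 0 has no header to start from, which is why a plain .gz input ends up with one mapper per file.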
cheers,
- Patrick

On Mon, Jul 12, 2010 at 2:42 PM, Segel, Mike <mse...@navteq.com> wrote:
> How can you say zip files are 'best codecs' to use?
>
> Call me silly but I seem to recall that if you're using a zip'd file for
> input you can't really use a file splitter?
> (Going from memory, which isn't the best thing to do...)
>
> -Mike
>
>
> -----Original Message-----
> From: Stephen Watt [mailto:sw...@us.ibm.com]
> Sent: Monday, July 12, 2010 1:28 PM
> To: common-dev@hadoop.apache.org
> Subject: Hadoop Compression - Current Status
>
> Please let me know if any of my assertions are incorrect. I'm going to be
> adding any feedback to the Hadoop Wiki. It seems well documented that the
> LZO Codec is the most performant codec
> (http://blog.oskarsson.nu/2009/03/hadoop-feat-lzo-save-disk-space-and.html)
> but it is GPL infected and thus it is separately maintained here -
> http://github.com/kevinweil/hadoop-lzo.
>
> With regards to performance, and if you are not using sequential files,
> Gzip is the next best codec to use, followed by bzip2. Hadoop has
> supported being able to process bzip2 and gzip input formats for a while
> now but it could never split the files, i.e. it assigned one mapper per
> file. There are now 2 new features:
> - Splitting bzip2 files available in 0.21.0 -
> https://issues.apache.org/jira/browse/HADOOP-4012
> - Splitting gzip files (in progress but patch available) -
> https://issues.apache.org/jira/browse/MAPREDUCE-491
>
> 1) It appears most folks are using LZO. Given that it is GPL, are you not
> worried about it virally infecting your project?
> 2) Is anyone using the new bzip2 or gzip file split compatible readers?
> How do you like them? General feedback?
>
> Kind regards
> Steve Watt