How can you say zip files are 'best codecs' to use?

Call me silly but I seem to recall that if you're using a zip'd file for input 
you can't really use a file splitter?
(Going from memory, which isn't the best thing to do...)

-Mike


-----Original Message-----
From: Stephen Watt [mailto:sw...@us.ibm.com] 
Sent: Monday, July 12, 2010 1:28 PM
To: common-dev@hadoop.apache.org
Subject: Hadoop Compression - Current Status

Please let me know if any of these assertions are incorrect. I'm going to be 
adding any feedback to the Hadoop Wiki. It seems well documented that the 
LZO Codec is the most performant codec (
http://blog.oskarsson.nu/2009/03/hadoop-feat-lzo-save-disk-space-and.html) 
but it is GPL infected and thus it is separately maintained here - 
http://github.com/kevinweil/hadoop-lzo. 

With regard to performance, and if you are not using SequenceFiles, 
Gzip is the next best codec to use, followed by bzip2. Hadoop has 
been able to process bzip2 and gzip input formats for a while 
now, but it could never split the files, i.e. it assigned one mapper per 
file. There are now 2 new features:
- Splitting bzip2 files available in 0.21.0 - 
https://issues.apache.org/jira/browse/HADOOP-4012
- Splitting gzip files (in progress but patch available) - 
https://issues.apache.org/jira/browse/MAPREDUCE-491
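
To make the "one mapper per file" point concrete, here is a rough sketch of 
how the map task count changes once a codec becomes splittable. It is a 
simplified model, not Hadoop's actual split logic: the function name, the 
64 MB split size, and the file sizes are all hypothetical, and it ignores 
details such as split slop and block locality.

```python
import math


def estimate_map_tasks(file_sizes, split_size, splittable):
    """Roughly estimate map tasks for input files (sizes in bytes).

    Assumption: a non-splittable file always goes to a single mapper,
    while a splittable file is cut into ceil(size / split_size) pieces.
    Real Hadoop split computation has more rules than this.
    """
    if split_size <= 0:
        raise ValueError("split_size must be positive")
    total = 0
    for size in file_sizes:
        if splittable:
            # At least one task per file, even for very small files.
            total += max(1, math.ceil(size / split_size))
        else:
            # The whole compressed file must be read from the start.
            total += 1
    return total


MB = 1024 * 1024
files = [1024 * MB, 256 * MB]  # hypothetical compressed input files

print(estimate_map_tasks(files, 64 * MB, splittable=False))  # 2
print(estimate_map_tasks(files, 64 * MB, splittable=True))   # 20
```

Under these assumptions the same two files go from 2 mappers to 20, which 
is why splitting support matters for large compressed inputs.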

1) It appears most folks are using LZO. Given that it is GPL, are you not 
worried about it virally infecting your project?
2) Is anyone using the new bzip2 or gzip file split compatible readers? 
How do you like them? General feedback?

Kind regards
Steve Watt


