Please let me know if any of these assertions are incorrect. I'm going to be adding any feedback to the Hadoop Wiki.

It seems well documented that the LZO codec is the most performant codec (http://blog.oskarsson.nu/2009/03/hadoop-feat-lzo-save-disk-space-and.html), but it is GPL-licensed and is therefore maintained separately here - http://github.com/kevinweil/hadoop-lzo.
With regard to performance, if you are not using SequenceFiles, gzip is the next best codec to use, followed by bzip2. Hadoop has been able to process bzip2 and gzip input for a while now, but it could never split the files; that is, it assigned one mapper per file.

There are now two new features:

- Splitting bzip2 files, available in 0.21.0 - https://issues.apache.org/jira/browse/HADOOP-4012
- Splitting gzip files (in progress, but a patch is available) - https://issues.apache.org/jira/browse/MAPREDUCE-491

1) It appears most folks are using LZO. Given that it is GPL, are you not worried about it virally infecting your project?

2) Is anyone using the new split-compatible bzip2 or gzip readers? How do you like them? Any general feedback?

Kind regards,
Steve Watt
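
P.S. To make the "one mapper per file" point concrete, here is a rough back-of-the-envelope sketch in Python. It is only an illustration under simple assumptions: the function name `estimated_map_tasks` and the 64 MB split size are my own choices for the example, and it ignores how Hadoop really computes splits (block locality, minimum/maximum split settings, and so on).

```python
import math


def estimated_map_tasks(file_sizes, split_size, splittable):
    """Roughly estimate how many map tasks a set of input files produces.

    file_sizes: sizes of the input files in bytes
    split_size: assumed input split size in bytes
    splittable: True if the input format/codec can be split
    """
    if splittable:
        # Each file is cut into split_size pieces; every file gets at least one task.
        return sum(max(1, math.ceil(size / split_size)) for size in file_sizes)
    # A non-splittable file is read start to finish by a single mapper.
    return len(file_sizes)


one_gib = 1024 ** 3          # a single 1 GiB compressed input file
split = 64 * 1024 ** 2       # assumed 64 MiB split size

print(estimated_map_tasks([one_gib], split, splittable=False))
print(estimated_map_tasks([one_gib], split, splittable=True))
```

Under these assumptions a single large non-splittable file is handled by one mapper, while a splittable one is spread over as many mappers as there are splits, which is why the two JIRA issues above matter for large compressed inputs.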