Hi Sreenath,

All the points made on this thread are valid. However, I wanted to add that you should keep in mind that Gzip compression is not splittable; this follows from the very nature of the codec. So if your input data contains Gzip files larger than the HDFS block size, Hadoop cannot split these files, and each entire file is sent to a single mapper. This reduces the parallelism, and hence the performance, of the job.
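For archive readers, here is a minimal sketch of one way to keep final output compressed yet splittable: write to a block-compressed SequenceFile table instead of gzipped text. The property names are from the Hadoop 1.x / Hive 0.8 era this thread dates from, and the table names are hypothetical; verify both against your own distribution before relying on them.

```sql
-- Sketch (2012-era property names; check your distribution's docs).
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET mapred.output.compression.type=BLOCK;

-- A block-compressed SequenceFile stays splittable, unlike a raw .gz
-- text file; logs_seq / logs_text are placeholder table names.
CREATE TABLE logs_seq STORED AS SEQUENCEFILE
AS SELECT * FROM logs_text;
```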
As Vinod mentioned, Snappy is getting some traction. Definitely worth a shot!

Good luck!
Mark

On Wed, Jun 6, 2012 at 2:07 PM, Vinod Singh <vi...@vinodsingh.com> wrote:

> But it may pay off by saving on network IO while copying the data during
> the reduce phase, though that will vary from case to case. We had good
> results using the Snappy codec for compressing map output. Snappy provides
> reasonably good compression at a faster rate.
>
> Thanks,
> Vinod
>
> http://blog.vinodsingh.com/
>
>
> On Wed, Jun 6, 2012 at 4:03 PM, Debarshi Basak <debarshi.ba...@tcs.com> wrote:
>
>> Compression is an overhead when you have a CPU-intensive job.
>>
>> Debarshi Basak
>> Tata Consultancy Services
>> Mailto: debarshi.ba...@tcs.com
>> Website: http://www.tcs.com
>>
>> -----Bejoy Ks wrote: -----
>>
>> To: "user@hive.apache.org" <user@hive.apache.org>
>> From: Bejoy Ks <bejoy...@yahoo.com>
>> Date: 06/06/2012 03:37PM
>> Subject: Re: Compressed data storage in HDFS - Error
>>
>> Hi Sreenath
>>
>> Output compression is most useful at the storage level: when a larger
>> file is compressed it occupies fewer HDFS blocks, and the cluster
>> thereby becomes more scalable in terms of the number of files it holds.
>>
>> Yes, the lzo libraries need to be present on all task tracker nodes, as
>> well as on the node that hosts the Hive client.
>>
>> Regards
>> Bejoy KS
>>
>> ------------------------------
>> *From:* Sreenath Menon <sreenathmen...@gmail.com>
>> *To:* user@hive.apache.org; Bejoy Ks <bejoy...@yahoo.com>
>> *Sent:* Wednesday, June 6, 2012 3:25 PM
>> *Subject:* Re: Compressed data storage in HDFS - Error
>>
>> Hi Bejoy
>>
>> I would like to make this clear: there is no gain in processing
>> throughput/time from compressing the data stored in HDFS (not talking
>> about intermediate compression)...right?
>> And do I need to add the lzo libraries in Hadoop_Home/lib/native on all
>> the nodes (including the slave nodes)?
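A footnote for archive readers on Bejoy's answer: besides copying the native libraries to every node, the LZO codec classes usually have to be registered in core-site.xml. The sketch below assumes the class names shipped by the common hadoop-lzo build of that era; verify them against the jar actually deployed on your cluster.

```xml
<!-- core-site.xml fragment (sketch; class names assume the hadoop-lzo
     build and must match the jar on your classpath) -->
<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec</value>
</property>
```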