Out of curiosity, why not bzip2, which is splittable? Will definitely try out Snappy in the meantime. Thanks!
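For the archives, here is the minimal sketch I plan to try first. These are the Hadoop 1.x-era property names (newer releases use the mapreduce.* equivalents), so double-check them against your version:

    -- enable bzip2 compression for final Hive job output
    -- (bzip2 is splittable, but compresses and decompresses slowly)
    SET hive.exec.compress.output=true;
    SET mapred.output.compress=true;
    SET mapred.output.compression.codec=org.apache.hadoop.io.compress.BZip2Codec;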
@dennylee | http://about.me/dennylee

On Jun 8, 2012, at 8:42 PM, Raja Thiruvathuru <thiruvath...@gmail.com> wrote:

> Agree with Mark.
>
> On Fri, Jun 8, 2012 at 5:08 PM, Mark Grover <grover.markgro...@gmail.com> wrote:
>
> Hi Sreenath,
> All the points made on this thread are very valid. However, I wanted to add
> that you should keep in mind that Gzip compression is not splittable, by the
> very nature of the codec. So, if your input data contains Gzip files larger
> than the HDFS block size, Hadoop won't be able to split those files, and each
> entire file will be sent to a single mapper. This reduces the performance of
> the job.
>
> As Vinod mentioned, Snappy is getting some traction. Definitely worth a shot!
>
> Good luck!
> Mark
>
> On Wed, Jun 6, 2012 at 2:07 PM, Vinod Singh <vi...@vinodsingh.com> wrote:
>
> But it may pay off by saving on network IO while copying the data during the
> reduce phase, though that will vary from case to case. We had good results
> using the Snappy codec for compressing map output. Snappy provides reasonably
> good compression at a faster rate.
>
> Thanks,
> Vinod
>
> http://blog.vinodsingh.com/
>
> On Wed, Jun 6, 2012 at 4:03 PM, Debarshi Basak <debarshi.ba...@tcs.com> wrote:
>
> Compression is an overhead when you have a CPU-intensive job.
>
> Debarshi Basak
> Tata Consultancy Services
> Mailto: debarshi.ba...@tcs.com
> Website: http://www.tcs.com
>
> -----Bejoy Ks wrote: -----
> To: "user@hive.apache.org" <user@hive.apache.org>
> From: Bejoy Ks <bejoy...@yahoo.com>
> Date: 06/06/2012 03:37PM
> Subject: Re: Compressed data storage in HDFS - Error
>
> Hi Sreenath,
>
> Output compression is most useful at the storage level: when a large file is
> compressed it occupies fewer HDFS blocks, and the cluster thereby becomes
> more scalable in terms of the number of files it can hold.
>
> Yes, the LZO libraries need to be present on all TaskTracker nodes, as well
> as on the node that hosts the Hive client.
>
> Regards,
> Bejoy KS
>
> From: Sreenath Menon <sreenathmen...@gmail.com>
> To: user@hive.apache.org; Bejoy Ks <bejoy...@yahoo.com>
> Sent: Wednesday, June 6, 2012 3:25 PM
> Subject: Re: Compressed data storage in HDFS - Error
>
> Hi Bejoy,
> I would like to make this clear:
> There is no gain in processing throughput/time from compressing the data
> stored in HDFS (not talking about intermediate compression)... right?
> And do I need to add the LZO libraries in HADOOP_HOME/lib/native on all the
> nodes (including the slave nodes)?
>
> --
> Raja Thiruvathuru
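P.S. For anyone who finds this thread later, here is a minimal sketch of the map-output Snappy settings Vinod describes (again the Hadoop 1.x-era property names; the native Snappy libraries must be installed on every task node):

    -- compress map output within each MR job (the shuffle)
    SET mapred.compress.map.output=true;
    SET mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
    -- also compress Hive's intermediate data between chained MR jobs
    SET hive.exec.compress.intermediate=true;

Map-output compression only affects the shuffle, so it trades a little CPU for less disk and network IO between the map and reduce phases.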