Out of curiosity, why not bzip2, which is splittable? Will definitely try out Snappy in the meantime. Thanks!
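For the archives, here is the minimal sketch I plan to try first. These are the Hadoop 1.x-era property names (newer releases use the mapreduce.* equivalents), so double-check them against your version:

    -- enable bzip2 compression for final Hive job output
    -- (bzip2 is splittable, but compresses and decompresses slowly)
    SET hive.exec.compress.output=true;
    SET mapred.output.compress=true;
    SET mapred.output.compression.codec=org.apache.hadoop.io.compress.BZip2Codec;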
@dennylee | http://about.me/dennylee

On Jun 8, 2012, at 8:42 PM, Raja Thiruvathuru <thiruvath...@gmail.com> wrote:

> Agree with Mark.
>
> On Fri, Jun 8, 2012 at 5:08 PM, Mark Grover <grover.markgro...@gmail.com> wrote:
>
> Hi Sreenath,
> All the points made on this thread are very valid. However, I wanted to add
> that you should keep in mind that Gzip compression is not splittable, by the
> very nature of the codec. So, if your input data contains Gzip files larger
> than the HDFS block size, Hadoop won't be able to split those files, and each
> entire file will be sent to a single mapper. This reduces the performance of
> the job.
>
> As Vinod mentioned, Snappy is getting some traction. Definitely worth a shot!
>
> Good luck!
> Mark
>
> On Wed, Jun 6, 2012 at 2:07 PM, Vinod Singh <vi...@vinodsingh.com> wrote:
>
> But it may pay off by saving on network IO while copying the data during the
> reduce phase, though that will vary from case to case. We had good results
> using the Snappy codec for compressing map output. Snappy provides reasonably
> good compression at a faster rate.
>
> Thanks,
> Vinod
>
> http://blog.vinodsingh.com/
>
> On Wed, Jun 6, 2012 at 4:03 PM, Debarshi Basak <debarshi.ba...@tcs.com> wrote:
>
> Compression is an overhead when you have a CPU-intensive job.
>
> Debarshi Basak
> Tata Consultancy Services
> Mailto: debarshi.ba...@tcs.com
> Website: http://www.tcs.com
>
> -----Bejoy Ks wrote: -----
> To: "user@hive.apache.org" <user@hive.apache.org>
> From: Bejoy Ks <bejoy...@yahoo.com>
> Date: 06/06/2012 03:37PM
> Subject: Re: Compressed data storage in HDFS - Error
>
> Hi Sreenath,
>
> Output compression is most useful at the storage level: when a large file is
> compressed it occupies fewer HDFS blocks, and the cluster thereby becomes
> more scalable in terms of the number of files it can hold.
>
> Yes, the LZO libraries need to be present on all TaskTracker nodes, as well
> as on the node that hosts the Hive client.
>
> Regards,
> Bejoy KS
>
> From: Sreenath Menon <sreenathmen...@gmail.com>
> To: user@hive.apache.org; Bejoy Ks <bejoy...@yahoo.com>
> Sent: Wednesday, June 6, 2012 3:25 PM
> Subject: Re: Compressed data storage in HDFS - Error
>
> Hi Bejoy,
> I would like to make this clear:
> There is no gain in processing throughput/time from compressing the data
> stored in HDFS (not talking about intermediate compression)... right?
> And do I need to add the LZO libraries in HADOOP_HOME/lib/native on all the
> nodes (including the slave nodes)?
>
> --
> Raja Thiruvathuru
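P.S. For anyone who finds this thread later, here is a minimal sketch of the map-output Snappy settings Vinod describes (again the Hadoop 1.x-era property names; the native Snappy libraries must be installed on every task node):

    -- compress map output within each MR job (the shuffle)
    SET mapred.compress.map.output=true;
    SET mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
    -- also compress Hive's intermediate data between chained MR jobs
    SET hive.exec.compress.intermediate=true;

Map-output compression only affects the shuffle, so it trades a little CPU for less disk and network IO between the map and reduce phases.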