Snappy vs LZO - 
Implementing LZO takes several steps, starting with building the hadoop-lzo 
library, which we eventually got built. Indexing had to be done as a separate 
step, and LZO indexing changes the way the files are stored, so we could not 
use Hadoop's built-in mapper. Snappy, on the other hand, comes packaged with 
Cloudera. Since we are using the Cloudera distribution, this makes sense for 
us. LZO compresses better than Snappy, but that was acceptable for us because 
performance is better with Snappy sequence files than with LZO.

RCFile vs SequenceFile - we would have gone with RCFile for all the reasons 
given below, but, as Bejoy said, SequenceFile is more widely used. It also 
looks like Sqoop may support SequenceFile with Hive import, and since we use 
Sqoop a lot, SequenceFile is the better choice for us.

We also tested converting back and forth between compression codecs and 
between file formats. Since that works, we can switch the compression or 
file format later if we need to.
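
As a rough illustration of such a switch, the Python sketch below builds the 
HiveQL one might run to copy a table into a different file format and codec. 
The table names and the helper build_conversion_script are hypothetical and 
not from our setup. The SET properties are the older MapReduce 
output-compression names; treat them as an assumption and check the names 
used by your Hadoop version.

```python
# Sketch only: the helper and the table names are hypothetical.
CODECS = {
    "snappy": "org.apache.hadoop.io.compress.SnappyCodec",
    "default": "org.apache.hadoop.io.compress.DefaultCodec",
    "gzip": "org.apache.hadoop.io.compress.GzipCodec",
}
FORMATS = {
    "sequencefile": "SEQUENCEFILE",
    "rcfile": "RCFILE",
    "textfile": "TEXTFILE",
}


def build_conversion_script(source, target, file_format, codec):
    """Return HiveQL that copies `source` into a new table `target`
    stored as `file_format` with output compressed by `codec`."""
    statements = [
        # Ask Hive to compress the files written by the query.
        "SET hive.exec.compress.output=true",
        # Block compression applies to SequenceFile output.
        "SET mapred.output.compression.type=BLOCK",
        "SET mapred.output.compression.codec=" + CODECS[codec],
        # CREATE TABLE ... AS SELECT rewrites the data in the new format.
        "CREATE TABLE {0} STORED AS {1} AS SELECT * FROM {2}".format(
            target, FORMATS[file_format], source
        ),
    ]
    return ";\n".join(statements) + ";"


if __name__ == "__main__":
    print(build_conversion_script("events_rc", "events_seq", "sequencefile", "snappy"))
```

Running the same helper with the arguments reversed gives the script for 
going back the other way, which is the round trip we tested by hand.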

Thanks,
Chalcy

-----Original Message-----
From: yongqiang he [mailto:heyongqiang...@gmail.com] 
Sent: Wednesday, June 27, 2012 12:41 AM
To: user@hive.apache.org
Subject: Re: hive - snappy and sequence file vs RC file

Can you share your reasons for choosing Snappy as your compression codec?
As @omalley mentioned, RCFile will compress the data more densely and will 
avoid reading data not required by your Hive query. And I think Facebook uses 
it to store tens of PB (if not hundreds of PB) of data.

Thanks
Yongqiang
On Tue, Jun 26, 2012 at 9:49 AM, Owen O'Malley <omal...@apache.org> wrote:
> SequenceFile compared to RCFile:
>   * More widely deployed.
>   * Available from MapReduce and Pig
>   * Doesn't compress as small (in RCFile all of each column's values 
>     are put together)
>   * Uncompresses and deserializes all of the columns, even if you are 
>     only reading a few
>
> In either case, for long term storage, you should seriously consider 
> the default codec since that will provide much tighter compression (at 
> the cost of cpu to compress it).
>
> -- Owen
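
A small Python sketch of the two compression points above, under stated 
assumptions: the table is synthetic, the row and column layouts are plain 
comma-separated text rather than the real SequenceFile or RCFile formats, and 
zlib level 1 merely stands in for a fast codec while level 9 stands in for 
tighter, more CPU-hungry compression. It only shows how to measure the 
effect; actual numbers depend on the data.

```python
import zlib


def make_rows(n):
    """Synthetic (id, country, status) rows, purely illustrative."""
    countries = ["US", "IN", "DE", "BR"]
    return [
        (str(i), countries[i % 4], "error" if i % 7 == 0 else "ok")
        for i in range(n)
    ]


def row_layout(rows):
    """Row-major bytes: the fields of each record sit next to each other."""
    return "\n".join(",".join(row) for row in rows).encode("utf-8")


def column_layout(rows):
    """Column-major bytes: all values of one column are stored together."""
    columns = zip(*rows)
    return "\n".join(",".join(col) for col in columns).encode("utf-8")


def compressed_size(payload, level):
    """Size in bytes after zlib compression at the given level (1 to 9)."""
    return len(zlib.compress(payload, level))


if __name__ == "__main__":
    rows = make_rows(10000)
    layouts = (("row", row_layout(rows)), ("column", column_layout(rows)))
    for name, payload in layouts:
        print(
            name,
            "raw:", len(payload),
            "level 1:", compressed_size(payload, 1),
            "level 9:", compressed_size(payload, 9),
        )
```

Both layouts hold exactly the same values, so any difference in compressed 
size comes only from how the bytes are arranged and how hard the codec works.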
