Can you share your reason for choosing Snappy as your compression codec? As @omalley mentioned, RCFile compresses the data more densely and avoids reading data that your Hive query does not need. I believe Facebook uses it to store tens of PB (if not hundreds of PB) of data.

Thanks
Yongqiang

On Tue, Jun 26, 2012 at 9:49 AM, Owen O'Malley <omal...@apache.org> wrote:
> SequenceFile compared to RCFile:
> * More widely deployed.
> * Available from MapReduce and Pig
> * Doesn't compress as small (in RCFile all of each column's values are put
>   together)
> * Uncompresses and deserializes all of the columns, even if you are only
>   reading a few
>
> In either case, for long-term storage, you should seriously consider the
> default codec since that will provide much tighter compression (at the cost
> of CPU to compress it).
>
> -- Owen
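
To make the "reads only the columns you need" point above concrete, here is a toy sketch in Python. It is not the RCFile or SequenceFile on-disk format: the three-column table, the use of zlib as the codec, and the helper names are all assumptions made for illustration. It only shows why a layout that compresses each column separately can answer a one-column query without touching the other columns' bytes.

```python
import random
import zlib

rng = random.Random(0)  # fixed seed so repeated runs build the same table

# Hypothetical table (user_id, country, payload); not taken from the thread.
rows = [
    (str(i), rng.choice(["US", "IN", "BR", "DE"]), "%016x" % rng.getrandbits(64))
    for i in range(1000)
]

# Row-oriented layout: whole records are serialized back to back and
# compressed as one stream.
row_blob = zlib.compress("\n".join(",".join(r) for r in rows).encode())

# Column-oriented layout: each column's values are grouped and compressed
# on their own.
column_chunks = [zlib.compress("\n".join(col).encode()) for col in zip(*rows)]


def countries_from_rows(blob):
    # Every record has to be decompressed and split to reach one field.
    text = zlib.decompress(blob).decode()
    return [line.split(",")[1] for line in text.split("\n")]


def countries_from_columns(chunks):
    # Only the chunk holding the requested column is decompressed.
    return zlib.decompress(chunks[1]).decode().split("\n")


expected = [r[1] for r in rows]
assert countries_from_rows(row_blob) == expected
assert countries_from_columns(column_chunks) == expected

print("compressed bytes read, row layout:   ", len(row_blob))
print("compressed bytes read, column layout:", len(column_chunks[1]))
```

Both functions return the same country list, but the column-oriented read decompresses only the country chunk, while the row-oriented read has to decompress the random payload column as well. The exact sizes depend on the synthetic data and say nothing about real Hive tables.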