Knowing that SequenceFiles can store data (especially numeric data) much
more compactly than text, I started converting our Hive database from
LZO-compressed text format to LZO-compressed SequenceFiles.

My first observation was that the files were not smaller, which surprised
me, since we have mostly numerical data, which has a more compact binary
representation.

So then I issued some "describe extended" queries to poke around in the
SequenceFile format used by Hive. It seems that 1) the keys are not used,
and 2) all the values are simply stored as a Text Writable? Is this simply
a copy of the textual representation that was used in the text files? That
would explain why the data did not get any smaller. But it would also
defeat all the benefits of SequenceFiles, no?
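
To illustrate the size difference I was expecting, here is a minimal sketch
in plain Python. It is not Hive or Hadoop code: the sample row, the column
types, and the helper names are made up for illustration. The only thing
taken from Hive is the default field delimiter for text rows (Ctrl-A). The
sizes are before compression, so LZO may narrow the gap in practice.

```python
import struct

# Hypothetical row: two 32-bit ints, one double, one 32-bit int.
ROW = (1234567890, 987654321, 3.14159265358979, 42)

def text_encode(row):
    """Encode a row the way a delimited text file would:
    decimal strings joined by Ctrl-A, terminated by a newline."""
    return ("\x01".join(str(v) for v in row) + "\n").encode("utf-8")

def binary_encode(row):
    """Encode the same row as fixed-width big-endian binary:
    int32, int32, float64, int32 (no padding with the '>' prefix)."""
    return struct.pack(">iidi", *row)

text_bytes = text_encode(ROW)
binary_bytes = binary_encode(ROW)

print("text bytes:  ", len(text_bytes))
print("binary bytes:", len(binary_bytes))
```

Note that small values such as 42 are actually shorter as text than as a
fixed-width int, so the saving depends on how large the numbers typically
are.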

Thanks,
Koert
