Hello,

My questions, in short, are:
- Why are SequenceFiles bigger than text files (given that they are binary)?
- Compression does not appear to produce a smaller sequence file than
the original text file.

-- Here is sample data that is transferred into the tables below with
an INSERT OVERWRITE:
A       09:33:30        N       38.75   109100  0       522486  40
A       09:33:31        M       38.75   200     0       0       0
A       09:33:31        M       38.75   100     0       0       0
A       09:33:31        M       38.75   100     0       0       0
A       09:33:31        M       38.75   100     0       0       0
A       09:33:31        M       38.75   100     0       0       0
A       09:33:31        M       38.75   500     0       0       0

-- So, focusing on columns 4 and 5:
-- text representation: columns 4 and 5 are 5 and 3 bytes long respectively (5 + 3 = 8 bytes total).
-- binary representation: columns 4 and 5 are 4 bytes each (FLOAT + INT, 4 + 4 = 8 bytes total).
-- NOTE: I drop the last 3 columns in the table representation.
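The comparison above can be sketched as follows (an illustrative calculation only; Hive's actual on-disk layout adds per-row framing and metadata on top of these raw column sizes):

```python
import struct

# One sample row's price and volume columns, as they appear in the text file.
price_text, volume_text = "38.75", "200"

# Text representation: one byte per character.
text_bytes = len(price_text) + len(volume_text)             # 5 + 3 = 8

# Binary representation: a 4-byte float plus a 4-byte int.
binary_bytes = struct.calcsize("f") + struct.calcsize("i")  # 4 + 4 = 8

print(text_bytes, binary_bytes)
```

So for these two columns the binary encoding saves nothing over the text encoding, which is part of why a binary container is not automatically smaller.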

-- The original size of one sample partition was 132MB ... extract from `ls`:
132M 2011-01-16 18:20 data/2001-05-22

-- ... so I set the following Hive variables:

set hive.exec.compress.output=true;
set hive.merge.mapfiles = false;
set io.seqfile.compression.type = BLOCK;

-- ... and create the following table.
CREATE TABLE alltrades
      (symbol STRING, time STRING, exchange STRING, price FLOAT, volume INT)
PARTITIONED BY (dt STRING)
CLUSTERED BY (symbol)
SORTED BY (time ASC)
INTO 4 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE ;

-- ... now the table is split into 2 files. (Shouldn't this be 4?
-- But that was discussed in my previous mail to this group.)
-- The bucket files total 17.5MB:
9,009,080 2011-01-18 05:32
/user/hive/warehouse/alltrades/dt=2001-05-22/attempt_201101180504_0013_m_000000_0.deflate
8,534,264 2011-01-18 05:32
/user/hive/warehouse/alltrades/dt=2001-05-22/attempt_201101180504_0013_m_000001_0.deflate

-- ... so I wondered: what would happen if I used SEQUENCEFILE instead?
CREATE TABLE alltrades
      (symbol STRING, time STRING, exchange STRING, price FLOAT, volume INT)
PARTITIONED BY (dt STRING)
CLUSTERED BY (symbol)
SORTED BY (time ASC)
INTO 4 BUCKETS
STORED AS SEQUENCEFILE;

-- ... this created files totalling 193MB (larger even than the
original)!
99,751,137 2011-01-18 05:24
/user/hive/warehouse/alltrades/dt=2001-05-22/attempt_201101180504_0007_m_000000_0
93,859,644 2011-01-18 05:24
/user/hive/warehouse/alltrades/dt=2001-05-22/attempt_201101180504_0007_m_000001_0
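For scale, here is a rough back-of-the-envelope estimate of SequenceFile framing overhead. The numbers below are my assumptions, not measurements: an uncompressed SequenceFile record layout of a 4-byte record length plus a 4-byte key length per record, a 20-byte sync marker roughly every 2KB of output, and an average serialized row size of about 40 bytes:

```python
# Rough, assumption-laden estimate of per-record framing overhead in an
# uncompressed SequenceFile. Assumed layout: 4-byte record length +
# 4-byte key length per record, plus a 20-byte sync marker about every
# 2KB of output. Row size of 40 bytes is a guess from the sample data.
avg_row_bytes = 40
original_bytes = 132 * 1024 * 1024   # the 132MB text partition

rows = original_bytes / avg_row_bytes
framing_bytes = rows * 8                          # length fields per record
sync_bytes = original_bytes / 2048 * 20           # periodic sync markers

overhead_mb = (framing_bytes + sync_bytes) / (1024 * 1024)
print(round(overhead_mb, 1))
```

Under these assumptions the framing accounts for only ~28MB of growth, so it cannot by itself explain 132MB becoming 193MB; key serialization and other per-record costs must contribute as well.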

So, in summary:
Why are sequence files bigger than the original text files?


-Ajo
