Hello,

My questions, in short, are:
- Why are sequence files bigger than text files (considering that they are binary)?
- Why does compression not seem to make a sequence file smaller than the original text file?

-- Here is some sample data that is transferred into the tables below with an INSERT OVERWRITE:
A 09:33:30 N 38.75 109100 0 522486 40
A 09:33:31 M 38.75 200 0 0 0
A 09:33:31 M 38.75 100 0 0 0
A 09:33:31 M 38.75 100 0 0 0
A 09:33:31 M 38.75 100 0 0 0
A 09:33:31 M 38.75 100 0 0 0
A 09:33:31 M 38.75 500 0 0 0

-- So, focusing on columns 4 and 5:
-- text representation: columns 4 and 5 are 5 + 3 = 8 bytes long together.
-- binary representation: columns 4 and 5 are 4 + 4 = 8 bytes long together.
-- NOTE: I drop the last 3 columns in the table representation.

-- The original size of one sample partition was 132MB ... extract from <ls>:
132M 2011-01-16 18:20 data/2001-05-22

-- ... so I set the following hive variables:
set hive.exec.compress.output=true;
set hive.merge.mapfiles = false;
set io.seqfile.compression.type = BLOCK;

-- ... and create the following table.
CREATE TABLE alltrades (symbol STRING, time STRING, exchange STRING, price FLOAT, volume INT)
PARTITIONED BY (dt STRING)
CLUSTERED BY (symbol) SORTED BY (time ASC) INTO 4 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

-- ... now the table is split into 2 files. (!! shouldn't this be 4 ... but that is discussed in the previous mail to this group)
-- The bucket files total 17.5MB.
9,009,080 2011-01-18 05:32 /user/hive/warehouse/alltrades/dt=2001-05-22/attempt_201101180504_0013_m_000000_0.deflate
8,534,264 2011-01-18 05:32 /user/hive/warehouse/alltrades/dt=2001-05-22/attempt_201101180504_0013_m_000001_0.deflate

-- ... so I wondered what would happen if I used SEQUENCEFILE instead:
CREATE TABLE alltrades (symbol STRING, time STRING, exchange STRING, price FLOAT, volume INT)
PARTITIONED BY (dt STRING)
CLUSTERED BY (symbol) SORTED BY (time ASC) INTO 4 BUCKETS
STORED AS SEQUENCEFILE;

-- ... this created files that total 193MB (larger even than the original)!!
99,751,137 2011-01-18 05:24 /user/hive/warehouse/alltrades/dt=2001-05-22/attempt_201101180504_0007_m_000000_0
93,859,644 2011-01-18 05:24 /user/hive/warehouse/alltrades/dt=2001-05-22/attempt_201101180504_0007_m_000001_0

So, in summary: why are sequence files bigger than the original?

-Ajo
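
For reference, the byte-count arithmetic for columns 4 and 5 and the file-size totals quoted above can be checked with a minimal Python sketch. The variable and function names are illustrative only, and the sketch assumes the price is a 4-byte float and the volume a 4-byte int, as in the table definition. It only verifies the numbers in this mail; it makes no claim about how Hive actually serializes rows inside a sequence file.

```python
import struct

# Columns 4 and 5 of one sample row from the mail (price, volume).
price_text, volume_text = "38.75", "100"

# Text representation: one byte per ASCII character.
text_bytes = len(price_text.encode("ascii")) + len(volume_text.encode("ascii"))

# Binary representation (assumption): 4-byte float plus 4-byte signed int.
binary_bytes = (len(struct.pack(">f", float(price_text)))
                + len(struct.pack(">i", int(volume_text))))

# File sizes in bytes, copied from the two directory listings in the mail.
deflate_text_files = [9_009_080, 8_534_264]
sequence_files = [99_751_137, 93_859_644]


def total_mb(sizes):
    """Return the combined size in megabytes (10^6 bytes)."""
    return sum(sizes) / 1e6


if __name__ == "__main__":
    print("text bytes for columns 4+5:  ", text_bytes)
    print("binary bytes for columns 4+5:", binary_bytes)
    print("compressed text files (MB):  ", round(total_mb(deflate_text_files), 1))
    print("sequence files (MB):         ", round(total_mb(sequence_files), 1))
```

For the sample row, both representations come to 8 bytes for these two columns, so a binary encoding of just these fields would not by itself explain the size difference between the two tables.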