I have read http://wiki.apache.org/hadoop/Hive/CompressedStorage and am trying to compress some of our tables. The only difference from the example there is that our tables are partitioned.
This is what I am running:

SET io.seqfile.compression.type=BLOCK;
SET hive.exec.compress.output=true;
SET mapred.output.compress=true;

insert overwrite table keywords_lzo
partition (dated = '2010-01-25', client = 'TESTCLIENT')
select account,campaign,ad_group,keyword_id,keyword,match_type,status,
  first_page_bid,quality_score,distribution,max_cpc,destination_url,
  ad_group_status,campaign_status,currency_code,impressions,clicks,ctr,
  cpc,cost,avg_position,account_id,campaign_id,adgroup_id
from keywords
where dated = '2010-01-25' and client = 'TESTCLIENT';

The keywords_lzo table was created by running:

CREATE TABLE `keywords_lzo` (
  `account` STRING,
  `campaign` STRING,
  `ad_group` STRING,
  `keyword_id` STRING,
  `keyword` STRING,
  `match_type` STRING,
  `status` STRING,
  `first_page_bid` STRING,
  `quality_score` FLOAT,
  `distribution` STRING,
  `max_cpc` FLOAT,
  `destination_url` STRING,
  `ad_group_status` STRING,
  `campaign_status` STRING,
  `currency_code` STRING,
  `impressions` INT,
  `clicks` INT,
  `ctr` FLOAT,
  `cpc` STRING,
  `cost` FLOAT,
  `avg_position` FLOAT,
  `account_id` STRING,
  `campaign_id` STRING,
  `adgroup_id` STRING
)
PARTITIONED BY (
  `dated` STRING,
  `client` STRING
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  LINES TERMINATED BY '\n'
STORED AS SEQUENCEFILE;

The problem is that the output is 12 files totaling the same size as the input CSV (~750MB). I get the same behaviour with either LZO or GZIP. If I use TEXTFILE instead, I get the compression I would expect.

The head of each output file shows:

SEQ "org.apache.hadoop.io.BytesWritable org.apache.hadoop.io.Text �'org.apache.hadoop.io.compress.GzipCodec

and the rest does look like garbage, so something is happening, but the total size is no smaller than the CSV.

Are the 12 output files significant? They are ~60MB each and the block size is 64MB.

I also tried SET io.seqfile.compression.type=RECORD; but still got 12 files and no reduction in size.

Thanks,
Tom
