Anyone have an idea on this? Anyone using compression with sequence files successfully?
The wiki and Hadoop: The Definitive Guide suggest that the below is correct, so I am at a loss to explain what we are seeing.

Tom

On Fri, May 6, 2011 at 5:39 PM, Tom Hall <[email protected]> wrote:
> I have read http://wiki.apache.org/hadoop/Hive/CompressedStorage and
> am trying to compress some of our tables. The only difference from the
> example is that we have partitioned our tables.
>
> SET io.seqfile.compression.type=BLOCK;
> SET hive.exec.compress.output=true;
> SET mapred.output.compress=true;
> insert overwrite table keywords_lzo partition (dated = '2010-01-25',
> client = 'TESTCLIENT') select
> account,campaign,ad_group,keyword_id,keyword,match_type,status,first_page_bid,quality_score,distribution,max_cpc,destination_url,ad_group_status,campaign_status,currency_code,impressions,clicks,ctr,cpc,cost,avg_position,account_id,campaign_id,adgroup_id
> from keywords where dated = '2010-01-25' and client = 'TESTCLIENT';
>
> The keywords_lzo table was created by running:
>
> CREATE TABLE `keywords_lzo` (
>   `account` STRING,
>   `campaign` STRING,
>   `ad_group` STRING,
>   `keyword_id` STRING,
>   `keyword` STRING,
>   `match_type` STRING,
>   `status` STRING,
>   `first_page_bid` STRING,
>   `quality_score` FLOAT,
>   `distribution` STRING,
>   `max_cpc` FLOAT,
>   `destination_url` STRING,
>   `ad_group_status` STRING,
>   `campaign_status` STRING,
>   `currency_code` STRING,
>   `impressions` INT,
>   `clicks` INT,
>   `ctr` FLOAT,
>   `cpc` STRING,
>   `cost` FLOAT,
>   `avg_position` FLOAT,
>   `account_id` STRING,
>   `campaign_id` STRING,
>   `adgroup_id` STRING
> )
> PARTITIONED BY (
>   `dated` STRING,
>   `client` STRING
> )
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY '\t'
> LINES TERMINATED BY '\n'
> STORED AS SEQUENCEFILE;
>
> The problem is that the output is 12 files totaling the same size as
> the input CSV (~750MB). With either LZO or GZIP I get the same
> behaviour. If I use TEXTFILE then I get the compression I would
> expect.
>
> I can see in the head of the file
> SEQ "org.apache.hadoop.io.BytesWritable org.apache.hadoop.io.Text
> �'org.apache.hadoop.io.compress.GzipCodec
> and it does look like garbage, so something is happening, but the total
> size is not reduced from the CSV.
>
> Are the 12 output files significant? They are ~60M each and the block
> size is 64M.
> I also tried SET io.seqfile.compression.type=RECORD; but still got 12
> files and no reduction in size.
>
> Thanks,
> Tom
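
In case it helps anyone reproduce the header check quoted above, here is a rough Python sketch that decodes the start of a SequenceFile and reports whether it declares RECORD or BLOCK compression. It assumes the usual header layout as I understand it for file versions 5 and 6 (the "SEQ" magic, a version byte, the key and value class names, a compression flag, a block-compression flag, then the codec class name). The function names and the local file path argument are made up for illustration, and it only handles class names short enough for a one-byte length prefix.

```python
import io
import struct
import sys


def _read_string(stream):
    """Read a length-prefixed UTF-8 string (one-byte length only, a simplification)."""
    (length,) = struct.unpack("b", stream.read(1))
    if length < 0:
        raise ValueError("multi-byte length prefixes are not handled in this sketch")
    return stream.read(length).decode("utf-8")


def read_seqfile_header(stream):
    """Return the fields of a SequenceFile header (assumed version 5 or 6 layout)."""
    if stream.read(3) != b"SEQ":
        raise ValueError("not a SequenceFile: missing SEQ magic")
    version = stream.read(1)[0]
    if version < 5:
        raise ValueError("this sketch only covers header versions 5 and later")
    key_class = _read_string(stream)
    value_class = _read_string(stream)
    compressed = stream.read(1) != b"\x00"
    block_compressed = stream.read(1) != b"\x00"
    # The codec class name is only present when the compression flag is set.
    codec = _read_string(stream) if compressed else None
    if block_compressed:
        compression_type = "BLOCK"
    elif compressed:
        compression_type = "RECORD"
    else:
        compression_type = "NONE"
    return {
        "version": version,
        "key_class": key_class,
        "value_class": value_class,
        "compression_type": compression_type,
        "codec": codec,
    }


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: pass a local copy of one output file,
    # e.g. one fetched with `hadoop fs -get`.
    with open(sys.argv[1], "rb") as f:
        print(read_seqfile_header(f))
```

If the reported type is RECORD even after setting BLOCK, that would at least show the setting is not reaching the writer, which is a different problem from the codec itself doing a poor job.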
