Hadoop 0.20.2, Hive 0.7.X

I got convinced that installing google-snappy would be awesome, so I spent the day it took to build snappy and patch it in. I actually found that I did not get good compression from snappy: the output was 30% smaller, vs. 50% smaller with gzip. That is another story.
I decided to start playing with:

    set io.seqfile.compress.blocksize=10000000;

since all the tuning blogs on the internet suggest it. (They also commonly misname variables; http://code.google.com/p/hadoop-snappy/, for example, writes "compression" where it should be "compress".)

Check this out.

    set io.seqfile.compression.type=BLOCK;
    set mapred.compress.map.output=true;
    set mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
    set hive.exec.compress.output=true;
    set mapred.output.compress=true;
    set mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

    set io.seqfile.compress.blocksize=10000000;
    create table act_seq_snappy_10mbblock stored as sequencefile as
      select * from fracture_act
      where hit_date=20120106 and mid>001400 and mid<001420;

    set io.seqfile.compress.blocksize=20000000;
    create table act_seq_snappy_20mbblock stored as sequencefile as
      select * from fracture_act
      where hit_date=20120106 and mid>001400 and mid<001420;

    hive> dfs -count hdfs://rs01.hadoop.pvt:34310/user/hive/warehouse/act_seq_snappy_20mbblock
        > ;
    1 2 414559506 hdfs://rs01.hadoop.pvt:34310/user/hive/warehouse/act_seq_snappy_20mbblock

    hive> dfs -count hdfs://rs01.hadoop.pvt:34310/user/hive/warehouse/act_seq_snappy_10mbblock
        > ;
    1 2 414559506 hdfs://rs01.hadoop.pvt:34310/user/hive/warehouse/act_seq_snappy_10mbblock

How can it be that choosing two different block sizes results in exactly the same file size?

I also tried:

    set io.seqfile.compression.type=BLOCK;
    set mapred.output.compression.type=BLOCK;

I also tried gzip instead of snappy.

Has anyone ever actually experienced io.seqfile.compress.blocksize working? My first idea is that Hive is swallowing this somehow and not passing it along to Hadoop. However, after reading all the "performance blogs" talking about it, I am mildly convinced this variable does nothing, since those blogs rarely even get variable names right.

Edward
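
For reference, here is a toy sketch of why byte-identical sizes look suspicious. It assumes BLOCK compression works roughly by buffering records until the buffer passes the configured size and then compressing the buffer as one unit. This is not Hadoop's SequenceFile code: zlib stands in for the codec, and `block_compressed_size`, `make_records` and the row layout are made up for illustration.

```python
import zlib


def block_compressed_size(records, block_size):
    """Total compressed bytes when records are buffered into blocks.

    Records are appended to a buffer; once the buffer holds at least
    block_size raw bytes it is compressed as one unit and emptied.
    """
    total = 0
    buf = bytearray()
    for rec in records:
        buf += rec
        if len(buf) >= block_size:
            total += len(zlib.compress(bytes(buf)))
            buf.clear()
    if buf:  # flush the last, partially filled block
        total += len(zlib.compress(bytes(buf)))
    return total


def make_records(n):
    """Made-up tab-separated rows, only loosely shaped like a hit log."""
    return [
        ("20120106\t%06d\tuser%d\t/page/%d\n"
         % (1400 + i % 20, i % 500, i % 37)).encode()
        for i in range(n)
    ]


if __name__ == "__main__":
    rows = make_records(20000)
    # Print the total compressed size for a few block sizes.
    for size in (1000, 10000, 100000):
        print(size, block_compressed_size(rows, size))
```

In this toy model, changing the block size changes where the block boundaries fall, so the totals generally move with it. If the real writer behaves similarly, two runs with different settings producing exactly 414559506 bytes would suggest the setting never reached the writer, though that remains a guess.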