Hi Hadi,

The properties you specified don't enable compression of map output. To enable map output compression, you need to set the following properties:
SET hive.exec.compress.output=true;
SET mapred.compress.map.output=true;
SET mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

The property 'hive.exec.compress.intermediate' is used to enable compression of the data passed between the multiple MapReduce jobs generated by a Hive query.

Regards
Bejoy KS

Sent from handheld, please excuse typos.

-----Original Message-----
From: Hadi Moshayedi <h...@moshayedi.net>
Date: Sat, 6 Oct 2012 16:55:47
To: <user@hive.apache.org>
Reply-To: user@hive.apache.org
Subject: Compression of Intermediate Data

I wanted to look into improving the performance of my Hive cluster, and from what I read, turning on compression of intermediate data could help. As I understand it, this would help because it reduces the amount of data written to disk in between jobs.

I looked at the documentation and applied the following settings:

SET hive.exec.compress.intermediate=true;
SET mapred.output.compression.type=BLOCK;
SET mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

I ran some queries to see how compression affects performance, but it usually made the query time worse. I also had a query whose intermediate data was close in size to the input data, and compression made performance worse for that query too.

Question 1: Are the above settings the correct ones for compressing intermediate data?

Question 2: Are there use cases in which compressing intermediate data would not help performance? Why would someone not keep it turned on always?

Thanks