I'm wondering whether my configuration/stack is wrong, or whether I'm trying to do something that Hive does not support. My goal is to choose a compression scheme for Hadoop/Hive, and while comparing configurations I'm finding that I can't get BZip2 or Gzip to work with the RCFile format. Is using BZip2 or Gzip with RCFile supported?
LZO appears to be the fastest solution, at the price of not compressing as well as deflate, but I'm wondering whether the problems I'm seeing with Gzip and BZip2 mean that it's futile to explore LZO because of some stack or configuration problem or bug.

My stack is:
  Cloudera (CDH3B3)
  Hive "0.7", i.e. SVN version r1065698

Is anyone aware of a bug or incompatibility between those versions of those tools that could produce what I'm seeing?

It looks like the best info on LZO compression is at these links:
  https://github.com/toddlipcon/hadoop-lzo
  http://www.cloudera.com/blog/2009/11/hadoop-at-twitter-part-1-splittable-lzo-compression/

If anyone knows of other useful docs for configuring compression in general (or LZO on CDH3B3 specifically), please let me know.

I've enclosed some test code below that shows the options I'm using to successfully produce an RCFile in one case and to use BZip2 in another, but I get an error when combining those options. "Dual" is just a table with one row and one column which I use to generate test data.

As always, any help is appreciated - even an example like what I have below that shows RCFiles working with BZip2 and/or LZO.

Thanks.
DETAILS
------------------------------------------------------------------------------------------------------------------------
-- This succeeds: reading a default-codec compressed RC-File

DROP TABLE x_pwy_rctest;
CREATE TABLE x_pwy_rctest (
  col1 string,
  col2 string
)
--ROW FORMAT SERDE "org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe" -- seems redundant, turns out things work without specifying this
STORED AS RCFILE;

SET hive.exec.compress.output=true;
SET io.seqfile.compression.type=BLOCK;

INSERT OVERWRITE TABLE x_pwy_rctest
select t.* from (
  SELECT 'c1a', 'c2a' FROM DUAL union all
  SELECT 'c1b', 'c2b' FROM DUAL union all
  SELECT 'c1c', 'c2c' FROM DUAL union all
  SELECT 'c1d', 'c2d' FROM DUAL union all
  SELECT 'c1e', 'c2e' FROM DUAL union all
  SELECT 'c1f', 'c2f' FROM DUAL union all
  SELECT 'c1g', 'c2g' FROM DUAL union all
  SELECT 'c1h', 'c2h' FROM DUAL
) t;

select * from x_pwy_rctest;

------------------------------------------------------------------------------------------------------------------------
-- This fails: reading a BZip2 compressed RC-File
--ERROR: Failed with exception java.io.IOException:java.io.IOException: Stream is not BZip2 formatted: expected 'h' as first byte but got '�'

DROP TABLE x_pwy_rctest;
CREATE TABLE x_pwy_rctest (
  col1 string,
  col2 string
)
ROW FORMAT SERDE "org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe" -- seems redundant, works without this
STORED AS RCFILE;

SET io.seqfile.compression.type=BLOCK;
SET hive.exec.compress.output=true;
SET mapred.compress.map.output=true;
SET mapred.output.compress=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.BZip2Codec;

INSERT OVERWRITE TABLE x_pwy_rctest
select t.* from (
  SELECT 'c1a', 'c2a' FROM DUAL union all
  SELECT 'c1b', 'c2b' FROM DUAL union all
  SELECT 'c1c', 'c2c' FROM DUAL union all
  SELECT 'c1d', 'c2d' FROM DUAL union all
  SELECT 'c1e', 'c2e' FROM DUAL union all
  SELECT 'c1f', 'c2f' FROM DUAL union all
  SELECT 'c1g', 'c2g' FROM DUAL union all
  SELECT 'c1h', 'c2h' FROM DUAL
) t;

select * from x_pwy_rctest;

Failed with exception java.io.IOException:java.io.IOException: Stream is not BZip2 formatted: expected 'h' as first byte but got '�'

------------------------------------------------------------------------------------------------------------------------
-- This succeeds: Using BZip2 with a Sequence File

DROP TABLE x_pwy_seqtest;
CREATE TABLE x_pwy_seqtest (
  col1 string,
  col2 string
)
STORED AS SEQUENCEFILE;

SET io.seqfile.compression.type=BLOCK;
SET hive.exec.compress.output=true;
SET mapred.compress.map.output=true;
SET mapred.output.compress=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.BZip2Codec;

INSERT OVERWRITE TABLE x_pwy_seqtest
select t.* from (
  SELECT 'c1a', 'c2a' FROM DUAL union all
  SELECT 'c1b', 'c2b' FROM DUAL union all
  SELECT 'c1c', 'c2c' FROM DUAL union all
  SELECT 'c1d', 'c2d' FROM DUAL union all
  SELECT 'c1e', 'c2e' FROM DUAL union all
  SELECT 'c1f', 'c2f' FROM DUAL union all
  SELECT 'c1g', 'c2g' FROM DUAL union all
  SELECT 'c1h', 'c2h' FROM DUAL
) t;

select * from x_pwy_seqtest;

------------------------------------------------------------------------------------------------------------------------
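The tests above only show the BZip2 case. For reference, here is a minimal sketch of the settings for the Gzip variant, assuming it mirrors the BZip2 tests with only the codec class swapped. org.apache.hadoop.io.compress.GzipCodec is the stock Hadoop Gzip codec; the exact settings used for the Gzip runs are not shown above, so treat this as an assumption rather than the configuration that was actually run.

```sql
-- Assumed Gzip variant of the BZip2 settings used in the tests above.
-- Only the codec line differs; everything else is copied from those tests.
SET io.seqfile.compression.type=BLOCK;
SET hive.exec.compress.output=true;
SET mapred.compress.map.output=true;
SET mapred.output.compress=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
```

These settings would go in place of the SET block in either the RCFILE or the SEQUENCEFILE test, leaving the CREATE TABLE and INSERT OVERWRITE statements unchanged.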