Hi folks,

Does anyone have experience using bz2-based compressed tables? I have the following .q file:
==
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.dynamic.partition=true;
SET hive.exec.max.dynamic.partitions=500;
SET hive.exec.max.dynamic.partitions.pernode=500;
SET hive.exec.compress.output=true;
SET mapred.output.compress=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.BZip2Codec;
SET mapred.map.output.compression.codec=org.apache.hadoop.io.compress.BZip2Codec;
SET io.seqfile.compression.type=BLOCK;
SET mapred.output.compression.type=BLOCK;
SET mapred.compress.map.output=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.dynamic.partition=true;
SET hive.exec.max.dynamic.partitions=500;
SET hive.exec.max.dynamic.partitions.pernode=500;
SET mapred.child.java.opts=-Xmx2048m;
SET mapred.reduce.tasks=40;
SET hive.mapred.reduce.tasks.speculative.execution=false;

CREATE TABLE stopwords_rcf_bzip2 (word STRING) STORED AS RCFILE;
INSERT OVERWRITE TABLE stopwords_rcf_bzip2 SELECT * FROM stopwords;
==

(where stopwords is a pre-existing textfile-based table with words loaded in from the standard Linux dictionary.)

After doing this, the write succeeds, and the output appears to be compressed (tested by doing a hadoop fs -get and manual inspection - it has RCFile headers, some metadata naming the bz2 compression classes, and then the BZ marker followed by binary data). If I try reading the table, though, I get the following error:

==
hive -e 'select * from stopwords_rcf_bzip2 limit 20;'
WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. Please use org.apache.hadoop.log.metrics.EventCounter in all the log4j.properties files.
Logging initialized using configuration in jar:file:/usr/lib/hive/lib/hive-common-0.10.0.23.jar!/hive-log4j.properties
Hive history file=/tmp/sush/hive_job_log_hrt_qa_201304031715_2012500346.txt
OK
Failed with exception java.io.IOException:java.io.IOException: Stream is not BZip2 formatted: expected 'h' as first byte but got '#'
Time taken: 2.582 seconds
==

If I run the same commands with the Gzip codecs instead of BZip2, everything works fine. Does anyone have any idea what I'm doing wrong? Is this a bug we need to fix?

When I replace rcfile with textfile, it doesn't work either, except that this time, instead of an IOException about the stream not being a bz2 stream, I just get garbled binary output from the select.

Thanks,
-Sushanth
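For what it's worth, the "expected 'h' as first byte" message refers to the BZip2 magic: a raw BZip2 stream begins with the bytes "BZh" plus a block-size digit, so the reader is evidently being handed data that does not start at a BZip2 stream boundary. A quick local sketch of that check (assuming bzip2 and coreutils are on the PATH; the warehouse path in the comment is just an illustrative guess, not a real location):

```shell
# A BZip2 stream starts with the magic "BZh" followed by a block-size digit (1-9).
printf 'hello' | bzip2 -c | head -c 3    # prints: BZh

# The analogous inspection of a fetched part file would look at its first bytes,
# e.g. (hypothetical warehouse path):
#   hadoop fs -get /user/hive/warehouse/stopwords_rcf_bzip2/000000_0 .
#   head -c 4 000000_0    # an RCFile begins with its own header, not "BZh"
```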