RCfile is not working with BZip2. Interested in using LZO in general.

phil young Wed, 02 Mar 2011 18:30:02 -0800

I'm wondering if my configuration/stack is wrong, or if I'm trying to do
something that is not supported in Hive.
My goal is to choose a compression scheme for Hadoop/Hive. While comparing
configurations, I'm finding that I can't get BZip2 or Gzip to work with the
RCfile format.
Is using BZip2 or Gzip with RCfile supported?


LZO appears to be the fastest solution, at the price of not compressing as
well as deflate, but I'm wondering whether the problems I'm seeing with Gzip
and BZip2 mean that it's futile to explore LZO because of some stack or
configuration problem or bug.

My stack is:
  Cloudera (CDH3B3)
  Hive "0.7", i.e. SVN version r1065698

Is anyone aware of a bug or incompatibility between those versions of those
tools that could produce what I'm seeing?
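
One quick check on the stack question is to print the relevant properties from
within the Hive CLI. A minimal sketch, assuming the standard Hadoop property
names; SET followed by a bare property name prints its current value:

```sql
-- Codec classes registered with Hadoop for this session.
SET io.compression.codecs;

-- Output-compression settings that the tests below change.
SET hive.exec.compress.output;
SET mapred.output.compress;
SET mapred.output.compression.codec;
```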

It looks like the best info on LZO-compression is at these links:
https://github.com/toddlipcon/hadoop-lzo
http://www.cloudera.com/blog/2009/11/hadoop-at-twitter-part-1-splittable-lzo-compression/
If anyone knows of other useful docs for configuring compression in general
(or LZO on CDH3B3 specifically), please let me know.
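
For reference, the hadoop-lzo README linked above registers its codecs through
io.compression.codecs and io.compression.codec.lzo.class in core-site.xml.
Assuming that library is installed on every node, the session-level settings
would presumably mirror the BZip2 ones used in the tests below, with only the
codec class swapped. This is an untested sketch, and whether it works with
RCfile is exactly the open question:

```sql
-- Assumes hadoop-lzo (jar and native libraries) is installed cluster-wide
-- and com.hadoop.compression.lzo.LzoCodec is listed in io.compression.codecs.
SET hive.exec.compress.output=true;
SET mapred.output.compress=true;
SET mapred.output.compression.codec=com.hadoop.compression.lzo.LzoCodec;
```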


I've enclosed some test code below that shows the options I'm using to
successfully produce an RC-file in one case, and use BZip2 in another, but I
get an error when combining those options.
"Dual" is just a table with one row and one column which I use to generate
test data.
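
A table like Dual could be created along these lines. The column name "dummy"
and the path /tmp/dual.txt are illustrative assumptions, not taken from the
original setup:

```sql
-- Hypothetical setup for the one-row, one-column helper table.
CREATE TABLE dual (dummy STRING);

-- /tmp/dual.txt is assumed to contain a single line, e.g. "X".
LOAD DATA LOCAL INPATH '/tmp/dual.txt' OVERWRITE INTO TABLE dual;

-- Should return exactly one row.
SELECT * FROM dual;
```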


As always, any help is appreciated - even an example like what I have below
that shows RCfiles working with BZip2 and/or LZO.

Thanks.








DETAILS



------------------------------------------------------------------------------------------------------------------------
-- This succeeds: reading a default-codec compressed RC-File

DROP TABLE x_pwy_rctest;

CREATE TABLE x_pwy_rctest
(
  col1 string,
  col2 string
)
--ROW FORMAT SERDE "org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe" -- seems redundant, turns out things work without specifying this
STORED AS RCFILE;

SET hive.exec.compress.output=true;

SET io.seqfile.compression.type=BLOCK;

INSERT OVERWRITE TABLE x_pwy_rctest
select t.*
from
(
SELECT 'c1a', 'c2a' FROM DUAL union all
SELECT 'c1b', 'c2b' FROM DUAL union all
SELECT 'c1c', 'c2c' FROM DUAL union all
SELECT 'c1d', 'c2d' FROM DUAL union all
SELECT 'c1e', 'c2e' FROM DUAL union all
SELECT 'c1f', 'c2f' FROM DUAL union all
SELECT 'c1g', 'c2g' FROM DUAL union all
SELECT 'c1h', 'c2h' FROM DUAL
) t;

select * from x_pwy_rctest;

------------------------------------------------------------------------------------------------------------------------
-- This fails: reading a BZip2 compressed RC-File
-- ERROR:
-- Failed with exception java.io.IOException:java.io.IOException: Stream is not
-- BZip2 formatted: expected 'h' as first byte but got '�'


DROP TABLE x_pwy_rctest;

CREATE TABLE x_pwy_rctest
(
  col1 string,
  col2 string
)
ROW FORMAT SERDE "org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe" -- seems redundant, works without this
STORED AS RCFILE;

SET io.seqfile.compression.type=BLOCK;
SET hive.exec.compress.output=true;

SET mapred.compress.map.output=true;

SET mapred.output.compress=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.BZip2Codec;


INSERT OVERWRITE TABLE x_pwy_rctest
select t.*
from
(
SELECT 'c1a', 'c2a' FROM DUAL union all
SELECT 'c1b', 'c2b' FROM DUAL union all
SELECT 'c1c', 'c2c' FROM DUAL union all
SELECT 'c1d', 'c2d' FROM DUAL union all
SELECT 'c1e', 'c2e' FROM DUAL union all
SELECT 'c1f', 'c2f' FROM DUAL union all
SELECT 'c1g', 'c2g' FROM DUAL union all
SELECT 'c1h', 'c2h' FROM DUAL
) t;

select * from x_pwy_rctest;

Failed with exception java.io.IOException:java.io.IOException: Stream is not
BZip2 formatted: expected 'h' as first byte but got '�'

------------------------------------------------------------------------------------------------------------------------
-- This succeeds: Using BZip2 with a Sequence File

DROP TABLE x_pwy_seqtest;

CREATE TABLE x_pwy_seqtest
(
  col1 string,
  col2 string
)
STORED AS SEQUENCEFILE;

SET io.seqfile.compression.type=BLOCK;
SET hive.exec.compress.output=true;

SET mapred.compress.map.output=true;

SET mapred.output.compress=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.BZip2Codec;


INSERT OVERWRITE TABLE x_pwy_seqtest
select t.*
from
(
SELECT 'c1a', 'c2a' FROM DUAL union all
SELECT 'c1b', 'c2b' FROM DUAL union all
SELECT 'c1c', 'c2c' FROM DUAL union all
SELECT 'c1d', 'c2d' FROM DUAL union all
SELECT 'c1e', 'c2e' FROM DUAL union all
SELECT 'c1f', 'c2f' FROM DUAL union all
SELECT 'c1g', 'c2g' FROM DUAL union all
SELECT 'c1h', 'c2h' FROM DUAL
) t;

select * from x_pwy_seqtest;

------------------------------------------------------------------------------------------------------------------------
