Not sure what I did wrong the first time, but I created a table stored as
textfile using my custom SerDe, so it had a format line of:

  ROW FORMAT SERDE 'org.myorg.hadoop.hive.udf.MySerDe' STORED AS textfile

Then I loaded a gzipped file using LOAD DATA LOCAL INPATH 'path.gz' INTO TABLE
mytable and it worked as expected, i.e. the file was read and I'm able to query
it using Hive.
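
For the archives, the full sequence that worked looks roughly like this (the
column list is just a placeholder; the real table uses the columns my SerDe
exposes):

  CREATE TABLE mytable (line STRING)
  ROW FORMAT SERDE 'org.myorg.hadoop.hive.udf.MySerDe'
  STORED AS textfile;

  LOAD DATA LOCAL INPATH 'path.gz' INTO TABLE mytable;

Hive's default text input format spots the .gz extension and decompresses the
stream before the SerDe's deserialize() is called, which is why no custom
InputFormat ended up being necessary.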

Sorry to bother you, and thanks a bunch for the help!  Forcing me to go read
more about InputFormats is a long-term help anyway.

Pat

From: phil young [mailto:phil.wills.yo...@gmail.com]
Sent: Friday, January 28, 2011 1:54 PM
To: user@hive.apache.org
Subject: Re: Custom SerDe Question

To be clear, you would then create the table with the clause:

STORED AS
  INPUTFORMAT 'your.custom.input.format'


If you make an external table, you'll then be able to point it at a directory
(or file) that contains gzipped or uncompressed files.
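
Roughly, the full DDL would look like this (class names, columns, and location
are placeholders for your own; note that Hive requires an OUTPUTFORMAT whenever
you specify an INPUTFORMAT, and HiveIgnoreKeyTextOutputFormat is the usual
choice for text):

  CREATE EXTERNAL TABLE mytable (line STRING)
  ROW FORMAT SERDE 'your.custom.SerDe'
  STORED AS
    INPUTFORMAT 'your.custom.input.format'
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
  LOCATION '/path/to/data';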



On Fri, Jan 28, 2011 at 4:52 PM, phil young <phil.wills.yo...@gmail.com> wrote:
This can be accomplished with a custom input format.

Here's a snippet of the relevant code in the custom RecordReader:

    compressionCodecs = new CompressionCodecFactory(jobConf);
    Path file = split.getPath();

    // a codec is returned only when the file extension matches one
    // (e.g. .gz -> GzipCodec); null means the file is uncompressed
    final CompressionCodec codec = compressionCodecs.getCodec(file);

    // open the file and seek to the start of the split
    start = split.getStart();
    end = start + split.getLength();
    pos = 0;

    FileSystem fs = file.getFileSystem(jobConf);
    fsdat = fs.open(split.getPath());
    fsdat.seek(start);

    // wrap the raw stream in a decompressing stream when a codec matched
    if (codec != null) {
        fsin = codec.createInputStream(fsdat);
    } else {
        fsin = fsdat;
    }
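
The InputFormat that wraps a reader like this is mostly boilerplate. A rough
sketch, assuming the reader above lives in a class called MyRecordReader with a
(FileSplit, JobConf) constructor (the names are placeholders, and it uses the
old mapred API since that's what Hive expects):

    import java.io.IOException;

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.compress.CompressionCodecFactory;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileSplit;
    import org.apache.hadoop.mapred.InputSplit;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.RecordReader;
    import org.apache.hadoop.mapred.Reporter;

    public class MyInputFormat extends FileInputFormat<LongWritable, Text> {

        @Override
        protected boolean isSplitable(FileSystem fs, Path file) {
            // gzip is not splittable, so only allow splits on files
            // that no compression codec claims
            return new CompressionCodecFactory(fs.getConf()).getCodec(file) == null;
        }

        @Override
        public RecordReader<LongWritable, Text> getRecordReader(
                InputSplit split, JobConf conf, Reporter reporter) throws IOException {
            return new MyRecordReader((FileSplit) split, conf);
        }
    }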

On Fri, Jan 28, 2011 at 1:57 PM, Christopher, Pat <patrick.christop...@hp.com> wrote:
Hi,
I've written a SerDe and I'd like it to be able to handle compressed data (gzip).
Hadoop detects and decompresses on the fly, so if you have a compressed data set
and you don't need to perform any custom interpretation of it as you go, Hadoop
and Hive will handle it.  Is there a way to get Hive to notice the data is
compressed, decompress it, and then push it through the custom SerDe?  Or will I
have to either
  a. add some decompression logic to my SerDe (possibly impossible), or
  b. decompress the data before pushing it into a table with my SerDe?

Thanks!

Pat

