This can be accomplished with a custom input format.

Here's a snippet of the relevant code from the custom RecordReader:

            compressionCodecs = new CompressionCodecFactory(jobConf);
            Path file = split.getPath();
            final CompressionCodec codec = compressionCodecs.getCodec(file);

            // Work out the byte range this split covers and open the file
            // at the start of the split.
            start = split.getStart();
            end = start + split.getLength();
            pos = 0;

            FileSystem fs = file.getFileSystem(jobConf);
            fsdat = fs.open(file);
            fsdat.seek(start);

            // If a codec is registered for the file's extension, wrap the
            // raw stream in a decompressing one; otherwise read the raw
            // bytes directly. Gzip is not splittable, so a compressed file
            // has to arrive as a single whole-file split (start == 0) for
            // the codec stream to be valid.
            if (codec != null)
            {
                fsin = codec.createInputStream(fsdat);
            }
            else
            {
                fsin = fsdat;
            }
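
For completeness, here is a minimal sketch of what the matching next() could
look like. It assumes line-oriented records and that the constructor wraps
fsin in an org.apache.hadoop.util.LineReader stored in a field named "in";
it also glosses over the usual trick of skipping the first partial line when
start != 0. The names are illustrative, not from the actual code:

            public synchronized boolean next(LongWritable key, Text value)
                throws IOException
            {
                // An uncompressed split ends at the split boundary; a
                // compressed (gzip) file arrives as one whole-file split,
                // so in that case just read until end of stream.
                if (codec == null && pos >= end)
                {
                    return false;
                }
                key.set(pos);
                int newSize = in.readLine(value);
                if (newSize == 0)
                {
                    return false; // end of stream
                }
                pos += newSize;
                return true;
            }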

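The surrounding InputFormat is mostly boilerplate. Here is a rough sketch
against the old org.apache.hadoop.mapred API (which is what Hive expects),
where MyInputFormat and MyRecordReader are placeholder names. The one
important piece is isSplitable(): gzip cannot be split, so each compressed
file must go to a single mapper as one whole-file split, or the seek above
would land in the middle of a gzip stream.

            import java.io.IOException;

            import org.apache.hadoop.fs.FileSystem;
            import org.apache.hadoop.fs.Path;
            import org.apache.hadoop.io.LongWritable;
            import org.apache.hadoop.io.Text;
            import org.apache.hadoop.io.compress.CompressionCodecFactory;
            import org.apache.hadoop.mapred.FileInputFormat;
            import org.apache.hadoop.mapred.FileSplit;
            import org.apache.hadoop.mapred.InputSplit;
            import org.apache.hadoop.mapred.JobConf;
            import org.apache.hadoop.mapred.RecordReader;
            import org.apache.hadoop.mapred.Reporter;

            public class MyInputFormat extends FileInputFormat<LongWritable, Text>
            {
                @Override
                protected boolean isSplitable(FileSystem fs, Path file)
                {
                    // If no codec matches the file it is plain text and
                    // safe to split; otherwise hand the whole file to one
                    // mapper so the RecordReader's seek stays at offset 0.
                    return new CompressionCodecFactory(fs.getConf()).getCodec(file) == null;
                }

                @Override
                public RecordReader<LongWritable, Text> getRecordReader(
                        InputSplit split, JobConf job, Reporter reporter) throws IOException
                {
                    // MyRecordReader is the class whose constructor
                    // contains the snippet above.
                    return new MyRecordReader((FileSplit) split, job);
                }
            }

Hive is then pointed at the class in the CREATE TABLE statement via the
STORED AS INPUTFORMAT clause (paired with an OUTPUTFORMAT), alongside the
ROW FORMAT SERDE clause naming the custom SerDe.
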
On Fri, Jan 28, 2011 at 1:57 PM, Christopher, Pat <patrick.christop...@hp.com> wrote:

> Hi,
>
> I’ve written a SerDe and I’d like it to be able to handle compressed data
> (gzip). Hadoop detects and decompresses on the fly, so if you have a
> compressed data set and you don’t need to perform any custom
> interpretation of it as you go, Hadoop and Hive will handle it. Is there a
> way to get Hive to notice the data is compressed, decompress it, and then
> push it through the custom SerDe? Or will I have to either
>
>   a. add some decompression logic to my SerDe (possibly impossible)
>
>   b. decompress the data before pushing it into a table with my SerDe
>
> Thanks!
>
> Pat
