This is one of the gotchas with Hive: the key is not easily available. You are going to need an input format that creates a new value that contains both the key and the value, like this:

<url:Text> <data:CrawlDatum>  ->  <NullWritable> <new MyKeyValue(url:Text, data:CrawlDatum)>
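As a rough, untested sketch against the old mapred API that Hive expects -- MyKeyValue and MyKeyValueInputFormat are names I'm making up here, and I'm assuming the Nutch CrawlDatum type for the value:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapred.*;
import org.apache.nutch.crawl.CrawlDatum;

public class MyKeyValueInputFormat
    extends FileInputFormat<NullWritable, MyKeyValueInputFormat.MyKeyValue> {

  // Composite value carrying the sequence file's key (url) and its
  // value (datum) together, so the SerDe sees both.
  public static class MyKeyValue implements Writable {
    public Text url = new Text();
    public CrawlDatum datum = new CrawlDatum();

    public void write(DataOutput out) throws IOException {
      url.write(out);
      datum.write(out);
    }

    public void readFields(DataInput in) throws IOException {
      url.readFields(in);
      datum.readFields(in);
    }
  }

  @Override
  public RecordReader<NullWritable, MyKeyValue> getRecordReader(
      InputSplit split, JobConf job, Reporter reporter) throws IOException {
    // Delegate the actual reading to the stock sequence file reader...
    final SequenceFileRecordReader<Text, CrawlDatum> reader =
        new SequenceFileRecordReader<Text, CrawlDatum>(job, (FileSplit) split);

    // ...and repackage each (key, value) pair into a single value.
    return new RecordReader<NullWritable, MyKeyValue>() {
      public NullWritable createKey() { return NullWritable.get(); }
      public MyKeyValue createValue() { return new MyKeyValue(); }

      public boolean next(NullWritable key, MyKeyValue value) throws IOException {
        // Read the (url, datum) pair straight into the composite value.
        return reader.next(value.url, value.datum);
      }

      public long getPos() throws IOException { return reader.getPos(); }
      public float getProgress() throws IOException { return reader.getProgress(); }
      public void close() throws IOException { reader.close(); }
    };
  }
}

Your SerDe's deserialize() would then receive a MyKeyValue carrying both the url and the CrawlDatum, so the url can be exposed as a regular column.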
On Sat, May 5, 2012 at 4:05 PM, Ali Safdar Kureishy <safdar.kurei...@gmail.com> wrote:
> Hi,
>
> I have attached a Sequence file with the following format:
>
> <url:Text> <data:CrawlDatum>
>
> (CrawlDatum is a custom Java type that contains several fields, which would
> be flattened into several columns by the SerDe.)
>
> In other words, what I would like to do is expose this URL+CrawlDatum data
> via a Hive external table with the following columns:
>
> || url || status || fetchtime || fetchinterval || modifiedtime || retries || score || metadata ||
>
> So, I was hoping that after defining a custom SerDe, I would just have to
> define the Hive table as follows:
>
> CREATE EXTERNAL TABLE crawldb
> (url STRING, status STRING, fetchtime BIGINT, fetchinterval BIGINT,
>  modifiedtime BIGINT, retries INT, score FLOAT, metadata MAP<STRING,STRING>)
> ROW FORMAT SERDE 'NutchCrawlDBSequenceFileSerDe'
> STORED AS SEQUENCEFILE
> LOCATION '/user/training/deepcrawl/crawldb/current/part-00000';
>
> For example, a sample record should look like the following through a Hive
> table:
>
> || http://www.cnn.com || FETCHED || 125355734857 || 36000 || 12453775834 || 1 || 0.98 || {x=1,y=2,p=3,q=4} ||
>
> I would like this to be possible without having to duplicate/flatten the
> data through a separate transformation. Initially, I thought my custom SerDe
> could just have the following definition for deserialize():
>
> @Override
> public Object deserialize(Writable obj) throws SerDeException {
>   ...
> }
>
> But the problem is that the input argument obj above is only the VALUE
> portion of a Sequence record. There seems to be a limitation in the way Hive
> reads Sequence files: for each row in a sequence file, the KEY is ignored
> and only the VALUE is used by Hive. This can be seen in the
> org.apache.hadoop.hive.ql.exec.FetchOperator::getNextRow() method below,
> which ignores the KEY when iterating over a RecordReader (note the
> serde.deserialize(value) calls in the Hive code below):
>
> /**
>  * Get the next row. The fetch context is modified appropriately.
>  **/
> public InspectableObject getNextRow() throws IOException {
>   try {
>     while (true) {
>       if (currRecReader == null) {
>         currRecReader = getRecordReader();
>         if (currRecReader == null) {
>           return null;
>         }
>       }
>
>       boolean ret = currRecReader.next(key, value);
>       if (ret) {
>         if (this.currPart == null) {
>           Object obj = serde.deserialize(value);
>           return new InspectableObject(obj, serde.getObjectInspector());
>         } else {
>           rowWithPart[0] = serde.deserialize(value);
>           return new InspectableObject(rowWithPart, rowObjectInspector);
>         }
>       } else {
>         currRecReader.close();
>         currRecReader = null;
>       }
>     }
>   } catch (Exception e) {
>     throw new IOException(e);
>   }
> }
>
> As you can see, the "key" variable is ignored and never returned. The
> problem is that in the Nutch crawldb Sequence File the KEY is the URL, and I
> need it to be displayed in the Hive table along with the fields of
> CrawlDatum. But when writing the custom SerDe, I only see the CrawlDatum
> that comes after the key on each record... which is not sufficient.
>
> One hack could be to write a CustomSequenceFileRecordReader.java that
> returns the offset in the sequence file as the KEY, and an aggregation of
> the (Key+Value) as the VALUE.
> For that, perhaps I need to hack the code below from
> SequenceFileRecordReader, which will get really messy:
>
> protected synchronized boolean next(K key) throws IOException {
>   if (!more) return false;
>   long pos = in.getPosition();
>   boolean remaining = (in.next(key) != null);
>   if (pos >= end && in.syncSeen()) {
>     more = false;
>   } else {
>     more = remaining;
>   }
>   return more;
> }
>
> This would require me to write a CustomSequenceFileRecordReader and a
> CustomSequenceFileInputFormat, plus some custom SerDe, and probably make
> several other changes as well. Is it possible to get away with just writing
> a custom SerDe and some pre-existing reader that includes the key when
> invoking SerDe.deserialize()? Unless I'm missing something, why does Hive
> have this limitation when accessing Sequence files? I would imagine that the
> key of a sequence file record is just as important as the value... so why is
> it left out by the FetchOperator::getNextRow() method?
>
> If this is the unfortunate reality with reading sequence files in Hive, is
> there another Hive storage format I should use that works around this
> limitation? Something like "create external table ..... STORED AS
> CUSTOM_SEQUENCEFILE"? Or, let's say I write my own
> CustomHiveSequenceFileInputFormat, how do I register it with Hive and use it
> in the Hive "STORED AS" definition?
>
> Any help or pointers would be greatly appreciated. I hope I'm mistaken about
> the limitation above, and if not, hopefully there is an easy way to resolve
> this through a custom SerDe alone.
>
> Warm regards,
> Safdar
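To the question at the end about registering a custom input format: once the jar is on Hive's classpath (e.g. via ADD JAR), you can name the class directly in the table definition instead of STORED AS SEQUENCEFILE. A rough, untested sketch, reusing the hypothetical MyKeyValueInputFormat from above (in practice the class names must be fully qualified):

ADD JAR /path/to/my-inputformat.jar;

CREATE EXTERNAL TABLE crawldb
(url STRING, status STRING, fetchtime BIGINT, fetchinterval BIGINT,
 modifiedtime BIGINT, retries INT, score FLOAT, metadata MAP<STRING,STRING>)
ROW FORMAT SERDE 'NutchCrawlDBSequenceFileSerDe'
STORED AS INPUTFORMAT 'MyKeyValueInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat'
LOCATION '/user/training/deepcrawl/crawldb/current';

The OUTPUTFORMAT clause is required by the syntax but largely irrelevant for a read-only external table. Also note that LOCATION should point at the directory rather than a single part-00000 file; Hive will read all the part files under it.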