This is one of the gotchas with Hive: the key is not easily available. You are going to need an input format that creates a new value that contains both the key and the value, like this:

<url:Text> <data:CrawlDatum>  ->  <NullWritable> <new MyKeyValue(url:Text, data:CrawlDatum)>
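As a rough, untested sketch against the old mapred API that Hive expects -- MyKeyValue and MyKeyValueInputFormat are names I'm making up here, and I'm assuming the Nutch CrawlDatum type for the value:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapred.*;
import org.apache.nutch.crawl.CrawlDatum;

public class MyKeyValueInputFormat
    extends FileInputFormat<NullWritable, MyKeyValueInputFormat.MyKeyValue> {

  // Composite value carrying the sequence file's key (url) and its
  // value (datum) together, so the SerDe sees both.
  public static class MyKeyValue implements Writable {
    public Text url = new Text();
    public CrawlDatum datum = new CrawlDatum();

    public void write(DataOutput out) throws IOException {
      url.write(out);
      datum.write(out);
    }

    public void readFields(DataInput in) throws IOException {
      url.readFields(in);
      datum.readFields(in);
    }
  }

  @Override
  public RecordReader<NullWritable, MyKeyValue> getRecordReader(
      InputSplit split, JobConf job, Reporter reporter) throws IOException {
    // Delegate the actual reading to the stock sequence file reader...
    final SequenceFileRecordReader<Text, CrawlDatum> reader =
        new SequenceFileRecordReader<Text, CrawlDatum>(job, (FileSplit) split);

    // ...and repackage each (key, value) pair into a single value.
    return new RecordReader<NullWritable, MyKeyValue>() {
      public NullWritable createKey() { return NullWritable.get(); }
      public MyKeyValue createValue() { return new MyKeyValue(); }

      public boolean next(NullWritable key, MyKeyValue value) throws IOException {
        // Read the (url, datum) pair straight into the composite value.
        return reader.next(value.url, value.datum);
      }

      public long getPos() throws IOException { return reader.getPos(); }
      public float getProgress() throws IOException { return reader.getProgress(); }
      public void close() throws IOException { reader.close(); }
    };
  }
}

Your SerDe's deserialize() would then receive a MyKeyValue carrying both the url and the CrawlDatum, so the url can be exposed as a regular column.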
On Sat, May 5, 2012 at 4:05 PM, Ali Safdar Kureishy <safdar.kurei...@gmail.com> wrote:
> Hi,
>
> I have attached a Sequence file with the following format:
>
> <url:Text> <data:CrawlDatum>
>
> (CrawlDatum is a custom Java type that contains several fields, which would
> be flattened into several columns by the SerDe.)
>
> In other words, what I would like to do is expose this URL+CrawlDatum data
> via a Hive external table with the following columns:
>
> || url || status || fetchtime || fetchinterval || modifiedtime || retries || score || metadata ||
>
> So, I was hoping that after defining a custom SerDe, I would just have to
> define the Hive table as follows:
>
> CREATE EXTERNAL TABLE crawldb
> (url STRING, status STRING, fetchtime BIGINT, fetchinterval BIGINT,
>  modifiedtime BIGINT, retries INT, score FLOAT, metadata MAP<STRING,STRING>)
> ROW FORMAT SERDE 'NutchCrawlDBSequenceFileSerDe'
> STORED AS SEQUENCEFILE
> LOCATION '/user/training/deepcrawl/crawldb/current/part-00000';
>
> For example, a sample record should look like the following through a Hive
> table:
>
> || http://www.cnn.com || FETCHED || 125355734857 || 36000 || 12453775834 || 1 || 0.98 || {x=1,y=2,p=3,q=4} ||
>
> I would like this to be possible without having to duplicate/flatten the
> data through a separate transformation. Initially, I thought my custom SerDe
> could just have the following definition for deserialize():
>
> @Override
> public Object deserialize(Writable obj) throws SerDeException {
>   ...
> }
>
> But the problem is that the input argument obj above is only the VALUE
> portion of a Sequence record. There seems to be a limitation in the way Hive
> reads Sequence files: for each row in a sequence file, the KEY is ignored
> and only the VALUE is used by Hive. This can be seen in the
> org.apache.hadoop.hive.ql.exec.FetchOperator::getNextRow() method below,
> which ignores the KEY when iterating over a RecordReader (note the
> serde.deserialize(value) calls in the Hive code below):
>
> /**
>  * Get the next row. The fetch context is modified appropriately.
>  **/
> public InspectableObject getNextRow() throws IOException {
>   try {
>     while (true) {
>       if (currRecReader == null) {
>         currRecReader = getRecordReader();
>         if (currRecReader == null) {
>           return null;
>         }
>       }
>
>       boolean ret = currRecReader.next(key, value);
>       if (ret) {
>         if (this.currPart == null) {
>           Object obj = serde.deserialize(value);
>           return new InspectableObject(obj, serde.getObjectInspector());
>         } else {
>           rowWithPart[0] = serde.deserialize(value);
>           return new InspectableObject(rowWithPart, rowObjectInspector);
>         }
>       } else {
>         currRecReader.close();
>         currRecReader = null;
>       }
>     }
>   } catch (Exception e) {
>     throw new IOException(e);
>   }
> }
>
> As you can see, the "key" variable is ignored and never returned. The
> problem is that in the Nutch crawldb Sequence File the KEY is the URL, and I
> need it to be displayed in the Hive table along with the fields of
> CrawlDatum. But when writing the custom SerDe, I only see the CrawlDatum
> that comes after the key on each record... which is not sufficient.
>
> One hack could be to write a CustomSequenceFileRecordReader.java that
> returns the offset in the sequence file as the KEY, and an aggregation of
> the (Key+Value) as the VALUE.
> For that, perhaps I need to hack the code below from
> SequenceFileRecordReader, which will get really messy:
>
> protected synchronized boolean next(K key) throws IOException {
>   if (!more) return false;
>   long pos = in.getPosition();
>   boolean remaining = (in.next(key) != null);
>   if (pos >= end && in.syncSeen()) {
>     more = false;
>   } else {
>     more = remaining;
>   }
>   return more;
> }
>
> This would require me to write a CustomSequenceFileRecordReader and a
> CustomSequenceFileInputFormat, plus some custom SerDe, and probably make
> several other changes as well. Is it possible to get away with just writing
> a custom SerDe and some pre-existing reader that includes the key when
> invoking SerDe.deserialize()? Unless I'm missing something, why does Hive
> have this limitation when accessing Sequence files? I would imagine that the
> key of a sequence file record is just as important as the value... so why is
> it left out by the FetchOperator::getNextRow() method?
>
> If this is the unfortunate reality with reading sequence files in Hive, is
> there another Hive storage format I should use that works around this
> limitation? Something like "create external table ..... STORED AS
> CUSTOM_SEQUENCEFILE"? Or, let's say I write my own
> CustomHiveSequenceFileInputFormat, how do I register it with Hive and use it
> in the Hive "STORED AS" definition?
>
> Any help or pointers would be greatly appreciated. I hope I'm mistaken about
> the limitation above, and if not, hopefully there is an easy way to resolve
> this through a custom SerDe alone.
>
> Warm regards,
> Safdar
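To the question at the end about registering a custom input format: once the jar is on Hive's classpath (e.g. via ADD JAR), you can name the class directly in the table definition instead of STORED AS SEQUENCEFILE. A rough, untested sketch, reusing the hypothetical MyKeyValueInputFormat from above (in practice the class names must be fully qualified):

ADD JAR /path/to/my-inputformat.jar;

CREATE EXTERNAL TABLE crawldb
(url STRING, status STRING, fetchtime BIGINT, fetchinterval BIGINT,
 modifiedtime BIGINT, retries INT, score FLOAT, metadata MAP<STRING,STRING>)
ROW FORMAT SERDE 'NutchCrawlDBSequenceFileSerDe'
STORED AS INPUTFORMAT 'MyKeyValueInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat'
LOCATION '/user/training/deepcrawl/crawldb/current';

The OUTPUTFORMAT clause is required by the syntax but largely irrelevant for a read-only external table. Also note that LOCATION should point at the directory rather than a single part-00000 file; Hive will read all the part files under it.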