Hive can handle a sequence file just like a text file; it simply omits the key 
completely and only uses the value part. Other than that you won't notice any 
difference between a sequence file and a plain text file.
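To illustrate, here is a minimal, untested sketch of what the writing side could 
look like with the newer mapreduce API (the table, column names and tab delimiter 
below are made up for the example):

// Matching Hive table (hypothetical):
//   CREATE TABLE visits (user_id STRING, page STRING)
//   ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
//   STORED AS SEQUENCEFILE;
import java.io.IOException;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class VisitsReducer extends Reducer<Text, Text, NullWritable, Text> {
    private final Text row = new Text();

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context ctx)
            throws IOException, InterruptedException {
        for (Text v : values) {
            // Hive only ever sees the value, so anything you need from the key
            // has to be folded into the value here.
            row.set(key + "\t" + v);
            ctx.write(NullWritable.get(), row);
        }
    }
}

The driver would set job.setOutputFormatClass(SequenceFileOutputFormat.class); Hive 
then reads the values exactly as it would read lines of a delimited text file.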

From: David Kulp [mailto:dk...@fiksu.com]
Sent: Thursday, April 19, 2012 2:13 PM
To: user@hive.apache.org
Subject: Re: using the key from a SequenceFile

I'm trying to achieve something very similar.  I want to write an MR program 
that writes results in a record-based SequenceFile that would be directly 
readable from Hive as though it were created using "STORED AS SEQUENCEFILE" 
with, say, BinarySortableSerDe.

From this discussion it seems that Hive does not / cannot take advantage of 
the key/values in a SequenceFile, but rather it requires a value that is 
serialized using a SerDe.  Is that right?

If so, does that mean that the right approach is to use the 
BinarySortableSerDe to pass the collector a row's worth of data as the Writable 
value?  And would Hive "just work" on such data?
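To make the question concrete, here is the kind of thing I have in mind; this is a 
rough, untested sketch, and the column names/types, the "columns"/"columns.types" 
properties and the ObjectInspector setup are my assumptions about how the SerDe 
wants to be initialized:

import java.util.Arrays;
import java.util.Properties;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hive.serde2.binarysortable.BinarySortableSerDe;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.io.Writable;

public class RowSerializer {
    public static void main(String[] args) throws Exception {
        // Describe the row layout; these property names mirror what Hive stores
        // in the table's SerDe properties (column names/types are made up).
        Properties tbl = new Properties();
        tbl.setProperty("columns", "id,name");
        tbl.setProperty("columns.types", "int,string");

        BinarySortableSerDe serde = new BinarySortableSerDe();
        serde.initialize(new Configuration(), tbl);

        // An ObjectInspector matching the declared layout.
        StructObjectInspector rowOI = ObjectInspectorFactory.getStandardStructObjectInspector(
                Arrays.asList("id", "name"),
                Arrays.<ObjectInspector>asList(
                        PrimitiveObjectInspectorFactory.javaIntObjectInspector,
                        PrimitiveObjectInspectorFactory.javaStringObjectInspector));

        // The Writable returned here is what the MR job would hand to the collector
        // as the value; the key could be anything (e.g. NullWritable) since Hive ignores it.
        Writable value = serde.serialize(Arrays.<Object>asList(42, "foo"), rowOI);
        System.out.println(value);
    }
}

The matching table would presumably declare ROW FORMAT SERDE 
'org.apache.hadoop.hive.serde2.binarysortable.BinarySortableSerDe' and STORED AS 
SEQUENCEFILE, but I haven't verified this end to end.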

If SequenceFileOutputFormat is used, will it automatically place sync markers 
in the file to allow for file splitting?

Thanks!


(ps. As an aside, Avro would be better.  Wouldn't it be a huge win for 
MapReduce to have an AvroOutputFileFormat and for Hive to have a SerDe that 
reads such files?  It seems like there's a natural correspondence between the 
richer data representations of an SQL schema and an Avro schema, and there's 
already code for working with Avro in MR as input.)



On Apr 19, 2012, at 6:15 AM, madhu phatak wrote:


A SerDe will allow you to create custom data from your sequence file: 
https://cwiki.apache.org/confluence/display/Hive/SerDe
On Thu, Apr 19, 2012 at 3:37 PM, Ruben de Vries <ruben.devr...@hyves.nl> wrote:
I'm trying to migrate a part of our current Hadoop jobs from plain MapReduce 
jobs to Hive.
Previously the data was stored in sequence files with the keys containing 
valuable data!
However, if I load the data into a table I lose that key data (or at least I 
can't access it with Hive); I want to somehow use the key from the sequence 
file in Hive.

I know this has come up before, since I can find some hints of people needing it, 
but I can't seem to find a working solution, and since I'm not very good with 
Java I really can't get it done myself :(.
Does anyone have a snippet of something like this working?

I get errors like:
../hive/mapred/CustomSeqRecordReader.java:14: cannot find symbol
    [javac] symbol  : constructor SequenceFileRecordReader()
    [javac] location: class 
org.apache.hadoop.mapred.SequenceFileRecordReader<K,V>
    [javac] public class CustomSeqRecordReader<K, V> extends 
SequenceFileRecordReader<K, V> implements RecordReader<K, V> {
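For reference, that compile error comes from the missing super(...) call: 
org.apache.hadoop.mapred.SequenceFileRecordReader has no no-argument constructor, 
only SequenceFileRecordReader(Configuration, FileSplit), so a subclass constructor 
has to call super(conf, split). A rough, untested sketch of one way around it is to 
wrap the reader instead of extending it and fold the key into the value so Hive can 
see it (the Text key/value types below are an assumption about the data):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.SequenceFileRecordReader;

public class KeyAsColumnRecordReader implements RecordReader<Text, Text> {
    private final SequenceFileRecordReader<Text, Text> reader;
    private final Text innerKey = new Text();
    private final Text innerValue = new Text();

    // SequenceFileRecordReader only offers a (Configuration, FileSplit) constructor,
    // which is why a no-arg subclass constructor fails to compile.
    public KeyAsColumnRecordReader(Configuration conf, FileSplit split) throws IOException {
        this.reader = new SequenceFileRecordReader<Text, Text>(conf, split);
    }

    public boolean next(Text key, Text value) throws IOException {
        if (!reader.next(innerKey, innerValue)) {
            return false;
        }
        key.set(innerKey);
        // Prepend the SequenceFile key so it shows up as the first column in Hive.
        value.set(innerKey + "\t" + innerValue);
        return true;
    }

    public Text createKey() { return new Text(); }
    public Text createValue() { return new Text(); }
    public long getPos() throws IOException { return reader.getPos(); }
    public float getProgress() throws IOException { return reader.getProgress(); }
    public void close() throws IOException { reader.close(); }
}

This still needs a small FileInputFormat subclass whose getRecordReader returns this 
reader, and the Hive table would then point at that class via the STORED AS 
INPUTFORMAT ... OUTPUTFORMAT ... clause; someone more familiar with the Hive 
internals may know a cleaner route.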


Hope someone has a snippet or can help me out; I would really love to be able to 
switch part of our jobs to Hive.


Ruben de Vries



--
https://github.com/zinnia-phatak-dev/Nectar
