To answer my own question -- so that someone else may benefit some day -- I've 
found that there is nothing special about key or value formats in a 
SequenceFile.  As has been noted, keys are ignored.  Each new key/value pair is 
seen as a new row from Hive's perspective.  There's no concept of using 
Writables, such as ArrayWritable, to create nested structures in a value field 
that are automatically parsed by Hive.  There are no record delimiters known 
to SequenceFile; there's just an ignored key and a value that is an opaque 
byte stream.

Thus, the simplest approach is just to use Hive's default LazySimpleSerDe 
format to create a multi-column row in an MR program whose output will be 
read by Hive.  For example, your MR program would set the output format to 
SequenceFile with Text values.

conf.setOutputFormat(SequenceFileOutputFormat.class); 
conf.setOutputValueClass(Text.class);

The reducer (or mapper, if there is no reducer) would send values to the 
collector with Control-A (\001) delimiters between column values.  There is 
no special formatting for numbers or other types in this approach.  For 
example,

output.collect(dummy, new Text(col1 + "\001" + col2));
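
For concreteness, a reducer along these lines might look something like the 
minimal sketch below (old mapred API).  The class name and the two-column 
key/count row layout are just illustrative assumptions, and the job would 
also set conf.setOutputKeyClass(NullWritable.class):

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Emits one Hive row per group: column values joined by Control-A (\001).
// The output key is NullWritable because Hive ignores SequenceFile keys.
public class RowReducer extends MapReduceBase
    implements Reducer<Text, Text, NullWritable, Text> {

  public void reduce(Text key, Iterator<Text> values,
                     OutputCollector<NullWritable, Text> output,
                     Reporter reporter) throws IOException {
    long count = 0;
    while (values.hasNext()) {
      values.next();
      count++;
    }
    // Two-column row: the grouping key and a count.
    output.collect(NullWritable.get(),
                   new Text(key.toString() + "\001" + count));
  }
}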

In Hive, create your table with "STORED AS SEQUENCEFILE" and you should be 
golden.
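
For example, a matching table definition might look something like this 
(table and column names are placeholders; FIELDS TERMINATED BY '\001' is 
Hive's default, spelled out here only to make the link to the MR output 
explicit):

CREATE TABLE my_table (col1 STRING, col2 BIGINT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001'
STORED AS SEQUENCEFILE;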

You can presumably use one of the alternative serializers in your MR program, 
but I haven't tried it yet.

-d

On Apr 19, 2012, at 8:52 AM, David Kulp wrote:

> But I'm not clear on how to write a single row of multiple values in my MR 
> program, since my only way to output data is to send values to the collector. 
>  Are you saying that there's no row delimiter and I simply make repeated 
> calls to the collector, e.g.
> 
> output.collect(null, row1col1)
> output.collect(null, row1col2)
> ...
> output.collect(null, row2col1)
> output.collect(null, row2col2)
> 
> If that's the case, then there's no explicit row boundary in the data, which 
> also implies that there's no reliable way to split such a file later when 
> Hive does an MR.
> 
> Or is it along the lines of the following?
> 
> ArrayList<Object> row = new ArrayList<Object>();
> row.add(row1col1);
> row.add(row1col2);
> output.collect(null, row);
> 
> 
> Thanks in advance!
> 
> 
> 
> On Apr 19, 2012, at 8:21 AM, Ruben de Vries wrote:
> 
>> Hive can handle a sequence file just like a text file; it simply omits the 
>> key completely and uses only the value part.  Other than that, you won't 
>> notice the difference between a sequence file and a plain text file.
>>  
>> From: David Kulp [mailto:dk...@fiksu.com] 
>> Sent: Thursday, April 19, 2012 2:13 PM
>> To: user@hive.apache.org
>> Subject: Re: using the key from a SequenceFile
>>  
>> I'm trying to achieve something very similar.  I want to write an MR program 
>> that writes results in a record-based sequencefile that would be directly 
>> readable from Hive as though it were created using "STORED AS SEQUENCEFILE" 
>> with, say, BinarySortableSerDe.
>>  
>> From this discussion it seems that Hive does not / cannot take advantage of 
>> the key/values in a sequencefile, but rather it requires a value that is 
>> serialized using a SerDe.  Is that right?
>>  
>> If so, does that mean that the right approach is to use the 
>> BinarySortableSerDe to pass the collector a row's worth of data as the 
>> Writable value?  And would Hive "just work" on such data?
>>  
>> If SequenceFileOutputFormat is used, will it automatically place sync 
>> markers in the file to allow for file splitting?
>>  
>> Thanks!
>>  
>>  
>> (ps. As an aside, Avro would be better.  Wouldn't it be a huge win for 
>> MapReduce to have an AvroOutputFileFormat and for Hive to have a SerDe that 
>> reads such files?  It seems like there's a natural correspondence between the 
>> richer data representations of an SQL schema and an Avro schema, and there's 
>> already code for working with Avro in MR as input.) 
>>  
>>  
>>  
>> On Apr 19, 2012, at 6:15 AM, madhu phatak wrote:
>> 
>> 
>> A SerDe will allow you to create custom data from your sequence file: 
>> https://cwiki.apache.org/confluence/display/Hive/SerDe 
>> 
>> On Thu, Apr 19, 2012 at 3:37 PM, Ruben de Vries <ruben.devr...@hyves.nl> 
>> wrote:
>> I'm trying to migrate a part of our current Hadoop jobs from normal 
>> MapReduce jobs to Hive. 
>> Previously the data was stored in sequence files with the keys containing 
>> valuable data! 
>> However, if I load the data into a table I lose that key data (or at least 
>> I can't access it with Hive); I want to somehow use the key from the 
>> sequence file in Hive.
>>  
>> I know this has come up before, since I can find some hints of people 
>> needing it, but I can't seem to find a working solution, and since I'm not 
>> very good with Java I really can't get it done myself :(.
>> Does anyone have a snippet of something like this working?
>>  
>> I get errors like:
>> ../hive/mapred/CustomSeqRecordReader.java:14: cannot find symbol
>>     [javac] symbol  : constructor SequenceFileRecordReader()
>>     [javac] location: class org.apache.hadoop.mapred.SequenceFileRecordReader<K,V>
>>     [javac] public class CustomSeqRecordReader<K, V> extends SequenceFileRecordReader<K, V> implements RecordReader<K, V> {
>>  
>>  
>> Hope someone has a snippet or can help me out; I would really love to be 
>> able to switch part of our jobs to Hive,
>>  
>>  
>> Ruben de Vries
>> 
>> 
>>  
>> -- 
>> https://github.com/zinnia-phatak-dev/Nectar
>> 
> 
