To answer my own question -- so that someone else may benefit some day -- I've found that there is nothing special about key or value formats in a SequenceFile. As has been noted, keys are ignored, and Hive treats each key/value pair as a new row. There is no concept of using Writables, such as ArrayWritable, to create nested structures in a value field that Hive parses automatically, and SequenceFile has no notion of record delimiters. There is just an ignored key and a value that is simply a byte stream.

Thus, the simplest approach is to use the Lazy SerDe format (Hive's default LazySimpleSerDe) to create a multi-column row in an MR program that will be read by Hive. For example, your MR program would set the output format to SequenceFile with Text values:

    conf.setOutputFormat(SequenceFileOutputFormat.class);
    conf.setOutputValueClass(Text.class);

The reducer (or the mapper, if there is no reducer) then sends each row to the collector as one value, with Control-A (\001) delimiters between column values. There are no special formats for numbers, for example, in this approach. For example:

    output.collect(dummy, new Text(col1 + "\001" + col2));

In Hive, create your table with "STORED AS SEQUENCEFILE" and you should be golden. You can presumably use one of the alternative serializers in your MR program, but I haven't tried that yet.
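For concreteness, here is a minimal sketch of such a reducer using the old org.apache.hadoop.mapred API shown above. The two-column layout, the class name, and the NullWritable key are illustrative assumptions, not requirements:

    import java.io.IOException;
    import java.util.Iterator;

    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    // Emits one Hive row per record: a single Text value whose columns are
    // joined with Control-A (\001), the default LazySimpleSerDe delimiter.
    // Hive ignores the SequenceFile key, so NullWritable works fine; the
    // driver would also need conf.setOutputKeyClass(NullWritable.class).
    public class HiveRowReducer extends MapReduceBase
        implements Reducer<Text, Text, NullWritable, Text> {

      public void reduce(Text key, Iterator<Text> values,
                         OutputCollector<NullWritable, Text> output,
                         Reporter reporter) throws IOException {
        while (values.hasNext()) {
          // Hypothetical two-column row: the group key and one of its values.
          String col1 = key.toString();
          String col2 = values.next().toString();
          output.collect(NullWritable.get(), new Text(col1 + "\001" + col2));
        }
      }
    }

A matching table on the Hive side would be as simple as CREATE TABLE t (col1 STRING, col2 STRING) STORED AS SEQUENCEFILE, since \001 is already Hive's default field delimiter.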
-d

On Apr 19, 2012, at 8:52 AM, David Kulp wrote:

> But I'm not clear on how to write a single row of multiple values in my MR program, since my only way to output data is to send values to the collector. Are you saying that there's no row delimiter and I simply make repeated calls to the collector, e.g.
>
> output.collect(null, row1col1)
> output.collect(null, row1col2)
> ...
> output.collect(null, row2col1)
> output.collect(null, row2col2)
>
> If that's the case, then there's no explicit row boundary in the data, which also implies that there's no reliable way to split such a file later when Hive does an MR.
>
> Or is it along the lines of the following?
>
> ArrayList<Object> row = new ArrayList<Object>();
> row.add(row1col1);
> row.add(row1col2);
> output.collect(null, row);
>
> Thanks in advance!
>
> On Apr 19, 2012, at 8:21 AM, Ruben de Vries wrote:
>
>> Hive can handle a sequence file just like a text file, only it omits the key completely and uses only the value part; other than that, you won't notice the difference between a sequence file and a plain text file.
>>
>> From: David Kulp [mailto:dk...@fiksu.com]
>> Sent: Thursday, April 19, 2012 2:13 PM
>> To: user@hive.apache.org
>> Subject: Re: using the key from a SequenceFile
>>
>> I'm trying to achieve something very similar. I want to write an MR program that writes results in a record-based sequence file that would be directly readable from Hive as though it were created using "STORED AS SEQUENCEFILE" with, say, BinarySortableSerDe.
>>
>> From this discussion it seems that Hive does not / cannot take advantage of the key/values in a sequence file, but rather requires a value that is serialized using a SerDe. Is that right?
>>
>> If so, does that mean that the right approach is to use the BinarySortableSerDe to pass the collector a row's worth of data as the Writable value? And would Hive "just work" on such data?
>>
>> If SequenceFileOutputFormat is used, will it automatically place sync markers in the file to allow for file splitting?
>>
>> Thanks!
>>
>> (ps. As an aside, Avro would be better. Wouldn't it be a huge win for MapReduce to have an AvroOutputFileFormat and for Hive to have a SerDe that reads such files? It seems like there's a natural correspondence between the richer data representations of an SQL schema and an Avro schema, and there's already code for working with Avro in MR as input.)
>>
>> On Apr 19, 2012, at 6:15 AM, madhu phatak wrote:
>>
>> SerDe will allow you to create custom data from your sequence file:
>> https://cwiki.apache.org/confluence/display/Hive/SerDe
>>
>> On Thu, Apr 19, 2012 at 3:37 PM, Ruben de Vries <ruben.devr...@hyves.nl> wrote:
>> I'm trying to migrate a part of our current Hadoop jobs from normal MapReduce jobs to Hive. Previously the data was stored in sequence files, with the keys containing valuable data!
>> However, if I load the data into a table I lose that key data (or at least I can't access it with Hive). I want to somehow use the key from the sequence file in Hive.
>>
>> I know this has come up before, since I can find some hints of people needing it, but I can't seem to find a working solution, and since I'm not very good with Java I really can't get it done myself :(
>> Does anyone have a snippet of something like this working?
>>
>> I get errors like:
>>
>> ../hive/mapred/CustomSeqRecordReader.java:14: cannot find symbol
>> [javac] symbol  : constructor SequenceFileRecordReader()
>> [javac] location: class org.apache.hadoop.mapred.SequenceFileRecordReader<K,V>
>> [javac] public class CustomSeqRecordReader<K, V> extends SequenceFileRecordReader<K, V> implements RecordReader<K, V> {
>>
>> Hope someone has a snippet or can help me out; I would really love to be able to switch part of our jobs to Hive.
>>
>> Ruben de Vries
>>
>> --
>> https://github.com/zinnia-phatak-dev/Nectar