Hi David,

you can implement a custom InputFormat (extending org.apache.hadoop.mapred.FileInputFormat) accompanied by a custom RecordReader (implementing org.apache.hadoop.mapred.RecordReader). The RecordReader reads your documents, and in it you decide which units to return as records (returned by the next() method). You will probably still need a SerDe that transforms each such record into Hive data types with a 1:1 mapping.
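As a rough illustration, here is a minimal sketch of that approach against the old mapred API. The class names (MyInputFormat, MyRecordReader) and the explodeDocument() helper are hypothetical placeholders, not anything from Hive or Hadoop; the point is only that next() can hand out several records per input document:

import java.io.IOException;
import java.util.LinkedList;
import java.util.Queue;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical InputFormat: each split is a whole document, and the
// RecordReader explodes that document into several logical records.
public class MyInputFormat extends FileInputFormat<LongWritable, Text> {

  @Override
  protected boolean isSplitable(FileSystem fs, Path file) {
    // Read each document as a whole so it can be parsed consistently.
    return false;
  }

  @Override
  public RecordReader<LongWritable, Text> getRecordReader(
      InputSplit split, JobConf job, Reporter reporter) throws IOException {
    return new MyRecordReader((FileSplit) split, job);
  }

  public static class MyRecordReader implements RecordReader<LongWritable, Text> {
    private final Queue<String> pending = new LinkedList<String>();
    private long pos = 0;
    private long total;

    public MyRecordReader(FileSplit split, JobConf job) throws IOException {
      Path path = split.getPath();
      FileSystem fs = path.getFileSystem(job);
      FSDataInputStream in = fs.open(path);
      try {
        byte[] raw = new byte[(int) split.getLength()];
        in.readFully(0, raw);
        // explodeDocument() is a placeholder for your own parsing logic:
        // it turns one raw document into N flat record strings.
        pending.addAll(explodeDocument(new String(raw, "UTF-8")));
      } finally {
        IOUtils.closeStream(in);
      }
      total = pending.size();
    }

    @Override
    public boolean next(LongWritable key, Text value) throws IOException {
      String record = pending.poll();
      if (record == null) {
        return false;          // this document has been fully exploded
      }
      key.set(pos++);
      value.set(record);       // this is what your SerDe's deserialize() will see
      return true;
    }

    @Override public LongWritable createKey() { return new LongWritable(); }
    @Override public Text createValue() { return new Text(); }
    @Override public long getPos() { return pos; }
    @Override public float getProgress() { return total == 0 ? 1.0f : pos / (float) total; }
    @Override public void close() { }

    // Placeholder: split one document into flat records, one per output row.
    private static java.util.List<String> explodeDocument(String doc) {
      return java.util.Arrays.asList(doc.split("\n"));
    }
  }
}

Your SerDe then only has to map each Text record produced above onto the table's columns, one row per record.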
In this way your data is duplicated only while the query runs (and possibly in the results), so you avoid JOIN operations while the raw files contain no duplicate data. Something like this:

CREATE EXTERNAL TABLE IF NOT EXISTS MyTable (
  myfield1 STRING,
  myfield2 INT)
PARTITIONED BY (your_partition_if_applicable STRING)
ROW FORMAT SERDE 'quigley.david.myserde'
STORED AS
  INPUTFORMAT 'quigley.david.myinputformat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 'mylocation';

Hope this helps.

Br,
Petter

2014-04-02 5:45 GMT+02:00 David Quigley <dquigle...@gmail.com>:

> We are currently streaming complex documents to hdfs with the hope of
> being able to query. Each single document logically breaks down into a set
> of individual records. In order to use Hive, we preprocess each input
> document into a set of discrete records, which we save on HDFS and create
> an external table on top of.
>
> This approach works, but we end up duplicating a lot of data in the
> records. It would be much more efficient to deserialize the document into a
> set of records when a query is made. That way, we can just save the raw
> documents on HDFS.
>
> I have looked into writing a custom SerDe.
>
> Object<http://java.sun.com/javase/6/docs/api/java/lang/Object.html?is-external=true>
> *deserialize*(org.apache.hadoop.io.Writable blob)
>
> It looks like the input record => deserialized record still needs to be a
> 1:1 relationship. Is there any way to deserialize a record into multiple
> records?
>
> Thanks,
> Dave
>