Hi David,

you can implement a custom InputFormat (extending org.apache.hadoop.mapred.FileInputFormat) accompanied by a custom RecordReader (implementing org.apache.hadoop.mapred.RecordReader). The RecordReader reads your documents, and in it you decide which units to return as records (returned by the next() method). You will probably still need a SerDe that transforms each such record into Hive data types with a 1:1 mapping.
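As a rough illustration, here is a minimal sketch of that approach against the old mapred API. The class names (MyInputFormat, MyRecordReader) and the explodeDocument() helper are hypothetical placeholders, not anything from Hive or Hadoop; the point is only that next() can hand out several records per input document:

import java.io.IOException;
import java.util.LinkedList;
import java.util.Queue;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical InputFormat: each split is a whole document, and the
// RecordReader explodes that document into several logical records.
public class MyInputFormat extends FileInputFormat<LongWritable, Text> {

  @Override
  protected boolean isSplitable(FileSystem fs, Path file) {
    // Read each document as a whole so it can be parsed consistently.
    return false;
  }

  @Override
  public RecordReader<LongWritable, Text> getRecordReader(
      InputSplit split, JobConf job, Reporter reporter) throws IOException {
    return new MyRecordReader((FileSplit) split, job);
  }

  public static class MyRecordReader implements RecordReader<LongWritable, Text> {
    private final Queue<String> pending = new LinkedList<String>();
    private long pos = 0;
    private long total;

    public MyRecordReader(FileSplit split, JobConf job) throws IOException {
      Path path = split.getPath();
      FileSystem fs = path.getFileSystem(job);
      FSDataInputStream in = fs.open(path);
      try {
        byte[] raw = new byte[(int) split.getLength()];
        in.readFully(0, raw);
        // explodeDocument() is a placeholder for your own parsing logic:
        // it turns one raw document into N flat record strings.
        pending.addAll(explodeDocument(new String(raw, "UTF-8")));
      } finally {
        IOUtils.closeStream(in);
      }
      total = pending.size();
    }

    @Override
    public boolean next(LongWritable key, Text value) throws IOException {
      String record = pending.poll();
      if (record == null) {
        return false;          // this document has been fully exploded
      }
      key.set(pos++);
      value.set(record);       // this is what your SerDe's deserialize() will see
      return true;
    }

    @Override public LongWritable createKey() { return new LongWritable(); }
    @Override public Text createValue() { return new Text(); }
    @Override public long getPos() { return pos; }
    @Override public float getProgress() { return total == 0 ? 1.0f : pos / (float) total; }
    @Override public void close() { }

    // Placeholder: split one document into flat records, one per output row.
    private static java.util.List<String> explodeDocument(String doc) {
      return java.util.Arrays.asList(doc.split("\n"));
    }
  }
}

Your SerDe then only has to map each Text record produced above onto the table's columns, one row per record.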
In this way your data is duplicated only while the query runs (and possibly in the results), so you avoid JOIN operations while the raw files contain no duplicate data. Something like this:

CREATE EXTERNAL TABLE IF NOT EXISTS MyTable (
  myfield1 STRING,
  myfield2 INT)
PARTITIONED BY (your_partition_if_applicable STRING)
ROW FORMAT SERDE 'quigley.david.myserde'
STORED AS
  INPUTFORMAT 'quigley.david.myinputformat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 'mylocation';

Hope this helps.

Br,
Petter

2014-04-02 5:45 GMT+02:00 David Quigley <dquigle...@gmail.com>:

> We are currently streaming complex documents to hdfs with the hope of
> being able to query. Each single document logically breaks down into a set
> of individual records. In order to use Hive, we preprocess each input
> document into a set of discrete records, which we save on HDFS and create
> an external table on top of.
>
> This approach works, but we end up duplicating a lot of data in the
> records. It would be much more efficient to deserialize the document into a
> set of records when a query is made. That way, we can just save the raw
> documents on HDFS.
>
> I have looked into writing a custom SerDe.
>
> Object<http://java.sun.com/javase/6/docs/api/java/lang/Object.html?is-external=true>
> *deserialize*(org.apache.hadoop.io.Writable blob)
>
> It looks like the input record => deserialized record still needs to be a
> 1:1 relationship. Is there any way to deserialize a record into multiple
> records?
>
> Thanks,
> Dave
>