Good stuff! I am glad that I could help.
Br,
Petter

2014-04-04 6:02 GMT+02:00 David Quigley <dquigle...@gmail.com>:

> Thanks again, Petter. The custom input format was exactly what I needed.
>
> Here is an example of my code, in case anyone is interested:
> https://github.com/quicklyNotQuigley/nest
>
> Basically, it gives you SQL access to arbitrary JSON data. I know there are
> solutions for dealing with JSON data in Hive fields, but nothing I saw
> actually decomposes nested JSON into a set of discrete records. It's super
> useful for us.
>
>
> On Wed, Apr 2, 2014 at 2:15 AM, Petter von Dolwitz (Hem) <
> petter.von.dolw...@gmail.com> wrote:
>
>> Hi David,
>>
>> You can implement a custom InputFormat (extending
>> org.apache.hadoop.mapred.FileInputFormat) accompanied by a custom
>> RecordReader (implementing org.apache.hadoop.mapred.RecordReader). The
>> RecordReader will be used to read your documents, and from there you can
>> decide which units to return as records (returned by the next() method).
>> You'll still probably need a SerDe that transforms your data into Hive
>> data types using a 1:1 mapping.
>>
>> This way, data is only duplicated while your query runs (and possibly in
>> the results), so you avoid JOIN operations while the raw files contain no
>> duplicate data.
>>
>> Something like this:
>>
>> CREATE EXTERNAL TABLE IF NOT EXISTS MyTable (
>>   myfield1 STRING,
>>   myfield2 INT)
>> PARTITIONED BY (your_partition_if_applicable STRING)
>> ROW FORMAT SERDE 'quigley.david.myserde'
>> STORED AS
>>   INPUTFORMAT 'quigley.david.myinputformat'
>>   OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
>> LOCATION 'mylocation';
>>
>> Hope this helps.
>>
>> Br,
>> Petter
>>
>>
>> 2014-04-02 5:45 GMT+02:00 David Quigley <dquigle...@gmail.com>:
>>
>>> We are currently streaming complex documents to HDFS with the hope of
>>> being able to query them. Each single document logically breaks down
>>> into a set of individual records. In order to use Hive, we preprocess
>>> each input document into a set of discrete records, which we save on
>>> HDFS and create an external table on top of.
>>>
>>> This approach works, but we end up duplicating a lot of data in the
>>> records. It would be much more efficient to deserialize the document
>>> into a set of records when a query is made. That way, we can just save
>>> the raw documents on HDFS.
>>>
>>> I have looked into writing a custom SerDe:
>>>
>>> Object deserialize(org.apache.hadoop.io.Writable blob)
>>>
>>> It looks like the input record => deserialized record still needs to be
>>> a 1:1 relationship. Is there any way to deserialize a record into
>>> multiple records?
>>>
>>> Thanks,
>>> Dave
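
For anyone finding this thread later, below is a minimal sketch in the spirit of Petter's suggestion (not taken from David's repo) of an InputFormat/RecordReader pair that splits one document into many records. It assumes the old org.apache.hadoop.mapred API, Jackson 2.x on the classpath, and that each input file is a single top-level JSON array whose elements become the records; the class names (MyJsonInputFormat, JsonDocRecordReader) are illustrative placeholders.

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class MyJsonInputFormat extends FileInputFormat<LongWritable, Text> {

  @Override
  protected boolean isSplitable(FileSystem fs, Path file) {
    // Each document must be parsed as a whole, so never split a file.
    return false;
  }

  @Override
  public RecordReader<LongWritable, Text> getRecordReader(
      InputSplit split, JobConf job, Reporter reporter) throws IOException {
    return new JsonDocRecordReader((FileSplit) split, job);
  }

  /** Parses one JSON document and returns every array element as its own record. */
  static class JsonDocRecordReader implements RecordReader<LongWritable, Text> {

    private final Iterator<JsonNode> elements;
    private final int total;
    private int emitted = 0;

    JsonDocRecordReader(FileSplit split, JobConf job) throws IOException {
      FileSystem fs = split.getPath().getFileSystem(job);
      try (FSDataInputStream in = fs.open(split.getPath())) {
        JsonNode root = new ObjectMapper().readTree(in);  // whole document in memory
        elements = root.elements();                       // assumes a top-level array
        total = root.size();
      }
    }

    @Override
    public boolean next(LongWritable key, Text value) {
      if (!elements.hasNext()) {
        return false;                                     // document exhausted
      }
      key.set(emitted++);
      value.set(elements.next().toString());              // one nested object per record
      return true;
    }

    @Override public LongWritable createKey() { return new LongWritable(); }
    @Override public Text createValue() { return new Text(); }
    @Override public long getPos() { return emitted; }
    @Override public float getProgress() { return total == 0 ? 1f : (float) emitted / total; }
    @Override public void close() { /* stream closed in constructor */ }
  }
}

Because the document is expanded into records inside the RecordReader, the duplication only exists in memory while the query runs, which is exactly the trade-off Petter describes above.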
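
And a companion sketch of the 1:1 SerDe Petter mentions, again illustrative only and written against the Hive 0.12/0.13-era AbstractSerDe API: it assumes each Writable handed to deserialize() is already a single JSON object (as produced by the RecordReader above) and that the table only uses STRING and INT columns, as in the CREATE EXTERNAL TABLE example. The fully qualified class name is whatever you register in ROW FORMAT SERDE.

import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hive.serde.serdeConstants;
import org.apache.hadoop.hive.serde2.AbstractSerDe;
import org.apache.hadoop.hive.serde2.SerDeException;
import org.apache.hadoop.hive.serde2.SerDeStats;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.typeinfo.TypeInfo;
import org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class MyJsonSerDe extends AbstractSerDe {

  private final ObjectMapper mapper = new ObjectMapper();
  private List<String> columnNames;
  private List<TypeInfo> columnTypes;
  private ObjectInspector inspector;

  @Override
  public void initialize(Configuration conf, Properties tbl) throws SerDeException {
    // Column names and types come straight from the table definition.
    columnNames = Arrays.asList(tbl.getProperty(serdeConstants.LIST_COLUMNS).split(","));
    columnTypes = TypeInfoUtils.getTypeInfosFromTypeString(
        tbl.getProperty(serdeConstants.LIST_COLUMN_TYPES));

    List<ObjectInspector> fieldInspectors = new ArrayList<ObjectInspector>();
    for (TypeInfo type : columnTypes) {
      fieldInspectors.add(TypeInfoUtils.getStandardJavaObjectInspectorFromTypeInfo(type));
    }
    inspector = ObjectInspectorFactory.getStandardStructObjectInspector(columnNames, fieldInspectors);
  }

  @Override
  public Object deserialize(Writable blob) throws SerDeException {
    // The RecordReader already emits one JSON object per record, so the mapping stays 1:1.
    try {
      JsonNode record = mapper.readTree(blob.toString());
      List<Object> row = new ArrayList<Object>(columnNames.size());
      for (int i = 0; i < columnNames.size(); i++) {
        JsonNode field = record.get(columnNames.get(i));
        if (field == null || field.isNull()) {
          row.add(null);
        } else if ("int".equals(columnTypes.get(i).getTypeName())) {
          row.add(field.asInt());
        } else {
          row.add(field.asText());  // everything else treated as STRING for brevity
        }
      }
      return row;
    } catch (IOException e) {
      throw new SerDeException(e);
    }
  }

  @Override public ObjectInspector getObjectInspector() { return inspector; }
  @Override public Class<? extends Writable> getSerializedClass() { return Text.class; }
  @Override public SerDeStats getSerDeStats() { return null; }

  @Override
  public Writable serialize(Object obj, ObjectInspector oi) throws SerDeException {
    throw new SerDeException("Read-only SerDe: the table is external");
  }
}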