Good stuff! I am glad that I could help.
Br,
Petter

2014-04-04 6:02 GMT+02:00 David Quigley <dquigle...@gmail.com>:

> Thanks again, Petter. The custom input format was exactly what I needed.
>
> Here is an example of my code, in case anyone is interested:
> https://github.com/quicklyNotQuigley/nest
>
> Basically, it gives you SQL access to arbitrary JSON data. I know there are
> solutions for dealing with JSON data in Hive fields, but nothing I saw
> actually decomposes nested JSON into a set of discrete records. It's super
> useful for us.
>
>
> On Wed, Apr 2, 2014 at 2:15 AM, Petter von Dolwitz (Hem) <
> petter.von.dolw...@gmail.com> wrote:
>
>> Hi David,
>>
>> You can implement a custom InputFormat (extending
>> org.apache.hadoop.mapred.FileInputFormat) accompanied by a custom
>> RecordReader (implementing org.apache.hadoop.mapred.RecordReader). The
>> RecordReader will be used to read your documents, and from there you can
>> decide which units to return as records (returned by the next() method).
>> You'll still probably need a SerDe that transforms your data into Hive
>> data types using a 1:1 mapping.
>>
>> This way, data is only duplicated while your query runs (and possibly in
>> the results), so you avoid JOIN operations while the raw files contain no
>> duplicate data.
>>
>> Something like this:
>>
>> CREATE EXTERNAL TABLE IF NOT EXISTS MyTable (
>>   myfield1 STRING,
>>   myfield2 INT)
>> PARTITIONED BY (your_partition_if_applicable STRING)
>> ROW FORMAT SERDE 'quigley.david.myserde'
>> STORED AS
>>   INPUTFORMAT 'quigley.david.myinputformat'
>>   OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
>> LOCATION 'mylocation';
>>
>> Hope this helps.
>>
>> Br,
>> Petter
>>
>>
>> 2014-04-02 5:45 GMT+02:00 David Quigley <dquigle...@gmail.com>:
>>
>>> We are currently streaming complex documents to HDFS with the hope of
>>> being able to query them. Each single document logically breaks down
>>> into a set of individual records. In order to use Hive, we preprocess
>>> each input document into a set of discrete records, which we save on
>>> HDFS and create an external table on top of.
>>>
>>> This approach works, but we end up duplicating a lot of data in the
>>> records. It would be much more efficient to deserialize the document
>>> into a set of records when a query is made. That way, we can just save
>>> the raw documents on HDFS.
>>>
>>> I have looked into writing a custom SerDe:
>>>
>>> Object deserialize(org.apache.hadoop.io.Writable blob)
>>>
>>> It looks like the input record => deserialized record still needs to be
>>> a 1:1 relationship. Is there any way to deserialize a record into
>>> multiple records?
>>>
>>> Thanks,
>>> Dave
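
For anyone finding this thread later, below is a minimal sketch in the spirit of Petter's suggestion (not taken from David's repo) of an InputFormat/RecordReader pair that splits one document into many records. It assumes the old org.apache.hadoop.mapred API, Jackson 2.x on the classpath, and that each input file is a single top-level JSON array whose elements become the records; the class names (MyJsonInputFormat, JsonDocRecordReader) are illustrative placeholders.

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class MyJsonInputFormat extends FileInputFormat<LongWritable, Text> {

  @Override
  protected boolean isSplitable(FileSystem fs, Path file) {
    // Each document must be parsed as a whole, so never split a file.
    return false;
  }

  @Override
  public RecordReader<LongWritable, Text> getRecordReader(
      InputSplit split, JobConf job, Reporter reporter) throws IOException {
    return new JsonDocRecordReader((FileSplit) split, job);
  }

  /** Parses one JSON document and returns every array element as its own record. */
  static class JsonDocRecordReader implements RecordReader<LongWritable, Text> {

    private final Iterator<JsonNode> elements;
    private final int total;
    private int emitted = 0;

    JsonDocRecordReader(FileSplit split, JobConf job) throws IOException {
      FileSystem fs = split.getPath().getFileSystem(job);
      try (FSDataInputStream in = fs.open(split.getPath())) {
        JsonNode root = new ObjectMapper().readTree(in);  // whole document in memory
        elements = root.elements();                       // assumes a top-level array
        total = root.size();
      }
    }

    @Override
    public boolean next(LongWritable key, Text value) {
      if (!elements.hasNext()) {
        return false;                                     // document exhausted
      }
      key.set(emitted++);
      value.set(elements.next().toString());              // one nested object per record
      return true;
    }

    @Override public LongWritable createKey() { return new LongWritable(); }
    @Override public Text createValue() { return new Text(); }
    @Override public long getPos() { return emitted; }
    @Override public float getProgress() { return total == 0 ? 1f : (float) emitted / total; }
    @Override public void close() { /* stream closed in constructor */ }
  }
}

Because the document is expanded into records inside the RecordReader, the duplication only exists in memory while the query runs, which is exactly the trade-off Petter describes above.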
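
And a companion sketch of the 1:1 SerDe Petter mentions, again illustrative only and written against the Hive 0.12/0.13-era AbstractSerDe API: it assumes each Writable handed to deserialize() is already a single JSON object (as produced by the RecordReader above) and that the table only uses STRING and INT columns, as in the CREATE EXTERNAL TABLE example. The fully qualified class name is whatever you register in ROW FORMAT SERDE.

import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hive.serde.serdeConstants;
import org.apache.hadoop.hive.serde2.AbstractSerDe;
import org.apache.hadoop.hive.serde2.SerDeException;
import org.apache.hadoop.hive.serde2.SerDeStats;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.typeinfo.TypeInfo;
import org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class MyJsonSerDe extends AbstractSerDe {

  private final ObjectMapper mapper = new ObjectMapper();
  private List<String> columnNames;
  private List<TypeInfo> columnTypes;
  private ObjectInspector inspector;

  @Override
  public void initialize(Configuration conf, Properties tbl) throws SerDeException {
    // Column names and types come straight from the table definition.
    columnNames = Arrays.asList(tbl.getProperty(serdeConstants.LIST_COLUMNS).split(","));
    columnTypes = TypeInfoUtils.getTypeInfosFromTypeString(
        tbl.getProperty(serdeConstants.LIST_COLUMN_TYPES));

    List<ObjectInspector> fieldInspectors = new ArrayList<ObjectInspector>();
    for (TypeInfo type : columnTypes) {
      fieldInspectors.add(TypeInfoUtils.getStandardJavaObjectInspectorFromTypeInfo(type));
    }
    inspector = ObjectInspectorFactory.getStandardStructObjectInspector(columnNames, fieldInspectors);
  }

  @Override
  public Object deserialize(Writable blob) throws SerDeException {
    // The RecordReader already emits one JSON object per record, so the mapping stays 1:1.
    try {
      JsonNode record = mapper.readTree(blob.toString());
      List<Object> row = new ArrayList<Object>(columnNames.size());
      for (int i = 0; i < columnNames.size(); i++) {
        JsonNode field = record.get(columnNames.get(i));
        if (field == null || field.isNull()) {
          row.add(null);
        } else if ("int".equals(columnTypes.get(i).getTypeName())) {
          row.add(field.asInt());
        } else {
          row.add(field.asText());  // everything else treated as STRING for brevity
        }
      }
      return row;
    } catch (IOException e) {
      throw new SerDeException(e);
    }
  }

  @Override public ObjectInspector getObjectInspector() { return inspector; }
  @Override public Class<? extends Writable> getSerializedClass() { return Text.class; }
  @Override public SerDeStats getSerDeStats() { return null; }

  @Override
  public Writable serialize(Object obj, ObjectInspector oi) throws SerDeException {
    throw new SerDeException("Read-only SerDe: the table is external");
  }
}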