AWESOME! This is exactly what we were looking for. Sorry, I was looking in the wrong spot!
On Tue, Oct 16, 2012 at 11:09 AM, shrikanth shankar <sshan...@qubole.com> wrote:

> I think what you need is a custom InputFormat / RecordReader. By the time
> the SerDe is called, the row has already been fetched. I believe the record
> reader can get access to predicates. The code that accesses HBase from Hive
> needs this for the same reasons you would with Mongo, and might be a good
> place to start.
>
> thanks,
> Shrikanth
>
> On Oct 16, 2012, at 8:54 AM, John Omernik wrote:
>
> The reason I am asking (and maybe YC reads this list and can chime in) is
> that he has written a connector for MongoDB. It's simple: basically it
> connects to a MongoDB instance, maps columns (primitives only) to MongoDB
> fields, and allows you to select out of Mongo. Pretty sweet actually, and
> with Mongo, things are really fast for small tables.
>
> That being said, I noticed that his connector basically gets all rows from
> a MongoDB collection every time it's run. We wanted to see if we could
> extend it to do some simple MongoDB-level filtering based on the passed
> query. Basically, have a fail-open approach: if it saw something it thought
> it could optimize in the MongoDB query to limit the data, it would;
> otherwise, it would default to the original approach of getting all the
> data.
>
> For example:
>
> select * from mongo_table where name rlike 'Bobby\\sWhite'
>
> Current method: the connector does db.collection.find(), gets all the
> documents from MongoDB, and then Hive does the regex.
>
> The thing we want to try: "Oh, one of our defined Mongo columns has an
> rlike; OK, send this instead: db.collection.find({"name": /Bobby\sWhite/})"
> so less data would need to be transferred. Yes, Hive would still run the
> rlike on the data... *shrug*, at least it's running it on far less data.
> Basically, if we could determine shortcuts, we could use them.
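[The fail-open translation John describes could be sketched roughly as below. This is an illustrative Python sketch, not Hive or connector code; the column set, the `rlike`-only predicate format, and the helper names are all assumptions. The idea is just: translate a predicate you recognize into a MongoDB filter, and return an empty filter (fetch everything) for anything you don't.]

```python
import re

# Columns we know are mapped to MongoDB fields (illustrative assumption).
MAPPED_COLUMNS = {"name", "city"}

# Recognize only the simplest shape: <col> rlike '<pattern>'.
RLIKE_RE = re.compile(r"^\s*(\w+)\s+rlike\s+'(.*)'\s*$", re.IGNORECASE)

def to_mongo_filter(predicate):
    """Return a MongoDB filter document for a recognizable predicate,
    or {} (match everything) when we cannot safely translate it."""
    m = RLIKE_RE.match(predicate or "")
    if m and m.group(1) in MAPPED_COLUMNS:
        # Hive's literal carries escaped backslashes (e.g. 'Bobby\\sWhite');
        # MongoDB's $regex wants the raw regex.
        pattern = m.group(2).replace("\\\\", "\\")
        return {m.group(1): {"$regex": pattern}}
    return {}  # fail open: fetch all documents and let Hive do the filtering
```

Anything the translator does not understand degrades gracefully to the connector's current full-collection scan, so correctness never depends on the optimization firing.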
> Just trying to understand SerDes and how we are completely not using them
> as intended :)
>
> On Tue, Oct 16, 2012 at 10:42 AM, Connell, Chuck <chuck.conn...@nuance.com> wrote:
>
>> A SerDe is actually used the other way around: Hive parses the query,
>> writes MapReduce code to solve the query, and the generated code uses the
>> SerDe for field access.
>>
>> The standard way to write a SerDe is to start from the trunk regex SerDe,
>> then modify as needed:
>>
>> http://svn.apache.org/viewvc/hive/trunk/contrib/src/java/org/apache/hadoop/hive/contrib/serde2/RegexSerDe.java?revision=1131106&view=markup
>>
>> Also, a nice article by Roberto Congiu:
>>
>> http://www.congiu.com/a-json-readwrite-serde-for-hive/
>>
>> Chuck Connell
>> Nuance R&D Data Team
>> Burlington, MA
>>
>> *From:* John Omernik [mailto:j...@omernik.com]
>> *Sent:* Tuesday, October 16, 2012 11:30 AM
>> *To:* user@hive.apache.org
>> *Subject:* Writing Custom Serdes for Hive
>>
>> We have a maybe obvious question about a SerDe. When a SerDe is invoked,
>> does it have access to the original Hive query? Ideally the original query
>> could provide the SerDe some hints on how to access the data on the
>> backend.
>>
>> Also, are there any good links/documentation on how to write SerDes?
>> Kinda hard to Google for some reason.
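[For readers following along: the regex-SerDe pattern Chuck points at can be sketched conceptually as below. This is an illustrative Python sketch of what a regex-driven `deserialize()` does, not Hive's Java code; the regex, column names, and sample row are made-up examples. Hive's contrib RegexSerDe is configured with an `input.regex` whose capturing groups map positionally to the table's columns.]

```python
import re

# Configured pattern: one capturing group per declared column (example only).
INPUT_REGEX = re.compile(r"(\S+) (\S+) (\d+)")
COLUMNS = ["host", "method", "status"]

def deserialize(raw_row):
    """Map one raw text row to column-name -> string-value.
    Rows that do not match the pattern yield all-NULL columns."""
    m = INPUT_REGEX.match(raw_row)
    if m is None:
        return {c: None for c in COLUMNS}
    return dict(zip(COLUMNS, m.groups()))
```

This also makes Chuck's direction-of-use point concrete: the SerDe only sees one raw record at a time and turns it into fields for the generated MapReduce code; it never sees the query, which is why the predicate-aware work belongs in the InputFormat/RecordReader instead.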