AWESOME! This is exactly what we were looking for. Sorry, I was looking in the wrong spot!
On Tue, Oct 16, 2012 at 11:09 AM, shrikanth shankar <sshan...@qubole.com> wrote:

> I think what you need is a custom InputFormat / RecordReader. By the time
> the SerDe is called, the row has already been fetched. I believe the record
> reader can get access to predicates. The code that accesses HBase from Hive
> needs this for the same reasons you would with Mongo, and might be a good
> place to start.
>
> thanks,
> Shrikanth
>
> On Oct 16, 2012, at 8:54 AM, John Omernik wrote:
>
> The reason I am asking (and maybe YC reads this list and can chime in) is
> that he has written a connector for MongoDB. It's simple: basically it
> connects to a MongoDB instance, maps columns (primitives only) to MongoDB
> fields, and allows you to select out of Mongo. Pretty sweet actually, and
> with Mongo, things are really fast for small tables.
>
> That being said, I noticed that his connector basically gets all rows from
> a MongoDB collection every time it's run. We wanted to see if we could
> extend it to do some simple MongoDB-level filtering based on the passed
> query. Basically, have a fail-open approach: if it saw something it thought
> it could optimize in the MongoDB query to limit the data, it would;
> otherwise, it would default to the original approach of getting all the
> data.
>
> For example:
>
> select * from mongo_table where name rlike 'Bobby\\sWhite'
>
> Current method: the connector does db.collection.find(), gets all the
> documents from MongoDB, and then Hive does the regex.
>
> The thing we want to try: "Oh, one of our defined Mongo columns has an
> rlike; OK, send this instead: db.collection.find({"name": /Bobby\sWhite/})"
> so less data would need to be transferred. Yes, Hive would still run the
> rlike on the data... *shrug*, at least it's running it on far less data.
> Basically, if we could determine shortcuts, we could use them.
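[The fail-open translation John describes could be sketched roughly as below. This is an illustrative Python sketch, not Hive or connector code; the column set, the `rlike`-only predicate format, and the helper names are all assumptions. The idea is just: translate a predicate you recognize into a MongoDB filter, and return an empty filter (fetch everything) for anything you don't.]

```python
import re

# Columns we know are mapped to MongoDB fields (illustrative assumption).
MAPPED_COLUMNS = {"name", "city"}

# Recognize only the simplest shape: <col> rlike '<pattern>'.
RLIKE_RE = re.compile(r"^\s*(\w+)\s+rlike\s+'(.*)'\s*$", re.IGNORECASE)

def to_mongo_filter(predicate):
    """Return a MongoDB filter document for a recognizable predicate,
    or {} (match everything) when we cannot safely translate it."""
    m = RLIKE_RE.match(predicate or "")
    if m and m.group(1) in MAPPED_COLUMNS:
        # Hive's literal carries escaped backslashes (e.g. 'Bobby\\sWhite');
        # MongoDB's $regex wants the raw regex.
        pattern = m.group(2).replace("\\\\", "\\")
        return {m.group(1): {"$regex": pattern}}
    return {}  # fail open: fetch all documents and let Hive do the filtering
```

Anything the translator does not understand degrades gracefully to the connector's current full-collection scan, so correctness never depends on the optimization firing.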
> Just trying to understand SerDes and how we are completely not using them
> as intended :)
>
> On Tue, Oct 16, 2012 at 10:42 AM, Connell, Chuck <chuck.conn...@nuance.com> wrote:
>
>> A SerDe is actually used the other way around: Hive parses the query,
>> writes MapReduce code to solve the query, and the generated code uses the
>> SerDe for field access.
>>
>> The standard way to write a SerDe is to start from the trunk regex SerDe,
>> then modify as needed:
>>
>> http://svn.apache.org/viewvc/hive/trunk/contrib/src/java/org/apache/hadoop/hive/contrib/serde2/RegexSerDe.java?revision=1131106&view=markup
>>
>> Also, a nice article by Roberto Congiu:
>>
>> http://www.congiu.com/a-json-readwrite-serde-for-hive/
>>
>> Chuck Connell
>> Nuance R&D Data Team
>> Burlington, MA
>>
>> *From:* John Omernik [mailto:j...@omernik.com]
>> *Sent:* Tuesday, October 16, 2012 11:30 AM
>> *To:* user@hive.apache.org
>> *Subject:* Writing Custom Serdes for Hive
>>
>> We have a maybe obvious question about a SerDe. When a SerDe is invoked,
>> does it have access to the original Hive query? Ideally the original query
>> could provide the SerDe some hints on how to access the data on the
>> backend.
>>
>> Also, are there any good links/documentation on how to write SerDes?
>> Kinda hard to Google for some reason.
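[For readers following along: the regex-SerDe pattern Chuck points at can be sketched conceptually as below. This is an illustrative Python sketch of what a regex-driven `deserialize()` does, not Hive's Java code; the regex, column names, and sample row are made-up examples. Hive's contrib RegexSerDe is configured with an `input.regex` whose capturing groups map positionally to the table's columns.]

```python
import re

# Configured pattern: one capturing group per declared column (example only).
INPUT_REGEX = re.compile(r"(\S+) (\S+) (\d+)")
COLUMNS = ["host", "method", "status"]

def deserialize(raw_row):
    """Map one raw text row to column-name -> string-value.
    Rows that do not match the pattern yield all-NULL columns."""
    m = INPUT_REGEX.match(raw_row)
    if m is None:
        return {c: None for c in COLUMNS}
    return dict(zip(COLUMNS, m.groups()))
```

This also makes Chuck's direction-of-use point concrete: the SerDe only sees one raw record at a time and turns it into fields for the generated MapReduce code; it never sees the query, which is why the predicate-aware work belongs in the InputFormat/RecordReader instead.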