Thanks Sergey for that. Good stuff. I can't speak for everybody here obviously but Hive partition elimination is critical - its gotta happen somehow. However, if JDOQL method isn't robust around the edges i'm fine with finding something better.
So if I get you right you're saying by removing the "optimized path" (getPartitionsByFiltr/JDOQL) the partition elimination logic will default to the "normal path" which is some other kind of filtering. To that i guess i'd have to say what's the risk? It's a little slower? Thanks for your patience, Sergey! On Tue, Aug 27, 2013 at 10:35 AM, Sergey Shelukhin <ser...@hortonworks.com>wrote: > This method is used to prune partitions for the job (separately from > actually processing data). > There are a few ways to get partitions from Hive for a query (to avoid > reading all partitions when filtering involves partition columns) - > get-by-filter that I want to modify is one of them. Hive itself uses it as > a perf optimization; the normal path gets all partition column values (via > partition names) and applies the filter locally, whereas the optimized path > converts the filter to JDOQL for DataNucleus (that Hive metastore uses > internally), which converts it to SQL queries for e.g. MySQL. This normally > happens before MR job is even run. > > Hive uses the latter (JDOQL pushdown) path for a restricted set of filters. > These are enforced in Hive metastore client, not server; the server > supports a wider set of filters, but Hive itself doesn't use them. While > trying to enable Hive to use a wider set I noticed that the LIKE filter > doesn't work properly - both regex and indexOf/... functions in DN seem to > have some weird edge cases. It may be sending some things directly to > datastore which would not actually work. > However they would work for simple regexes (definition of simple is not > clear and may be not the same for all datastores). > > Given that there's normal path to filter partitions in hive client and > pre-job perf optimization for like is not that important, I want to remove > this for Hive, > I assume that other products using this path must apply filtering on client > too sometimes (because getPartitionsByFilter doesn't support all filters > even on server, e.g. such operators as not, between, etc.). > > On Tue, Aug 27, 2013 at 9:13 AM, Stephen Sprague <sprag...@gmail.com> > wrote: > > > sorry to be dumb-ass but what does that translate into in the HSQL > dialect? > > > > Judging from the name you use, getPartitionsByFilter, you're saying you > > want to remove the use case of using like clause on a partition column? > > > > if so, um, yeah, i would think that's surely used. > > > > > > > > On Mon, Aug 26, 2013 at 7:48 PM, Sergey Shelukhin < > ser...@hortonworks.com > > >wrote: > > > > > Adding user list. Any objections to removing LIKE support from > > > getPartitionsByFilter? > > > > > > On Mon, Aug 26, 2013 at 2:54 PM, Ashutosh Chauhan < > hashut...@apache.org > > > >wrote: > > > > > > > Couple of questions: > > > > > > > > 1. What about LIKE operator for Hive itself? Will that continue to > work > > > > (presumably because there is an alternative path for that). > > > > 2. This will nonetheless break other direct consumers of metastore > > client > > > > api (like HCatalog). > > > > > > > > I see your point that we have a buggy implementation, so whats out > > there > > > is > > > > not safe to use. Question than really is shall we remove this code, > > > thereby > > > > breaking people for whom current buggy implementation is good enough > > (or > > > > you can say salvage them from breaking in future). Or shall we try to > > fix > > > > it now? > > > > My take is if there are no users of this anyways, then there is no > > point > > > > fixing it for non-existing users, but if there are we probably have > > to. I > > > > will suggest you to send an email to users@hive to ask if there are > > > users > > > > for this. > > > > > > > > Thanks, > > > > Ashutosh > > > > > > > > > > > > > > > > On Mon, Aug 26, 2013 at 2:08 PM, Sergey Shelukhin < > > > ser...@hortonworks.com > > > > >wrote: > > > > > > > > > Since there's no response I am assuming nobody cares about this > > code... > > > > > Jira is HIVE-5134, I will attach a patch with removal this week. > > > > > > > > > > On Wed, Aug 21, 2013 at 2:28 PM, Sergey Shelukhin < > > > > ser...@hortonworks.com > > > > > >wrote: > > > > > > > > > > > Hi. > > > > > > > > > > > > I think there are issues with the way hive can currently do LIKE > > > > > > operator JDO pushdown and it the code should be removed for > > > partitions > > > > > > and tables. > > > > > > Are there objections to removing LIKE from Filter.g and related > > > areas? > > > > > > If no I will file a JIRA and do it. > > > > > > > > > > > > Details: > > > > > > There's code in metastore that is capable of pushing down LIKE > > > > > > expression into JDO for string partition keys, as well as tables. > > > > > > The code for tables doesn't appear used, and partition code > > > definitely > > > > > > doesn't run in Hive proper because metastore client doesn't send > > LIKE > > > > > > expressions to server. It may be used in e.g. HCat and other > > places, > > > > > > but after asking some people here, I found out it probably isn't. > > > > > > I was trying to make it run and noticed some problems: > > > > > > 1) For partitions, Hive sends SQL patterns in a filter for like, > > e.g. > > > > > > "%foo%", whereas metastore passes them into matches() JDOQL > method > > > > > > which expects Java regex. > > > > > > 2) Converting the pattern to Java regex via UDFLike method, I > found > > > > > > out that not all regexes appear to work in DN. ".*foo" seems to > > work > > > > > > but anything complex (such as escaping the pattern using > > > > > > Pattern.quote, which UDFLike does) breaks and no longer matches > > > > > > properly. > > > > > > 3) I tried to implement common cases using JDO methods > > > > > > startsWith/endsWith/indexOf (I will file a JIRA), but when I run > > > tests > > > > > > on Derby, they also appear to have problems with some strings > (for > > > > > > example, partition with backslash in the name cannot be matched > by > > > > > > LIKE "%\%" (single backslash in a string), after being converted > to > > > > > > .indexOf(param) where param is "\" (escaping the backslash once > > again > > > > > > doesn't work either, and anyway there's no documented reason why > it > > > > > > shouldn't work properly), while other characters match correctly, > > > even > > > > > > e.g. "%". > > > > > > > > > > > > For tables, there's no SQL-like, it expects Java regex, but I am > > not > > > > > > convinced all Java regexes are going to work. > > > > > > > > > > > > So, I think that for future correctness sake it's better to > remove > > > this > > > > > > code. > > > > > > > > > > > > > > > > -- > > > > > CONFIDENTIALITY NOTICE > > > > > NOTICE: This message is intended for the use of the individual or > > > entity > > > > to > > > > > which it is addressed and may contain information that is > > confidential, > > > > > privileged and exempt from disclosure under applicable law. If the > > > reader > > > > > of this message is not the intended recipient, you are hereby > > notified > > > > that > > > > > any printing, copying, dissemination, distribution, disclosure or > > > > > forwarding of this communication is strictly prohibited. If you > have > > > > > received this communication in error, please contact the sender > > > > immediately > > > > > and delete it from your system. Thank You. > > > > > > > > > > > > > > > -- > > > CONFIDENTIALITY NOTICE > > > NOTICE: This message is intended for the use of the individual or > entity > > to > > > which it is addressed and may contain information that is confidential, > > > privileged and exempt from disclosure under applicable law. If the > reader > > > of this message is not the intended recipient, you are hereby notified > > that > > > any printing, copying, dissemination, distribution, disclosure or > > > forwarding of this communication is strictly prohibited. If you have > > > received this communication in error, please contact the sender > > immediately > > > and delete it from your system. Thank You. > > > > > > > -- > CONFIDENTIALITY NOTICE > NOTICE: This message is intended for the use of the individual or entity to > which it is addressed and may contain information that is confidential, > privileged and exempt from disclosure under applicable law. If the reader > of this message is not the intended recipient, you are hereby notified that > any printing, copying, dissemination, distribution, disclosure or > forwarding of this communication is strictly prohibited. If you have > received this communication in error, please contact the sender immediately > and delete it from your system. Thank You. >