Thanks Sergey for that.  Good stuff.

I can't speak for everybody here, obviously, but Hive partition elimination
is critical - it's gotta happen somehow.  However, if the JDOQL method isn't
robust around the edges, I'm fine with finding something better.

So if I get you right, you're saying that by removing the "optimized path"
(getPartitionsByFilter/JDOQL), the partition elimination logic will default
to the "normal path", which is some other kind of filtering.  To that I
guess I'd have to say: what's the risk?  It's a little slower?
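
Just to check my own understanding, here's a rough Java sketch of what I
imagine the "normal path" boils down to: list the partition names, pull out
the partition column values, and apply the LIKE filter on the client.  This
is not Hive's actual code.  The class and method names (PartitionPruneSketch,
likeToRegex, parsePartitionName, pruneLocally) are made up for illustration,
and it ignores LIKE escape characters and any escaping Hive may apply to
partition names.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical illustration of client-side partition pruning; not Hive code.
public class PartitionPruneSketch {

    // Convert a SQL LIKE pattern to a Java regex: '%' becomes ".*",
    // '_' becomes ".", and every other run of characters is quoted literally.
    public static Pattern likeToRegex(String like) {
        StringBuilder regex = new StringBuilder();
        StringBuilder literal = new StringBuilder();
        for (char c : like.toCharArray()) {
            if (c == '%' || c == '_') {
                if (literal.length() > 0) {
                    regex.append(Pattern.quote(literal.toString()));
                    literal.setLength(0);
                }
                regex.append(c == '%' ? ".*" : ".");
            } else {
                literal.append(c);
            }
        }
        if (literal.length() > 0) {
            regex.append(Pattern.quote(literal.toString()));
        }
        return Pattern.compile(regex.toString(), Pattern.DOTALL);
    }

    // Parse a name such as "ds=2013-08-27/country=us" into column -> value.
    public static Map<String, String> parsePartitionName(String name) {
        Map<String, String> values = new LinkedHashMap<>();
        for (String part : name.split("/")) {
            int eq = part.indexOf('=');
            if (eq > 0) {
                values.put(part.substring(0, eq), part.substring(eq + 1));
            }
        }
        return values;
    }

    // Keep only the partitions whose value for 'column' matches the LIKE pattern.
    public static List<String> pruneLocally(List<String> partitionNames,
                                            String column, String likePattern) {
        Pattern pattern = likeToRegex(likePattern);
        List<String> kept = new ArrayList<>();
        for (String name : partitionNames) {
            String value = parsePartitionName(name).get(column);
            if (value != null && pattern.matcher(value).matches()) {
                kept.add(name);
            }
        }
        return kept;
    }
}
```

If that's roughly right, then a filter like country LIKE 'u%' still prunes
down to the "us" and "uk" partitions; the only difference is that the
matching happens in the client after fetching all the partition names,
instead of inside the metastore's datastore.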

Thanks for your patience, Sergey!

On Tue, Aug 27, 2013 at 10:35 AM, Sergey Shelukhin
<ser...@hortonworks.com> wrote:

> This method is used to prune partitions for the job (separately from
> actually processing data).
> There are a few ways to get partitions from Hive for a query (to avoid
> reading all partitions when filtering involves partition columns)  -
> get-by-filter that I want to modify is one of them. Hive itself uses it as
> a perf optimization; the normal path gets all partition column values (via
> partition names) and applies the filter locally, whereas the optimized path
> converts the filter to JDOQL for DataNucleus (that Hive metastore uses
> internally), which converts it to SQL queries for e.g. MySQL. This normally
> happens before MR job is even run.
>
> Hive uses the latter (JDOQL pushdown) path for a restricted set of filters.
> These are enforced in Hive metastore client, not server; the server
> supports a wider set of filters, but Hive itself doesn't use them. While
> trying to enable Hive to use a wider set I noticed that the LIKE filter
> doesn't work properly - both regex and indexOf/... functions in DN seem to
> have some weird edge cases. It may be sending some things directly to
> datastore which would not actually work.
> However they would work for simple regexes (definition of simple is not
> clear and may be not the same for all datastores).
>
> Given that there's a normal path to filter partitions in the Hive client and
> pre-job perf optimization for LIKE is not that important, I want to remove
> this for Hive.
> I assume that other products using this path must apply filtering on client
> too sometimes (because getPartitionsByFilter doesn't support all filters
> even on server, e.g. such operators as not, between, etc.).
>
> On Tue, Aug 27, 2013 at 9:13 AM, Stephen Sprague <sprag...@gmail.com>
> wrote:
>
> > sorry to be dumb-ass but what does that translate into in the HSQL
> > dialect?
> >
> > Judging from the name you use, getPartitionsByFilter, you're saying you
> > want to remove the use case of using like clause on a partition column?
> >
> > if so, um, yeah, i would think that's surely used.
> >
> >
> >
> > On Mon, Aug 26, 2013 at 7:48 PM, Sergey Shelukhin
> > <ser...@hortonworks.com> wrote:
> >
> > > Adding user list. Any objections to removing LIKE support from
> > > getPartitionsByFilter?
> > >
> > > On Mon, Aug 26, 2013 at 2:54 PM, Ashutosh Chauhan
> > > <hashut...@apache.org> wrote:
> > >
> > > > Couple of questions:
> > > >
> > > > 1. What about LIKE operator for Hive itself? Will that continue to work
> > > > (presumably because there is an alternative path for that).
> > > > 2. This will nonetheless break other direct consumers of metastore client
> > > > api (like HCatalog).
> > > >
> > > > I see your point that we have a buggy implementation, so what's out there
> > > > is not safe to use. The question then really is: shall we remove this
> > > > code, thereby breaking people for whom the current buggy implementation
> > > > is good enough (or, you could say, saving them from breaking in the
> > > > future)? Or shall we try to fix it now?
> > > > My take is that if there are no users of this anyway, then there is no
> > > > point fixing it for non-existent users, but if there are, we probably
> > > > have to. I suggest you send an email to users@hive to ask if there are
> > > > users for this.
> > > >
> > > > Thanks,
> > > > Ashutosh
> > > >
> > > >
> > > >
> > > > On Mon, Aug 26, 2013 at 2:08 PM, Sergey Shelukhin
> > > > <ser...@hortonworks.com> wrote:
> > > >
> > > > > Since there's no response I am assuming nobody cares about this code...
> > > > > Jira is HIVE-5134, I will attach a patch with removal this week.
> > > > >
> > > > > On Wed, Aug 21, 2013 at 2:28 PM, Sergey Shelukhin
> > > > > <ser...@hortonworks.com> wrote:
> > > > >
> > > > > > Hi.
> > > > > >
> > > > > > I think there are issues with the way hive can currently do LIKE
> > > > > > operator JDO pushdown and the code should be removed for partitions
> > > > > > and tables.
> > > > > > Are there objections to removing LIKE from Filter.g and related areas?
> > > > > > If not, I will file a JIRA and do it.
> > > > > >
> > > > > > Details:
> > > > > > There's code in metastore that is capable of pushing down LIKE
> > > > > > expression into JDO for string partition keys, as well as tables.
> > > > > > The code for tables doesn't appear used, and partition code definitely
> > > > > > doesn't run in Hive proper because metastore client doesn't send LIKE
> > > > > > expressions to server. It may be used in e.g. HCat and other places,
> > > > > > but after asking some people here, I found out it probably isn't.
> > > > > > I was trying to make it run and noticed some problems:
> > > > > > 1) For partitions, Hive sends SQL patterns in a filter for like, e.g.
> > > > > > "%foo%", whereas metastore passes them into matches() JDOQL method
> > > > > > which expects Java regex.
> > > > > > 2) Converting the pattern to Java regex via UDFLike method, I found
> > > > > > out that not all regexes appear to work in DN. ".*foo" seems to work
> > > > > > but anything complex (such as escaping the pattern using
> > > > > > Pattern.quote, which UDFLike does) breaks and no longer matches
> > > > > > properly.
> > > > > > 3) I tried to implement common cases using JDO methods
> > > > > > startsWith/endsWith/indexOf (I will file a JIRA), but when I run tests
> > > > > > on Derby, they also appear to have problems with some strings (for
> > > > > > example, partition with backslash in the name cannot be matched by
> > > > > > LIKE "%\%" (single backslash in a string), after being converted to
> > > > > > .indexOf(param) where param is "\" (escaping the backslash once again
> > > > > > doesn't work either, and anyway there's no documented reason why it
> > > > > > shouldn't work properly), while other characters match correctly, even
> > > > > > e.g. "%").
> > > > > >
> > > > > > For tables, there's no SQL-like, it expects Java regex, but I am not
> > > > > > convinced all Java regexes are going to work.
> > > > > >
> > > > > > So, I think that for future correctness sake it's better to remove this
> > > > > > code.
> > > > > >
> > > > >
> > > > > --
> > > > > CONFIDENTIALITY NOTICE
> > > > > NOTICE: This message is intended for the use of the individual or entity to
> > > > > which it is addressed and may contain information that is confidential,
> > > > > privileged and exempt from disclosure under applicable law. If the reader
> > > > > of this message is not the intended recipient, you are hereby notified that
> > > > > any printing, copying, dissemination, distribution, disclosure or
> > > > > forwarding of this communication is strictly prohibited. If you have
> > > > > received this communication in error, please contact the sender immediately
> > > > > and delete it from your system. Thank You.
> > > > >
> > > >
> > >
> >
>
>
