Hi Jay,

I noticed the same thing when I built my tool, and it makes sense: syntactically the two forms are equivalent, so the Hive parser does not care. You could probably update the HiveParser so that it keeps the information in the AST, but not without breaking every part of the code that reads that AST, I'm afraid.
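To see why the information is lost, consider a grammar rule shaped like `(KW_AS)? Identifier`: the optional keyword is matched and consumed, but only the identifier is recorded in the tree, so both spellings produce an identical node. A toy sketch of such a rule (hypothetical illustration, not Hive's actual grammar code):

```java
// Toy illustration of an optional-keyword rule: the AS token is consumed
// but never recorded, so "(...) as c" and "(...) c" yield the same result.
// This is a hypothetical sketch, not Hive's actual grammar code.
public class OptionalAs {
    public static String parseAlias(String[] tokens, int pos) {
        if (tokens[pos].equalsIgnoreCase("as")) {
            pos++; // the AS keyword is matched and then discarded
        }
        return tokens[pos]; // only the identifier survives into the AST
    }
}
```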
In the long run, I believe it would be really valuable if someone had the courage to completely rewrite Hive parsing using best practices and good tools (e.g. ANTLR 4), but I doubt that it will ever happen, unfortunately :-(

On 13 April 2018 at 14:25, Jay Green-Stevens <t-jgreenstev...@hotels.com> wrote:

> Afternoon all,
>
> We have successfully managed to build a Java tool which will translate a
> Hive query into a syntax tree, and then turn this back into a Hive query
> equivalent to the input.
>
> But we have found that for a query such as
>
> select a
> from (
>   select a
>   from b
> ) as c
>
> the Hive parser is stripping out the 'as' when building the tree. This
> then means that when the query string is rebuilt the output is
> 'select a from (select a from b) c', and although this is technically
> valid it is not equivalent to the input query.
>
> Is there any way we can fix this issue? Could we change the parser in
> some way to stop the 'as' from being stripped out?
>
> Any ideas would be greatly appreciated!
>
> Thanks,
>
> Jay

> From: Elliot West <tea...@gmail.com>
> Date: 19 March 2018 at 21:33
> Subject: Re: HQL parser internals
> To: user@hive.apache.org
>
> Hello again,
>
> We're now testing our system against a corpus of Hive SQL statements in
> an effort to quickly highlight edge cases, limitations, etc. We're
> finding that org.apache.hadoop.hive.ql.parse.ParseUtils is stumbling on
> variables such as ${hiveconf:varname}. Are variable substitutions
> handled prior to parsing or within the parser itself? If in a
> pre-processing stage, is there any code or utility classes within Hive
> that we can use as a reference, or to provide this functionality?
>
> Cheers,
>
> Elliot.
>
> On 19 February 2018 at 11:10, Elliot West <tea...@gmail.com> wrote:
>
> Thank you all for your rapid responses; some really useful information
> and pointers in there.
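On Elliot's question: Hive substitutes ${hiveconf:...} and ${hivevar:...} references before the text reaches the parser (see the VariableSubstitution class in the Hive codebase). A minimal self-contained sketch of that kind of pre-processing pass, with illustrative names that are not Hive's API:

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch of pre-parse variable substitution, the kind of
// pass Hive applies before the query text reaches the parser. Class and
// method names here are hypothetical, not Hive's API.
public class VarSubstitution {
    private static final Pattern VAR =
        Pattern.compile("\\$\\{(hiveconf|hivevar):([^}]+)\\}");

    public static String substitute(String query, Map<String, String> vars) {
        Matcher m = VAR.matcher(query);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String value = vars.get(m.group(2));
            // Leave unresolved variables untouched rather than failing.
            m.appendReplacement(
                sb, Matcher.quoteReplacement(value != null ? value : m.group(0)));
        }
        m.appendTail(sb);
        return sb.toString();
    }
}
```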
> We'll keep the list updated with our progress.

On 18 February 2018 at 19:00, Dharmesh Kakadia <dhkaka...@gmail.com> wrote:

> +1 for using ParseDriver for this. I have also used it to intercept and
> augment the query AST.
>
> Also, I would echo others' sentiment that it's quite ugly. It would be
> great if we could refactor/standardize this. That would make integrating
> other systems a lot easier.
>
> Thanks,
> Dharmesh

On Sat, Feb 17, 2018 at 12:07 AM, Furcy Pin <pin.fu...@gmail.com> wrote:

> Hi Elliot,
>
> Actually, I have done quite similar work regarding custom Hive parsing;
> you should have a look at my project: https://github.com/flaminem/flamy
>
> The Hive parsing related stuff is here:
> https://github.com/flaminem/flamy/tree/master/src/main/scala/com/flaminem/flamy/parsing/hive
> A good starting point to see how to parse queries is here:
> https://github.com/flaminem/flamy/blob/master/src/main/scala/com/flaminem/flamy/parsing/hive/PopulateParserInfo.scala#L492
>
> Basically, all you need is to use an
> org.apache.hadoop.hive.ql.parse.ParseDriver:
>
>   val pd: ParseDriver = new ParseDriver
>   val tree: ASTNode = pd.parse(query, hiveContext)
>
> You then get the ASTNode, which you can freely parse and change.
> I must say that it is quite ugly to manipulate, and the Presto parser
> seems to be much better designed (but it is not the same syntax,
> unfortunately); I recommend looking at it to get better design ideas.
>
> If you want to enrich your Hive syntax like I did (I wanted to be able
> to parse ${VARS} in queries), you will not be able to use the HiveParser
> without some workaround. What I did was replace these ${VARS} with
> strings "${VARS}" that the HiveParser would agree to parse, and that I
> could recognize afterwards...
>
> Also, if you are familiar with Scala, I recommend using it; it helps a
> lot...
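The ${VARS} workaround Furcy describes (wrapping variables in quotes so the stock parser accepts them as string literals, then unwrapping after parsing) could be sketched as follows; this is a hypothetical illustration, not flamy's actual code:

```java
// Sketch of the quote-the-variables workaround: wrap ${VARS} in double
// quotes so an unmodified HiveParser accepts them, then unwrap them when
// walking or regenerating the query. Hypothetical, not flamy's code.
public class VarQuoting {
    public static String quoteVars(String query) {
        return query.replaceAll("(\\$\\{[^}]+\\})", "\"$1\"");
    }

    public static String unquoteVars(String query) {
        return query.replaceAll("\"(\\$\\{[^}]+\\})\"", "$1");
    }
}
```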
> For instance, I have this class that transforms an AST back into a
> string query:
> https://github.com/flaminem/flamy/blob/master/src/main/scala/com/flaminem/flamy/parsing/hive/ast/SQLFormatter.scala
> I could never have done something that good-looking in Java...
>
> Finally, this method helps a lot to understand how the hell the AST
> works:
> https://github.com/flaminem/flamy/blob/master/src/main/scala/com/flaminem/flamy/parsing/hive/HiveParserUtils.scala#L593
>
> Make sure to write *tons* of unit tests too, you'll need them.
>
> Hope this helps,
>
> Furcy

On 16 February 2018 at 21:20, Gopal Vijayaraghavan <gop...@apache.org> wrote:

> > However, ideally we wish to manipulate the original query as delivered
> > by the user (or as close to it as possible), and we're finding that
> > the tree has been modified significantly by the time it hits the hook.
>
> That's CBO. It takes Query -> AST -> Calcite Tree -> AST -> hook; the
> bushy join conversion is already done by the time the hook gets called.
> We need a Parser hook to hook in ahead of CBO, not a Semantic Analyzer
> hook.
>
> > Additionally we wish to track ASTNodes back to the character sequences
> > in the source HQL that were their origin (where sensible), and
> > ultimately hope to be able to regenerate the query text from the AST.
>
> I started work on a Hive unparser a while back based on this class, but
> it's a world of verbose coding:
> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/translator/ASTConverter.java#L850
>
> If you're doing active work on this, I'd like to help, because I need
> the AST -> query to debug CBO.
>
> > The use case, if you are interested, is a mutation testing framework
> > for HQL.
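The AST-to-query unparsing that Furcy and Gopal discuss boils down to a recursive walk in which each node type knows how to render itself and its children back to SQL text. A toy sketch with a stand-in Node type (not Hive's ASTNode), showing how a round-tripping unparser could choose to emit the optional 'as':

```java
import java.util.Arrays;
import java.util.List;

// Toy illustration of the recursive unparse pattern used by classes like
// SQLFormatter and ASTConverter: each node type renders itself and its
// children back to SQL. Node is a stand-in, not Hive's ASTNode.
public class ToyUnparser {
    public static class Node {
        final String token;
        final List<Node> children;
        public Node(String token, Node... children) {
            this.token = token;
            this.children = Arrays.asList(children);
        }
    }

    public static String unparse(Node n) {
        switch (n.token) {
            case "TOK_SELECT":
                return "select " + unparse(n.children.get(0))
                     + " from " + unparse(n.children.get(1));
            case "TOK_SUBQUERY":
                // Emit 'as' explicitly so the rebuilt query keeps the
                // alias syntax even though the parser dropped the keyword.
                return "(" + unparse(n.children.get(0)) + ") as "
                     + n.children.get(1).token;
            default:
                return n.token; // leaf: a column or table identifier
        }
    }
}
```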
> > The testing of mutants is operational, but now we need to report on
> > survivors, hence the need to track back from specific query elements
> > to character sequences in the original query string.
>
> This sounds a lot like the fuzzing random-query-gen used at Cloudera to
> keep Impala and Hive bug-for-bug compatible:
> https://cwiki.apache.org/confluence/download/attachments/27362054/Random%20Query%20Gen-%20Hive%20Meetup.pptx
>
> Cheers,
> Gopal
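On tracking query elements back to character sequences: the ANTLR tokens behind Hive's AST expose line and column positions (via getLine() and getCharPositionInLine()), and a small helper can map those to offsets in the original query string. A hypothetical sketch, with illustrative names:

```java
// Hypothetical helper mapping a token's (line, column) position, as
// exposed by the ANTLR tokens behind Hive's AST, back to a character
// offset in the original query string. Lines are 1-based and columns
// 0-based, matching ANTLR's getLine()/getCharPositionInLine() conventions.
public class TokenOrigin {
    public static int toOffset(String query, int line, int col) {
        String[] lines = query.split("\n", -1);
        int offset = 0;
        for (int i = 0; i < line - 1; i++) {
            offset += lines[i].length() + 1; // +1 for the newline
        }
        return offset + col;
    }
}
```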