Hi Jay,

     Yes, the Hive parser is stripping out the "as". As Gopal explained in
the earlier thread, the parser is used to translate a SQL string into an
AST that the compiler can understand. That said, as long as the compiler
that follows knows that "oh, this is a renaming", it is OK. In other
words, it is fine to add the "as" back as long as your compiler
understands it. In Hive's case, IMHO, I do not think it is necessary to
add it back, because the compiler already knows what it means.
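
If you do want the regenerated text to match the input exactly, one simple
option is to always print the keyword in your own unparser, since both
spellings mean the same thing. A rough sketch (the helper class below is
hypothetical, for illustration only, and is not part of Hive):

```java
// Hypothetical helper for an AST-to-SQL printer: always emit "AS"
// before a subquery alias, so "(...) c" and "(...) AS c" both come
// back out in the same normalized form.
public class AliasFormatter {
    public static String formatSubqueryAlias(String subquery, String alias) {
        // Emit the optional AS keyword unconditionally; both spellings
        // are semantically identical in HiveQL, so this is safe.
        return "(" + subquery + ") AS " + alias;
    }
}
```

This does not preserve the user's original spelling, but it does guarantee
that the output round-trips to the same text every time.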

Best
Pengcheng

On Fri, Apr 13, 2018 at 6:29 AM, Furcy Pin <pin.fu...@gmail.com> wrote:

> Hi Jay,
>
> I noticed the same thing when I built my tool, and it makes sense:
> syntactically the two forms are equivalent, so the Hive parser does not
> care.
> You could probably update the HiveParser so that it keeps the information
> in the AST, but not without breaking every part of the code that reads that
> AST I'm afraid.
>
> In the long run, I believe that it would be really valuable if someone had
> the courage to completely rewrite Hive parsing using best practices
> and good tools (e.g. ANTLR 4), but I doubt that it will ever happen
> unfortunately :-(
>
>
>
>
> On 13 April 2018 at 14:25, Jay Green-Stevens <t-jgreenstev...@hotels.com>
> wrote:
>
>> Afternoon all,
>>
>>
>>
>> We have successfully managed to build a Java tool which translates a
>> Hive query into a syntax tree, and then turns this back into a Hive
>> query equivalent to the input.
>>
>>
>>
>> But we have found that for a query such as
>>
>> *select a *
>>
>> *from (*
>>
>> *select a *
>>
>> *from b*
>>
>> *) as c*
>>
>>
>>
>> the Hive parser is stripping out the ‘as’ when building the tree. This
>> means that when the query string is rebuilt the output is ‘*select
>> a from (select a from b) c’*, and although this is technically valid it
>> is not equivalent to the input query.
>>
>>
>>
>> Is there any way we can fix this issue?
>>
>> Could we change the parser in some way to stop the ‘as’ from being
>> stripped out?
>>
>>
>>
>> Any ideas would be greatly appreciated!
>>
>>
>>
>> Thanks,
>>
>>
>>
>> Jay
>>
>>
>>
>>
>>
>>
>>
>>
>> From: *Elliot West* <tea...@gmail.com>
>> Date: 19 March 2018 at 21:33
>> Subject: Re: HQL parser internals
>> To: user@hive.apache.org
>>
>> Hello again,
>>
>>
>>
>> We're now testing our system against a corpus of Hive SQL statements in
>> an effort to quickly highlight edge cases, limitations etc. We're finding
>> that org.apache.hadoop.hive.ql.parse.ParseUtils is stumbling on
>> variables such as ${hiveconf:varname}. Are variable substitutions
>> handled prior to parsing or within the parser itself? If in a pre-processing
>> stage, is there any code or utility classes within Hive that we can use as
>> a reference, or to provide this functionality?
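>>
>> For now we work around it with a simple pre-parse substitution pass
>> along these lines (an illustrative sketch only, not Hive's actual
>> implementation; the VarSubst class is our own):

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Minimal sketch of hiveconf-style variable substitution as a
// pre-processing pass over the raw query string, before it is
// handed to the parser. Illustrative only.
public class VarSubst {
    private static final Pattern VAR = Pattern.compile("\\$\\{hiveconf:([^}]+)\\}");

    public static String substitute(String sql, Map<String, String> conf) {
        Matcher m = VAR.matcher(sql);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String name = m.group(1);
            // Leave unknown variables untouched rather than failing.
            String value = conf.getOrDefault(name, m.group(0));
            m.appendReplacement(sb, Matcher.quoteReplacement(value));
        }
        m.appendTail(sb);
        return sb.toString();
    }
}
```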
>>
>>
>>
>> Cheers,
>>
>>
>>
>> Elliot.
>>
>>
>>
>> On 19 February 2018 at 11:10, Elliot West <tea...@gmail.com> wrote:
>>
>> Thank you all for your rapid responses; some really useful information
>> and pointers in there.
>>
>>
>>
>> We'll keep the list updated with our progress.
>>
>>
>>
>> On 18 February 2018 at 19:00, Dharmesh Kakadia <dhkaka...@gmail.com>
>> wrote:
>>
>> +1 for using ParseDriver for this. I also have used it to intercept and
>> augment query AST.
>>
>>
>>
>> Also, I would echo others' sentiment that it's quite ugly. It would be
>> great if we could refactor/standardize this. That would make integrating
>> other systems a lot easier.
>>
>>
>>
>> Thanks,
>>
>> Dharmesh
>>
>>
>>
>> On Sat, Feb 17, 2018 at 12:07 AM, Furcy Pin <pin.fu...@gmail.com> wrote:
>>
>> Hi Elliot,
>>
>>
>>
>> Actually, I have done quite similar work regarding Hive custom Parsing,
>> you should have a look at my project: https://github.com/flaminem/flamy
>>
>>
>>
>> The Hive parsing related stuff is here:
>> https://github.com/flaminem/flamy/tree/master/src/main/scala/com/flaminem/flamy/parsing/hive
>> A good starting point to see how to parse queries is here:
>>
>> https://github.com/flaminem/flamy/blob/master/src/main/scala/com/flaminem/flamy/parsing/hive/PopulateParserInfo.scala#L492
>>
>>
>>
>>
>>
>> Basically, all you need is to use a
>> org.apache.hadoop.hive.ql.parse.ParseDriver.
>>
>>
>>
>> val pd: ParseDriver = new ParseDriver
>>
>> val tree: ASTNode = pd.parse(query, hiveContext)
>>
>> You then get the ASTNode, which you can freely traverse and change.
>>
>> Also, I must say that it is quite ugly to manipulate, and the Presto
>> parser seems to be much better designed (but it is not the same syntax,
>> unfortunately).
>>
>> I recommend looking at it to get better design ideas.
>>
>>
>>
>>
>>
>> If you want to enrich your Hive syntax like I did (I wanted to be able to
>> parse ${VARS} in queries),
>>
>> you will not be able to use the HiveParser without some workaround.
>>
>> What I did was replace these ${VARS} with strings "${VARS}" that the
>> HiveParser would agree to parse,
>>
>> and that I could recognize afterwards...
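>>
>> In code, the trick looks roughly like this (an illustrative sketch in
>> Java; the VarEscaper class is made up for the example and is not part
>> of Hive or flamy):

```java
import java.util.regex.Pattern;

// Sketch of the workaround: wrap ${VARS} in quotes so the HiveParser
// sees a plain string literal it can accept, then recognize the quoted
// placeholders again after parsing. Illustrative only.
public class VarEscaper {
    private static final Pattern VAR = Pattern.compile("\\$\\{([^}]+)\\}");

    // Before parsing: ${VAR} -> "${VAR}" (a valid string literal).
    public static String escape(String sql) {
        return VAR.matcher(sql).replaceAll("\"\\${$1}\"");
    }

    // After parsing: does this string literal hold a placeholder?
    public static boolean isPlaceholder(String literal) {
        return VAR.matcher(literal).matches();
    }
}
```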
>>
>>
>>
>> Also, if you are familiar with Scala, I recommend using it, it helps a
>> lot...
>>
>>
>>
>> For instance, I have this class that transforms an AST back into a string
>> query:
>>
>> https://github.com/flaminem/flamy/blob/master/src/main/scala/com/flaminem/flamy/parsing/hive/ast/SQLFormatter.scala
>>
>> I could never have done something that good looking in Java...
>>
>>
>>
>> Finally this method helps a lot to understand how the hell the AST works:
>>
>> https://github.com/flaminem/flamy/blob/master/src/main/scala/com/flaminem/flamy/parsing/hive/HiveParserUtils.scala#L593
>>
>>
>>
>> Make sure to write *tons* of unit tests too, you'll need them.
>>
>>
>>
>> Hope this helps,
>>
>>
>>
>> Furcy
>>
>>
>>
>>
>>
>>
>>
>> On 16 February 2018 at 21:20, Gopal Vijayaraghavan <gop...@apache.org>
>> wrote:
>>
>>
>> > However, ideally we wish to manipulate the original query as delivered
>> by the user (or as close to it as possible), and we’re finding that the
>> tree has been modified significantly by the time it hits the hook
>>
>> That's CBO. It takes the Query -> AST -> Calcite Tree -> AST -> hook;
>> the bushy join conversion is already done by the time the hook gets called.
>>
>> We need a Parser hook to hook it ahead of CBO, not a Semantic Analyzer
>> hook.
>>
>> > Additionally we wish to track back ASTNodes to the character sequences
>> in the source HQL that were their origin (where sensible), and ultimately
>> hope to be able to regenerate the query text from the AST.
>>
>> I started work on a Hive-unparser a while back based on this class, but
>> it's a world of verbose coding.
>>
>> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/translator/ASTConverter.java#L850
>>
>> If you're doing active work on this, I'd like to help, because I need the
>> AST -> query to debug CBO.
>>
>> > The use case, if you are interested, is a mutation testing framework
>> for HQL. The testing of mutants is operational, but now we need to report
>> on survivors, hence the need to track back from specific query elements to
>> character sequences in the original query string.
>>
>> This sounds a lot like the fuzzing random-query-gen used in Cloudera to
>> have Impala vs Hive bug-for-bug compat.
>>
>> https://cwiki.apache.org/confluence/download/attachments/27362054/Random%20Query%20Gen-%20Hive%20Meetup.pptx
>>
>> Cheers,
>> Gopal
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>
>
