Hi Elliot,

Yes, the variable substitution is done before the parsing. This makes generic query validation much more complicated.

As I explained in my previous message, what I did was to replace these ${VARS} with strings "${VARS}" that the HiveParser would agree to parse, and that I could recognize afterwards...
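In code, the trick looks roughly like the sketch below (a minimal sketch only — the regex, the object name and the helper names are illustrative, not the actual flamy code):

    import scala.util.matching.Regex
    import org.apache.hadoop.hive.ql.parse.{ASTNode, ParseDriver}

    object VariableQuoting {

      // Matches ${hiveconf:varname}, ${VARS}, etc.
      private val VarPattern: Regex = "\\$\\{[^}]+\\}".r

      // Replace each ${VAR} with the string literal "${VAR}" so that the
      // HiveParser accepts the query; the literals can be recognized again
      // later when walking the resulting AST.
      def quoteVariables(query: String): String =
        VarPattern.replaceAllIn(query, m => Regex.quoteReplacement("\"" + m.matched + "\""))

      def parseWithVariables(query: String): ASTNode = {
        val pd = new ParseDriver
        pd.parse(quoteVariables(query))
      }
    }

Recognizing the variables afterwards is then just a matter of spotting string-literal tokens in the AST whose text starts with "${".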
On 19 March 2018 at 22:33, Elliot West <tea...@gmail.com> wrote:

> Hello again,
>
> We're now testing our system against a corpus of Hive SQL statements in an
> effort to quickly highlight edge cases, limitations, etc. We're finding that
> org.apache.hadoop.hive.ql.parse.ParseUtils is stumbling on variables such as
> ${hiveconf:varname}. Are variable substitutions handled prior to parsing or
> within the parser itself? If in a pre-processing stage, are there any code or
> utility classes within Hive that we can use as a reference, or to provide
> this functionality?
>
> Cheers,
>
> Elliot.
>
> On 19 February 2018 at 11:10, Elliot West <tea...@gmail.com> wrote:
>
>> Thank you all for your rapid responses; some really useful information
>> and pointers in there.
>>
>> We'll keep the list updated with our progress.
>>
>> On 18 February 2018 at 19:00, Dharmesh Kakadia <dhkaka...@gmail.com> wrote:
>>
>>> +1 for using ParseDriver for this. I have also used it to intercept and
>>> augment the query AST.
>>>
>>> Also, I would echo others' sentiment that it's quite ugly. It would be
>>> great if we could refactor/standardize this. That would make integrating
>>> other systems a lot easier.
>>>
>>> Thanks,
>>> Dharmesh
>>>
>>> On Sat, Feb 17, 2018 at 12:07 AM, Furcy Pin <pin.fu...@gmail.com> wrote:
>>>
>>>> Hi Elliot,
>>>>
>>>> Actually, I have done quite similar work regarding custom Hive parsing;
>>>> you should have a look at my project: https://github.com/flaminem/flamy
>>>>
>>>> The Hive parsing related stuff is here:
>>>> https://github.com/flaminem/flamy/tree/master/src/main/scala/com/flaminem/flamy/parsing/hive
>>>> A good starting point to see how to parse queries is here:
>>>> https://github.com/flaminem/flamy/blob/master/src/main/scala/com/flaminem/flamy/parsing/hive/PopulateParserInfo.scala#L492
>>>>
>>>> Basically, all you need is to use an
>>>> org.apache.hadoop.hive.ql.parse.ParseDriver:
>>>>
>>>> val pd: ParseDriver = new ParseDriver
>>>> val tree: ASTNode = pd.parse(query, hiveContext)
>>>>
>>>> You then get the ASTNode, which you can freely parse and change.
>>>> I must say that it is quite ugly to manipulate, and the Presto parser
>>>> seems to be much better designed (but it is not the same syntax,
>>>> unfortunately); I recommend looking at it to get better design ideas.
>>>>
>>>> If you want to enrich your Hive syntax like I did (I wanted to be able
>>>> to parse ${VARS} in queries), you will not be able to use the HiveParser
>>>> without some workaround. What I did was to replace these ${VARS} with
>>>> strings "${VARS}" that the HiveParser would agree to parse, and that I
>>>> could recognize afterwards...
>>>>
>>>> Also, if you are familiar with Scala, I recommend using it; it helps a
>>>> lot...
>>>>
>>>> For instance, I have this class that transforms an AST back into a
>>>> string query:
>>>> https://github.com/flaminem/flamy/blob/master/src/main/scala/com/flaminem/flamy/parsing/hive/ast/SQLFormatter.scala
>>>> I could never have done something that good-looking in Java...
>>>>
>>>> Finally, this method helps a lot to understand how the hell the AST
>>>> works:
>>>> https://github.com/flaminem/flamy/blob/master/src/main/scala/com/flaminem/flamy/parsing/hive/HiveParserUtils.scala#L593
>>>>
>>>> Make sure to write *tons* of unit tests too, you'll need them.
>>>>
>>>> Hope this helps,
>>>>
>>>> Furcy
>>>>
>>>> On 16 February 2018 at 21:20, Gopal Vijayaraghavan <gop...@apache.org> wrote:
>>>>
>>>>> > However, ideally we wish to manipulate the original query as
>>>>> delivered by the user (or as close to it as possible), and we're finding
>>>>> that the tree has been modified significantly by the time it hits the hook
>>>>>
>>>>> That's CBO. It takes the Query -> AST -> Calcite Tree -> AST -> hook
>>>>> path - the bushy join conversion is already done by the time the hook
>>>>> gets called.
>>>>>
>>>>> We need a parser hook to hook in ahead of CBO, not a Semantic Analyzer
>>>>> hook.
>>>>>
>>>>> > Additionally we wish to track back ASTNodes to the character
>>>>> sequences in the source HQL that were their origin (where sensible), and
>>>>> ultimately hope to be able to regenerate the query text from the AST.
>>>>>
>>>>> I started work on a Hive unparser a while back based on this class,
>>>>> but it's a world of verbose coding:
>>>>>
>>>>> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/translator/ASTConverter.java#L850
>>>>>
>>>>> If you're doing active work on this, I'd like to help, because I need
>>>>> the AST -> query to debug CBO.
>>>>>
>>>>> > The use case, if you are interested, is a mutation testing framework
>>>>> for HQL. The testing of mutants is operational, but now we need to report
>>>>> on survivors, hence the need to track back from specific query elements to
>>>>> character sequences in the original query string.
>>>>>
>>>>> This sounds a lot like the fuzzing random-query-gen used in Cloudera
>>>>> to keep Impala vs Hive bug-for-bug compatible:
>>>>>
>>>>> https://cwiki.apache.org/confluence/download/attachments/27362054/Random%20Query%20Gen-%20Hive%20Meetup.pptx
>>>>>
>>>>> Cheers,
>>>>> Gopal
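P.S. For anyone picking this up: the quickest way I know to get a feel for the AST shape is to parse a query and print it, along the lines of the rough sketch below (illustrative only — the example query is made up, and the character-offset access through the underlying ANTLR CommonToken is worth double-checking against your Hive version):

    import org.antlr.runtime.CommonToken
    import org.apache.hadoop.hive.ql.parse.{ASTNode, ParseDriver}

    object AstExplorer {

      def main(args: Array[String]): Unit = {
        val pd = new ParseDriver
        val tree: ASTNode = pd.parse("SELECT name, COUNT(*) FROM db.people GROUP BY name")

        // LISP-style rendering, e.g. (TOK_QUERY (TOK_FROM ...) (TOK_INSERT ...))
        println(tree.toStringTree())

        // Recursive walk: ASTNode extends ANTLR's CommonTree, so children are
        // reachable through getChildCount/getChild.
        def walk(node: ASTNode, depth: Int): Unit = {
          // Character offsets in the original query live on the underlying
          // CommonToken (getStartIndex/getStopIndex), when the token carries them.
          val position = node.getToken match {
            case t: CommonToken => s"[${t.getStartIndex}..${t.getStopIndex}]"
            case _ => ""
          }
          println(("  " * depth) + node.getText + " " + position)
          for (i <- 0 until node.getChildCount) {
            walk(node.getChild(i).asInstanceOf[ASTNode], depth + 1)
          }
        }
        walk(tree, 0)
      }
    }

Those token offsets are also roughly what you would lean on to map AST nodes back to character sequences in the original query, as discussed above.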