Hi Elliot,

Actually, I have done quite similar work on custom Hive parsing; you
should have a look at my project: https://github.com/flaminem/flamy

The Hive parsing related stuff is here:
https://github.com/flaminem/flamy/tree/master/src/main/scala/com/flaminem/flamy/parsing/hive
A good starting point to see how to parse queries is here:
https://github.com/flaminem/flamy/blob/master/src/main/scala/com/flaminem/flamy/parsing/hive/PopulateParserInfo.scala#L492


Basically, all you need is an
org.apache.hadoop.hive.ql.parse.ParseDriver:

import org.apache.hadoop.hive.ql.parse.{ASTNode, ParseDriver}

val pd: ParseDriver = new ParseDriver
val tree: ASTNode = pd.parse(query, hiveContext) // hiveContext: an org.apache.hadoop.hive.ql.Context (not Spark's HiveContext), may be null

You then get the ASTNode, which you can freely traverse and change (see the
sketch below).
I must say, though, that it is quite ugly to manipulate, and the Presto parser
seems to be much better designed (but it is not the same syntax,
unfortunately); I recommend looking at it to get better design ideas.
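
For example, here is a minimal sketch of a tree walk (dumpAst is just a name
I made up, and it assumes the Hive ql classes are on your classpath):

import org.apache.hadoop.hive.ql.parse.ASTNode

// Print each node's token text, indented by its depth in the tree.
def dumpAst(node: ASTNode, depth: Int = 0): Unit = {
  println(("  " * depth) + node.getText)
  (0 until node.getChildCount).foreach { i =>
    dumpAst(node.getChild(i).asInstanceOf[ASTNode], depth + 1)
  }
}

Calling dumpAst(tree) gives you an indented view of the TOK_* nodes, which is
handy while you figure out the tree shapes.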


If you want to enrich the Hive syntax like I did (I wanted to be able to
parse ${VARS} in queries),
you will not be able to use the HiveParser without some workaround.
What I did was replace these ${VARS} with strings "${VARS}" that the
HiveParser would agree to parse,
and that I could recognize afterwards (rough sketch below)...
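
In case it helps, here is roughly what that pre-processing looks like (a
simplified sketch, not flamy's actual code; the regex, the quoting convention
and the name hideVariables are my own):

import scala.util.matching.Regex

// Matches ${VAR_NAME} occurrences in the query text.
val VarPattern: Regex = """\$\{[A-Za-z0-9_.]+\}""".r

// Wrap each ${VAR} in double quotes so HiveParser sees a plain string literal.
def hideVariables(query: String): String =
  VarPattern.replaceAllIn(query, m => Regex.quoteReplacement("\"" + m.matched + "\""))

// hideVariables("SELECT * FROM db.tbl WHERE day = ${DAY}")
//   returns: SELECT * FROM db.tbl WHERE day = "${DAY}"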

Also, if you are familiar with Scala, I recommend using it; it helps a
lot...

For instance, I have this class that transforms an AST back into a string
query:
https://github.com/flaminem/flamy/blob/master/src/main/scala/com/flaminem/flamy/parsing/hive/ast/SQLFormatter.scala
I could never have done something that good-looking in Java...

Finally, this method helps a lot in understanding how the hell the AST works:
https://github.com/flaminem/flamy/blob/master/src/main/scala/com/flaminem/flamy/parsing/hive/HiveParserUtils.scala#L593

Make sure to write *tons* of unit tests too, you'll need them.
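
For instance, a test can be as small as this (a ScalaTest FunSuite sketch;
the class name and the query are just examples, adapt to whatever framework
you use):

import org.apache.hadoop.hive.ql.parse.ParseDriver
import org.scalatest.FunSuite

class ParsingSpec extends FunSuite {
  test("a simple SELECT parses into a TOK_QUERY tree") {
    val tree = new ParseDriver().parse("SELECT col FROM db.tbl")
    // toStringTree() gives a compact lisp-like view of the AST, handy for assertions.
    assert(tree.toStringTree().contains("TOK_QUERY"))
  }
}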

Hope this helps,

Furcy



On 16 February 2018 at 21:20, Gopal Vijayaraghavan <gop...@apache.org>
wrote:

>
> > However, ideally we wish to manipulate the original query as delivered
> > by the user (or as close to it as possible), and we’re finding that the
> > tree has been modified significantly by the time it hits the hook
>
> That's CBO. It takes the Query -> AST -> Calcite Tree -> AST -> hook; the
> bushy join conversion is already done by the time the hook gets called.
>
> We need a Parser hook to hook it ahead of CBO, not a Semantic Analyzer
> hook.
>
> > Additionally we wish to track back ASTNodes to the character sequences
> > in the source HQL that were their origin (where sensible), and ultimately
> > hope to be able to regenerate the query text from the AST.
>
> I started work on a Hive-unparser a while back based on this class, but it
> is a world of verbose coding.
>
> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/translator/ASTConverter.java#L850
>
> If you're doing active work on this, I'd like to help, because I need the
> AST -> query to debug CBO.
>
> > The use case, if you are interested, is a mutation testing framework for
> > HQL. The testing of mutants is operational, but now we need to report on
> > survivors, hence the need to track back from specific query elements to
> > character sequences in the original query string.
>
> This sounds a lot like the fuzzing random-query-gen used in Cloudera to
> have Impala vs Hive bug-for-bug compat.
>
> https://cwiki.apache.org/confluence/download/attachments/27362054/Random%20Query%20Gen-%20Hive%20Meetup.pptx
>
> Cheers,
> Gopal
>
>
>
