Hi Elliot,

Yes, the variable substitution is done before the parsing. This makes generic query validation much more complicated.

As I explained in my previous message, what I did was to replace these ${VARS} with strings "${VARS}" that the HiveParser would agree to parse, and that I could recognize afterwards...
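In code, the trick looks roughly like the sketch below (a minimal sketch only — the regex, the object name and the helper names are illustrative, not the actual flamy code):

    import scala.util.matching.Regex
    import org.apache.hadoop.hive.ql.parse.{ASTNode, ParseDriver}

    object VariableQuoting {

      // Matches ${hiveconf:varname}, ${VARS}, etc.
      private val VarPattern: Regex = "\\$\\{[^}]+\\}".r

      // Replace each ${VAR} with the string literal "${VAR}" so that the
      // HiveParser accepts the query; the literals can be recognized again
      // later when walking the resulting AST.
      def quoteVariables(query: String): String =
        VarPattern.replaceAllIn(query, m => Regex.quoteReplacement("\"" + m.matched + "\""))

      def parseWithVariables(query: String): ASTNode = {
        val pd = new ParseDriver
        pd.parse(quoteVariables(query))
      }
    }

Recognizing the variables afterwards is then just a matter of spotting string-literal tokens in the AST whose text starts with "${".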
On 19 March 2018 at 22:33, Elliot West <tea...@gmail.com> wrote:

> Hello again,
>
> We're now testing our system against a corpus of Hive SQL statements in an
> effort to quickly highlight edge cases, limitations, etc. We're finding that
> org.apache.hadoop.hive.ql.parse.ParseUtils is stumbling on variables such as
> ${hiveconf:varname}. Are variable substitutions handled prior to parsing or
> within the parser itself? If in a pre-processing stage, are there any code or
> utility classes within Hive that we can use as a reference, or to provide
> this functionality?
>
> Cheers,
>
> Elliot.
>
> On 19 February 2018 at 11:10, Elliot West <tea...@gmail.com> wrote:
>
>> Thank you all for your rapid responses; some really useful information
>> and pointers in there.
>>
>> We'll keep the list updated with our progress.
>>
>> On 18 February 2018 at 19:00, Dharmesh Kakadia <dhkaka...@gmail.com> wrote:
>>
>>> +1 for using ParseDriver for this. I have also used it to intercept and
>>> augment the query AST.
>>>
>>> Also, I would echo others' sentiment that it's quite ugly. It would be
>>> great if we could refactor/standardize this. That would make integrating
>>> other systems a lot easier.
>>>
>>> Thanks,
>>> Dharmesh
>>>
>>> On Sat, Feb 17, 2018 at 12:07 AM, Furcy Pin <pin.fu...@gmail.com> wrote:
>>>
>>>> Hi Elliot,
>>>>
>>>> Actually, I have done quite similar work regarding custom Hive parsing;
>>>> you should have a look at my project: https://github.com/flaminem/flamy
>>>>
>>>> The Hive parsing related stuff is here:
>>>> https://github.com/flaminem/flamy/tree/master/src/main/scala/com/flaminem/flamy/parsing/hive
>>>> A good starting point to see how to parse queries is here:
>>>> https://github.com/flaminem/flamy/blob/master/src/main/scala/com/flaminem/flamy/parsing/hive/PopulateParserInfo.scala#L492
>>>>
>>>> Basically, all you need is to use an
>>>> org.apache.hadoop.hive.ql.parse.ParseDriver:
>>>>
>>>> val pd: ParseDriver = new ParseDriver
>>>> val tree: ASTNode = pd.parse(query, hiveContext)
>>>>
>>>> You then get the ASTNode, which you can freely parse and change.
>>>> I must say that it is quite ugly to manipulate, and the Presto parser
>>>> seems to be much better designed (but it is not the same syntax,
>>>> unfortunately); I recommend looking at it to get better design ideas.
>>>>
>>>> If you want to enrich your Hive syntax like I did (I wanted to be able
>>>> to parse ${VARS} in queries), you will not be able to use the HiveParser
>>>> without some workaround. What I did was to replace these ${VARS} with
>>>> strings "${VARS}" that the HiveParser would agree to parse, and that I
>>>> could recognize afterwards...
>>>>
>>>> Also, if you are familiar with Scala, I recommend using it; it helps a
>>>> lot...
>>>>
>>>> For instance, I have this class that transforms an AST back into a
>>>> string query:
>>>> https://github.com/flaminem/flamy/blob/master/src/main/scala/com/flaminem/flamy/parsing/hive/ast/SQLFormatter.scala
>>>> I could never have done something that good-looking in Java...
>>>>
>>>> Finally, this method helps a lot to understand how the hell the AST
>>>> works:
>>>> https://github.com/flaminem/flamy/blob/master/src/main/scala/com/flaminem/flamy/parsing/hive/HiveParserUtils.scala#L593
>>>>
>>>> Make sure to write *tons* of unit tests too, you'll need them.
>>>>
>>>> Hope this helps,
>>>>
>>>> Furcy
>>>>
>>>> On 16 February 2018 at 21:20, Gopal Vijayaraghavan <gop...@apache.org> wrote:
>>>>
>>>>> > However, ideally we wish to manipulate the original query as
>>>>> delivered by the user (or as close to it as possible), and we're finding
>>>>> that the tree has been modified significantly by the time it hits the hook
>>>>>
>>>>> That's CBO. It takes the Query -> AST -> Calcite Tree -> AST -> hook
>>>>> path - the bushy join conversion is already done by the time the hook
>>>>> gets called.
>>>>>
>>>>> We need a parser hook to hook in ahead of CBO, not a Semantic Analyzer
>>>>> hook.
>>>>>
>>>>> > Additionally we wish to track back ASTNodes to the character
>>>>> sequences in the source HQL that were their origin (where sensible), and
>>>>> ultimately hope to be able to regenerate the query text from the AST.
>>>>>
>>>>> I started work on a Hive unparser a while back based on this class,
>>>>> but it's a world of verbose coding:
>>>>>
>>>>> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/translator/ASTConverter.java#L850
>>>>>
>>>>> If you're doing active work on this, I'd like to help, because I need
>>>>> the AST -> query to debug CBO.
>>>>>
>>>>> > The use case, if you are interested, is a mutation testing framework
>>>>> for HQL. The testing of mutants is operational, but now we need to report
>>>>> on survivors, hence the need to track back from specific query elements to
>>>>> character sequences in the original query string.
>>>>>
>>>>> This sounds a lot like the fuzzing random-query-gen used in Cloudera
>>>>> to keep Impala vs Hive bug-for-bug compatible:
>>>>>
>>>>> https://cwiki.apache.org/confluence/download/attachments/27362054/Random%20Query%20Gen-%20Hive%20Meetup.pptx
>>>>>
>>>>> Cheers,
>>>>> Gopal
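P.S. For anyone picking this up: the quickest way I know to get a feel for the AST shape is to parse a query and print it, along the lines of the rough sketch below (illustrative only — the example query is made up, and the character-offset access through the underlying ANTLR CommonToken is worth double-checking against your Hive version):

    import org.antlr.runtime.CommonToken
    import org.apache.hadoop.hive.ql.parse.{ASTNode, ParseDriver}

    object AstExplorer {

      def main(args: Array[String]): Unit = {
        val pd = new ParseDriver
        val tree: ASTNode = pd.parse("SELECT name, COUNT(*) FROM db.people GROUP BY name")

        // LISP-style rendering, e.g. (TOK_QUERY (TOK_FROM ...) (TOK_INSERT ...))
        println(tree.toStringTree())

        // Recursive walk: ASTNode extends ANTLR's CommonTree, so children are
        // reachable through getChildCount/getChild.
        def walk(node: ASTNode, depth: Int): Unit = {
          // Character offsets in the original query live on the underlying
          // CommonToken (getStartIndex/getStopIndex), when the token carries them.
          val position = node.getToken match {
            case t: CommonToken => s"[${t.getStartIndex}..${t.getStopIndex}]"
            case _ => ""
          }
          println(("  " * depth) + node.getText + " " + position)
          for (i <- 0 until node.getChildCount) {
            walk(node.getChild(i).asInstanceOf[ASTNode], depth + 1)
          }
        }
        walk(tree, 0)
      }
    }

Those token offsets are also roughly what you would lean on to map AST nodes back to character sequences in the original query, as discussed above.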