Re: Comparing keyword handling in Babel and Core parsers

Julian Hyde Wed, 17 Jul 2024 07:48:04 -0700

By the way, upgrading JavaCC is taking a whole. First, JavaCC “improved” its 
messages, which means that we have a lot of tests to change. I’m trying to 
massage the new messages so they look more like the old ones.


Second, the parser generated for Babel is now very different from the other 
parsers. Its error messages are missing the “Was expecting …” section because 
it doesn’t have the information about tokens that can follow a rule. I fear 
that the Babel parser has a degree of ambiguity that newer JavaCC cannot 
handle. I might be wrong, and I hope I’m wrong. 

Julian

> On Jul 17, 2024, at 7:01 AM, Stamatis Zampetakis <zabe...@gmail.com> wrote:
> 
> I agree that this is not something that should prevent us from
> upgrading to a newer JavaCC. It's interesting that the generated
> parser code is different but this is something that we could
> investigate in a follow-up. Even if Babel becomes slower after the
> upgrade we cannot stay with the 2006 release forever. For sure there
> are many more benefits in upgrading to the latest JavaCC version and
> it would be great to have this in the next Calcite release.
> 
> Best,
> Stamatis
> 
>> On Wed, Jul 10, 2024 at 8:03 PM Julian Hyde <jh...@apache.org> wrote:
>> 
>> I am working on https://issues.apache.org/jira/browse/CALCITE-5541
>> (upgrading JavaCC) and the Babel parser is having problems deducing
>> whether a keyword is reserved. Investigating this, I took a look at
>> the generated code, and found something interesting.
>> 
>> Here are the NonReservedKeyWord and NonReservedKeyWord0of3 methods in
>> Babel 
>> (babel/build/javacc/javaCCMain/org/apache/calcite/sql/parser/babel/SqlBabelParserImpl.java):
>> 
>>  final public String NonReservedKeyWord() throws ParseException {
>>    if (jj_2_1116(2)) {
>>      NonReservedKeyWord0of3();
>>    } else if (jj_2_1117(2)) {
>>      NonReservedKeyWord1of3();
>>    } else if (jj_2_1118(2)) {
>>      NonReservedKeyWord2of3();
>>    } else {
>>      jj_consume_token(-1);
>>      throw new ParseException();
>>    }
>>  {if ("" != null) return unquotedIdentifier();}
>>    throw new Error("Missing return statement in function");
>>  }
>> 
>>  /** @see #NonReservedKeyWord */
>>  final public void NonReservedKeyWord0of3() throws ParseException {
>>    if (jj_2_1119(2)) {
>>      jj_consume_token(A);
>>    } else if (jj_2_1120(2)) {
>>      jj_consume_token(ACTION);
>>    } else if (jj_2_1121(2)) {
>>      jj_consume_token(ADMIN);
>>    ...
>> 
>> And here are the same methods in Core
>> (core/build/javacc/javaCCMain/org/apache/calcite/sql/parser/impl/SqlParserImpl.java):
>> 
>>  final public String NonReservedKeyWord() throws ParseException {
>>    switch ((jj_ntk==-1)?jj_ntk_f():jj_ntk) {
>>    case A:
>>    case ACTION:
>>    case ADMIN:
>>    case APPLY:
>>    ...
>>    case YEARS:{
>>      NonReservedKeyWord0of3();
>>      break;
>>      }
>>    case ABSENT:
>>   ...
>>    case ZONE:{
>>      NonReservedKeyWord1of3();
>>      break;
>>      }
>>    ...
>>    default:
>>      jj_la1[436] = jj_gen;
>>      jj_consume_token(-1);
>>      throw new ParseException();
>>    }
>>  {if ("" != null) return unquotedIdentifier();}
>>    throw new Error("Missing return statement in function");
>>  }
>> 
>>  /** @see #NonReservedKeyWord */
>>  final public void NonReservedKeyWord0of3() throws ParseException {
>>    switch ((jj_ntk==-1)?jj_ntk_f():jj_ntk) {
>>    case A:{
>>      jj_consume_token(A);
>>      break;
>>      }
>>    case ACTION:{
>>      jj_consume_token(ACTION);
>>      break;
>>      }
>>    case ADMIN:{
>>      jj_consume_token(ADMIN);
>>      break;
>>      }
>>    ...
>> 
>> Both of the above are generated using JavaCC 7.0.13. Other parsers,
>> such as Server, look similar to Core. Under JavaCC 4.0, all parsers
>> generate a 'switch'.
>> 
>> In all parsers we split the reserved keywords into 3 rules (0of3,
>> 1of3, 2of3) due to the size restrictions noted in
>> https://issues.apache.org/jira/browse/CALCITE-2405.
>> 
>> I was puzzled why one is generating a 'switch' and the other is
>> generating chained 'if'...'else-if's. At first I thought it was that
>> Babel had more keywords, but some experiments eliminated that
>> possibility. I also disproved the hypothesis that it is because Babel
>> allows extra characters in identifiers (see
>> https://issues.apache.org/jira/browse/CALCITE-5668). My current
>> hypothesis is that Babel needs to use lookahead in order to determine
>> whether a non-reserved keyword can be converted to an identifier.
>> 
>> But whatever the reason, something seems to be very different about
>> the Babel grammar. Given how frequently identifiers occur when parsing
>> SQL, I would not be surprised if the Babel parser is significantly
>> slower than the regular parser under JavaCC 7.0.13.
>> 
>> In my opinion, that is not a bug that should prevent us from upgrading
>> JavaCC. Especially given that JavaCC 4.0 has a performance bug that is
>> affecting all of our parser variants.
>> 
>> Julian

Re: Comparing keyword handling in Babel and Core parsers

Reply via email to