On Thursday, 9 July 2020 06:51:38 CEST Akim Demaille wrote:
> > On 8 Jul 2020, at 22:14, Akim Demaille <a...@lrde.epita.fr> wrote:
> >
> > That reminds me of a paper I read long ago, someone proposing
> > "heisentokens": having the scanner return multiple tokens concurrently,
> > for the different scanning options, branched into GLR to follow the
> > various options and let dead ends die (that would be a second time?).
>
> I couldn't find references to that for a good reason: it was referring
> to Schrödinger, not Heisenberg.
>
> https://webhome.cs.uvic.ca/~nigelh/Publications/Schrodinger.pdf
Yeah, almost what I had in mind. But here is where I disagree, quote:

> Lexical feedback
>
> The scanner can be made aware of context if the parser and scanner share
> some state. The parser uses this shared state to communicate context
> information to the scanner; this is referred to as lexical feedback
> [6,7].
>
> Lexical feedback has a number of problems in practice. It couples the
> scanner and parser tightly: not only do they share state, but the parser
> and scanner must operate in lock-step. The scanner cannot, for example,
> tokenize the entire input in a tight loop, or operate in a separate
> thread of execution, because at any moment the parser may direct it to
> change how it interprets the input.

Running scanner and parser in different threads is IMO quite exotic. And
multi-threading could still be handled on the GLR parser side for the
individual branches that fork and pop up, if that multi-threading
optimization were really necessary for an application.

> Additionally, the programmer must fully understand how and when the
> parser handles tokens, otherwise subtle bugs may be introduced.

I don't see what exactly they had in mind here to make that claim. I would
even claim the opposite: their suggestion demands a much more
developer-aware approach than lexical feedback (bidirectional communication
between parser and scanner) would require. In their example, and using
their suggested approach, a developer would need to be aware that there is
an ambiguity in the language when scanning the character sequence 'if',
which is potentially two different tokens (IF or ID), so (s)he would write:

    if    return schrodinger(IF, ID);

That's error prone. In this particular example the ambiguity might be
obvious; in other scenarios it certainly isn't.
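To make the objection concrete, the effect of the paper's schrodinger(IF, ID) action could be mimicked like this (a minimal Python sketch of the idea only, not the paper's actual implementation; the token names and the scan() helper are made up for illustration):

```python
import re

# Hypothetical token names; in the paper's setting these would be the
# parser's terminal symbols.
IF, THEN, ID = "IF", "THEN", "ID"

KEYWORDS = {"if": IF, "then": THEN}
ID_PATTERN = re.compile(r"[a-zA-Z][a-zA-Z0-9_]*")

def scan(lexeme):
    """Return the set of candidate token types for one lexeme.

    A keyword lexeme is ambiguous: it may be the keyword token or a
    plain identifier, so both candidates are returned -- that is the
    'Schrödinger token' of the paper.  A GLR parser would fork on such
    a set and let the dead branches die.
    """
    if not ID_PATTERN.fullmatch(lexeme):
        raise ValueError("not an identifier-like lexeme: " + lexeme)
    if lexeme in KEYWORDS:
        return {KEYWORDS[lexeme], ID}   # e.g. schrodinger(IF, ID)
    return {ID}

print(scan("if"))    # ambiguous: both IF and ID are possible
print(scan("foo"))   # unambiguous: only ID
```

The point of the criticism above is that the developer, not the tool, has to spot every lexeme that belongs in the KEYWORDS-style ambiguous set.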
If both lexical patterns and grammar rules are defined and handled
together, 'married' on the parser side (like I actually had in mind), the
parser could automatically take care of these ambiguities on behalf of the
developer in a very simple and intuitive way:

    %token IF
    %token THEN
    %token ID
    %glr-parser

    /* token ID -> lexical RegEx patterns */
    IF:   if
    THEN: then
    ID:   [a-zA-Z][a-zA-Z0-9_]*

    /* grammar rules */
    stmt:     ifstmt | asgnstmt
    ifstmt:   IF expr THEN stmt
    asgnstmt: ID '=' expr
    expr:     ID = ID

I find that pretty developer friendly.

Note that the latter example would of course require working differently
than the common greedy and context-unaware scanner: as a 'married' GLR
parser would now have knowledge of both the vocabulary (tokens and their
raw character patterns) and the grammar rules, it could handle the
ambiguity between the two tokens ID and IF as 'Schrödinger' tokens
automatically for the developer. It would activate only those
tokens/patterns which could produce the next possible grammar rules, and
if a context allowed both tokens, it would automatically process the two
possible tokens as that particular set of 'Schrödinger' tokens and
eventually fork the GLR state accordingly.

Besides being easier for the developer to understand and write, and the
language definition being easier to read and maintain, it would have other
benefits as well: the scanner steps could be optimized. Since today's
scanners are unaware of any grammatical context, they must assume that any
pattern may match at any time. Hence the lexical automaton typically runs
much longer, on a much more complex state machine than necessary, because
a context-unaware scanner must always assume that any of the globally
defined tokens might match, even though in a specific grammatical context
only e.g. one or two tokens could actually match.

Akim, question regarding Bison's GLR parser ...
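The "activate only the tokens the current grammar context can accept" idea can be sketched as follows (my own toy Python illustration, not Bison code; in a real implementation the expected-token sets would come from the parser's LR states rather than being passed in by hand):

```python
import re

# Hypothetical token definitions: name -> regex, as declared in the
# grammar sketch above.
TOKENS = {
    "IF":   re.compile(r"if"),
    "THEN": re.compile(r"then"),
    "ID":   re.compile(r"[a-zA-Z][a-zA-Z0-9_]*"),
}

def next_tokens(text, pos, expected):
    """Match only the tokens expected by the current parser state.

    Returns a list of (token, lexeme) candidates.  If several expected
    tokens match the input at 'pos', all of them are returned, i.e. a
    'Schrödinger' token set, and a GLR parser would fork once per
    candidate.  Tokens the grammar cannot accept here are never even
    tried, which is the scanner optimization described above.
    """
    candidates = []
    for name in expected:
        m = TOKENS[name].match(text, pos)
        if m:
            candidates.append((name, m.group()))
    return candidates

# At statement start the grammar allows IF (ifstmt) or ID (asgnstmt),
# so 'if' yields a Schrödinger set of both candidates:
print(next_tokens("if a then b = c", 0, ["IF", "ID"]))
# In a context where only THEN is grammatically possible, 'then' is
# matched unambiguously and the ID pattern is never run:
print(next_tokens("then b = c", 0, ["THEN"]))
```

The restriction to the expected set is what keeps the per-step automaton small compared with a global, context-unaware scanner that must union all token patterns.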
https://www.gnu.org/software/bison/manual/html_node/Simple-GLR-Parsers.html :

> In general, a GLR parser can take quadratic or cubic worst-case time, and
> the current Bison parser even takes exponential time and space for some
> grammars. In practice, this rarely happens, and for many grammars it is
> possible to prove that it cannot happen.

... why isn't it O(n^3)? Or more specifically, which common GLR fork
optimization(s) are still missing?

And BTW:

> The GLR parsers require a compiler for ISO C89 or later.

Tough requirement! ;-)

Best regards,
Christian Schoenebeck