On Thursday, 9 July 2020 06:51:38 CEST Akim Demaille wrote:
> > On 8 Jul 2020, at 22:14, Akim Demaille <a...@lrde.epita.fr> wrote:
> >
> > That reminds me of a paper I read long ago, someone proposing
> > "heisentokens": having the scanner return multiple tokens concurrently,
> > for the different scanning options, branched into GLR to follow the
> > various options and let dead ends die (that would be a second time?).
>
> I couldn't find references to that for a good reason: it was referring
> to Schrödinger, not Heisenberg.
>
> https://webhome.cs.uvic.ca/~nigelh/Publications/Schrodinger.pdf
Yeah, almost what I had in mind. But here is where I disagree, quote:

> Lexical feedback
>
> The scanner can be made aware of context if the parser and scanner share
> some state. The parser uses this shared state to communicate context
> information to the scanner; this is referred to as lexical feedback
> [6,7].
>
> Lexical feedback has a number of problems in practice. It couples the
> scanner and parser tightly: not only do they share state, but the parser
> and scanner must operate in lock-step. The scanner cannot, for example,
> tokenize the entire input in a tight loop, or operate in a separate
> thread of execution, because at any moment the parser may direct it to
> change how it interprets the input.

Running scanner and parser in different threads is IMO quite exotic. And
multi-threading could still be handled on the GLR parser side for the
individual branches that fork and pop up, if that multi-threading
optimization were really necessary for an application.

> Additionally, the programmer must fully understand how and when the
> parser handles tokens, otherwise subtle bugs may be introduced.

I don't see what exactly they had in mind here to make that claim. I would
even claim the opposite: their suggestion demands a much more
developer-aware approach than lexical feedback (bidirectional communication
between parser and scanner) would require. In their example, and using
their suggested approach, a developer would need to be aware that there is
an ambiguity in the language when scanning the character sequence 'if',
which is potentially two different tokens (IF or ID), so (s)he would write:

    if    return schrodinger(IF, ID);

That's error prone. In this particular example the ambiguity might be
obvious; in other scenarios it certainly isn't.
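To make the objection concrete, the effect of the paper's schrodinger(IF, ID) action could be mimicked like this (a minimal Python sketch of the idea only, not the paper's actual implementation; the token names and the scan() helper are made up for illustration):

```python
import re

# Hypothetical token names; in the paper's setting these would be the
# parser's terminal symbols.
IF, THEN, ID = "IF", "THEN", "ID"

KEYWORDS = {"if": IF, "then": THEN}
ID_PATTERN = re.compile(r"[a-zA-Z][a-zA-Z0-9_]*")

def scan(lexeme):
    """Return the set of candidate token types for one lexeme.

    A keyword lexeme is ambiguous: it may be the keyword token or a
    plain identifier, so both candidates are returned -- that is the
    'Schrödinger token' of the paper.  A GLR parser would fork on such
    a set and let the dead branches die.
    """
    if not ID_PATTERN.fullmatch(lexeme):
        raise ValueError("not an identifier-like lexeme: " + lexeme)
    if lexeme in KEYWORDS:
        return {KEYWORDS[lexeme], ID}   # e.g. schrodinger(IF, ID)
    return {ID}

print(scan("if"))    # ambiguous: both IF and ID are possible
print(scan("foo"))   # unambiguous: only ID
```

The point of the criticism above is that the developer, not the tool, has to spot every lexeme that belongs in the KEYWORDS-style ambiguous set.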
If both lexical patterns and grammar rules are defined and handled
together, 'married' on the parser side (like I actually had in mind), the
parser could automatically take care of these ambiguities on behalf of the
developer in a very simple and intuitive way:

    %token IF
    %token THEN
    %token ID
    %glr-parser

    /* token ID -> lexical RegEx patterns */
    IF:   if
    THEN: then
    ID:   [a-zA-Z][a-zA-Z0-9_]*

    /* grammar rules */
    stmt:     ifstmt | asgnstmt
    ifstmt:   IF expr THEN stmt
    asgnstmt: ID '=' expr
    expr:     ID = ID

I find that pretty developer friendly.

Note that the latter example would of course require working differently
than the common greedy and context-unaware scanner: as a 'married' GLR
parser would now have knowledge of both the vocabulary (tokens and their
raw character patterns) and the grammar rules, it could handle the
ambiguity between the two tokens ID and IF as 'Schrödinger' tokens
automatically for the developer. It would activate only those
tokens/patterns which could produce the next possible grammar rules, and
if a context allowed both tokens, it would automatically process the two
possible tokens as that particular set of 'Schrödinger' tokens and
eventually fork the GLR state accordingly.

Besides being easier for the developer to understand and write, and the
language definition being easier to read and maintain, it would have other
benefits as well: the scanner steps could be optimized. Since today's
scanners are unaware of any grammatical context, they must assume that any
pattern may match at any time. Hence the lexical automaton typically runs
much longer, on a much more complex state machine than necessary, because
a context-unaware scanner must always assume that any of the globally
defined tokens might match, even though in a specific grammatical context
only e.g. one or two tokens could actually match.

Akim, question regarding Bison's GLR parser ...
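The "activate only the tokens the current grammar context can accept" idea can be sketched as follows (my own toy Python illustration, not Bison code; in a real implementation the expected-token sets would come from the parser's LR states rather than being passed in by hand):

```python
import re

# Hypothetical token definitions: name -> regex, as declared in the
# grammar sketch above.
TOKENS = {
    "IF":   re.compile(r"if"),
    "THEN": re.compile(r"then"),
    "ID":   re.compile(r"[a-zA-Z][a-zA-Z0-9_]*"),
}

def next_tokens(text, pos, expected):
    """Match only the tokens expected by the current parser state.

    Returns a list of (token, lexeme) candidates.  If several expected
    tokens match the input at 'pos', all of them are returned, i.e. a
    'Schrödinger' token set, and a GLR parser would fork once per
    candidate.  Tokens the grammar cannot accept here are never even
    tried, which is the scanner optimization described above.
    """
    candidates = []
    for name in expected:
        m = TOKENS[name].match(text, pos)
        if m:
            candidates.append((name, m.group()))
    return candidates

# At statement start the grammar allows IF (ifstmt) or ID (asgnstmt),
# so 'if' yields a Schrödinger set of both candidates:
print(next_tokens("if a then b = c", 0, ["IF", "ID"]))
# In a context where only THEN is grammatically possible, 'then' is
# matched unambiguously and the ID pattern is never run:
print(next_tokens("then b = c", 0, ["THEN"]))
```

The restriction to the expected set is what keeps the per-step automaton small compared with a global, context-unaware scanner that must union all token patterns.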
https://www.gnu.org/software/bison/manual/html_node/Simple-GLR-Parsers.html :

> In general, a GLR parser can take quadratic or cubic worst-case time, and
> the current Bison parser even takes exponential time and space for some
> grammars. In practice, this rarely happens, and for many grammars it is
> possible to prove that it cannot happen.

... why isn't it O(n^3)? Or more specifically, which common GLR fork
optimization(s) are still missing?

And BTW:

> The GLR parsers require a compiler for ISO C89 or later.

Tough requirement! ;-)

Best regards,
Christian Schoenebeck