Hi Christian, > Le 8 juil. 2020 à 12:15, Christian Schoenebeck <schoeneb...@crudebyte.com> a > écrit : > > On Mittwoch, 8. Juli 2020 06:24:13 CEST Akim Demaille wrote: >> I still think you can address this case with Flex, but I agree it's >> going to be painful. I would go for something like >> >> sp [ \t]* >> do D{sp}O >> >> id [a-zA-Z]({sp}[a-zA-Z_0-9]+)* > > do 10 i = 1, n > > would then be interpreted as assignment to variable 'do10i', it is a loop > definition though.
Yes, of course. I was just addressing the issue of the intertwined spaces only. > So yes, you could certainly address this to work correctly with Flex with > additional measures, The case you are citing above is straightforward to handle the same way I did with the BASIC. That's fairly simple given that this is plain rational languages, fully under the perimeter of Flex. The general case though will not work just as well. John reported that the grammar is DO <number> <variable> = <expression> , <expression> [, <expression>] and here the problem is that the <expression>s can have parenthesized expressions with embedded commas. So on this case, Flex is screwed. That reminds me of a paper I read long ago, someone proposing "heisentokens": having the scanner return multiple tokens concurrently, for the different scanning options, branched into GLR to follow the various options and let dead ends die (that would be a second time?). > but I think both the Fortran and BASIC examples could > much easier (less complex) and elegantly be solved with a monolithicly > combined parser-scanner, as the parser could then out of the box detect > keywords depending on the grammar context. I'm all in favor of merging the scanner into the parser, but in the current case, I believe you are wrong: it is not going to be easier in such a framework. The syntactic context (provided by the parser to the scanner) is only left-context, only about things we already read. In this precise case (FORTRAN with intertwined spaces), we need right-context, things we have not read yet. So merging the scanner into the parser won't solve anything, because a lot of lookahead is needed to know in which case we are. The hard problem is not on the scanner, but on the parser. LR(1) won't DO. As a matter of fact, given that the comma can be arbitrarily far away, no LR(k) would DO either. It will be trivial for GLR though. That being said, we are looking at constructs that are really uncommon, and in the "regular" cases, having syntactic context will make thing way easier, I fully agree.