Re: Parsing a language with optional spaces

Akim Demaille Wed, 08 Jul 2020 13:15:25 -0700

Hi Christian,

> On 8 Jul 2020, at 12:15, Christian Schoenebeck <schoeneb...@crudebyte.com> 
> wrote:
> 
> On Wednesday, 8 July 2020 06:24:13 CEST Akim Demaille wrote:
>> I still think you can address this case with Flex, but I agree it's
>> going to be painful.  I would go for something like
>> 
>> sp   [ \t]*
>> do   D{sp}O
>> 
>> id   [a-zA-Z]({sp}[a-zA-Z_0-9]+)*
> 
> do 10 i = 1, n
> 
> would then be interpreted as assignment to variable 'do10i', it is a loop 
> definition though.


Yes, of course.  I was just addressing the issue of the intertwined
spaces only.
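
To make the point concrete, here is a rough Python transcription of the
three definitions quoted above, as a sketch only.  The helper names
(longest_id, squeeze) are made up for the illustration, and Python's re
module stands in for Flex.  It shows that the id pattern swallows
"do 10 i" as a single lexeme.

```python
import re

# Direct transcriptions of the Flex definitions quoted above.
SP = r"[ \t]*"
DO = rf"D{SP}O"
ID = rf"[a-zA-Z](?:{SP}[a-zA-Z_0-9]+)*"

def longest_id(text):
    """Return the prefix of `text` matched by the id pattern, or None."""
    m = re.match(ID, text)
    return m.group(0) if m else None

def squeeze(lexeme):
    """Drop the blanks interleaved in a lexeme (hypothetical helper)."""
    return re.sub(r"[ \t]+", "", lexeme)

lexeme = longest_id("do 10 i = 1, n")
print(repr(lexeme), "->", squeeze(lexeme))
```

The identifier rule alone cannot tell the loop header from a variable
name, which is exactly Christian's objection.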

> So yes, you could certainly address this to work correctly with Flex with 
> additional measures,

The case you are citing above is straightforward to handle the same
way I did with BASIC.  That's fairly simple given that these are
plain regular (rational) languages, well within the reach of Flex.

The general case, though, will not work as well.  John reported that
the grammar is

DO <number> <variable> = <expression> , <expression> [, <expression>]

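and here the problem is that the <expression>s can contain parenthesized
expressions with embedded commas.  Telling a top-level comma from a
nested one requires matching parentheses, which no regular expression
can do.  So in this case, Flex is screwed.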
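
A minimal sketch of what the decision needs, in Python rather than Flex.
The names top_level_commas, DO_HEADER and classify are invented for the
illustration, and the sketch assumes that blanks can simply be dropped
and that no string or Hollerith constant hides a comma or a parenthesis.
The parenthesis depth counter is the part a regular expression cannot
express.

```python
import re

def top_level_commas(expr):
    """Count the commas that are not nested inside parentheses."""
    depth = 0
    count = 0
    for ch in expr:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch == "," and depth == 0:
            count += 1
    return count

# Assumed shape of a blank-free DO header: DO <number> <variable> = <rest>
DO_HEADER = re.compile(r"DO(\d+)([A-Z][A-Z0-9]*)=(.+)")

def classify(stmt):
    """Return "do-loop" if stmt reads as a DO header, else "other"."""
    text = re.sub(r"[ \t]+", "", stmt).upper()
    m = DO_HEADER.fullmatch(text)
    # A DO header has two or three expressions: one or two top-level commas.
    if m and top_level_commas(m.group(3)) in (1, 2):
        return "do-loop"
    return "other"

print(classify("do 10 i = 1, n"))     # comma at top level
print(classify("do 10 i = f(1, n)"))  # comma only inside parentheses
```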

That reminds me of a paper I read long ago, someone proposing
"heisentokens": having the scanner return multiple tokens concurrently,
for the different scanning options, branched into GLR to follow the
various options and let dead ends die (that would be a second time?).

> but I think both the Fortran and BASIC examples could be solved much 
> more easily (with less complexity) and more elegantly with a monolithically 
> combined parser-scanner, as the parser could then detect keywords out of 
> the box depending on the grammar context.

I'm all in favor of merging the scanner into the parser, but in the
current case, I believe you are wrong: it is not going to be easier in
such a framework.

The syntactic context (provided by the parser to the scanner) is only
left-context, only about things we already read.  In this precise case
(FORTRAN with intertwined spaces), we need right-context, things we
have not read yet.

So merging the scanner into the parser won't solve anything, because a
lot of lookahead is needed to know in which case we are.  The hard
problem is not in the scanner, but in the parser.  LR(1) won't DO.  As
a matter of fact, given that the comma can be arbitrarily far away, no
LR(k) would DO either.
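
To illustrate why no fixed k is enough, here is a small Python sketch
(make_pair and common_prefix_len are names invented for the example, as
is the array A).  For any length you pick, it builds a DO header and an
assignment to the variable DO10I that agree on more characters than
that, and differ only at the trailing top-level comma.

```python
def make_pair(n_args):
    """Build a DO header and an assignment that share a long prefix."""
    # Hypothetical array reference with n_args subscripts: A(1,1,...,1)
    call = "A(" + ",".join(["1"] * n_args) + ")"
    assignment = "DO10I=" + call   # assignment to the variable DO10I
    loop = assignment + ",N"       # DO loop: label 10, variable I
    return loop, assignment

def common_prefix_len(a, b):
    """Number of leading characters on which a and b agree."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

for n_args in (1, 10, 100):
    loop, assignment = make_pair(n_args)
    print(n_args, common_prefix_len(loop, assignment))
```

The shared prefix grows with the number of subscripts, so whether the
leading DO is a keyword cannot be decided within any bounded window.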

It will be trivial for GLR though.


That being said, we are looking at constructs that are really
uncommon, and in the "regular" cases, having syntactic context will
make things way easier, I fully agree.
