On Sat, Aug 29, 2009 at 02:45:08PM -0700, Gilbert R. Roehrbein (via RT) wrote:
> BUG:
> $ perl6
>  > say '(foo)' ~~ /'(' ~ ')' .*?/
> Unable to parse , couldn't find final ')'
> in regex PGE::Grammar::_block21 (<unknown>:1)
> called from Main (<unknown>:1)
> 
> TEST:
> ok( '(foo)' ~~ /'(' ~ ')' .*?/ )
> 
> The problem is  .*? , i think, cause  '(foo)' ~~ /'(' ~ ')' 'foo'/ 
> matches. 

Currently Synopsis 5 is a bit unclear about how backtracking interacts
with the ~ operator in regexes.  The current definition says that
something of the form

    '(' ~ ')' <expression>

gets rewritten to be something like

    '(' <expression> [ ')' || <FAILGOAL> ]

Note that there's no way to backtrack into <expression> -- once we've
matched <expression>, we either find the closing token or we fail.

So in the case of the problem regex above, we end up with

    '(' .*? [ ')' || <FAILGOAL> ]

which will match only "()": the frugal .*? first matches nothing, and
there's no possibility of backtracking into it to find longer strings
between the parens.
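
As a rough illustration, the rewritten form can be modelled in a few lines
of Python.  This is only a sketch, not PGE's or Rakudo's actual code: the
name match_committed and the exception standing in for <FAILGOAL> are
assumptions made for the example.

```python
def match_committed(s):
    """Model of  '(' .*? [ ')' || <FAILGOAL> ]  anchored at the start of s."""
    if not s.startswith("("):
        return None  # opening token not found: ordinary failure
    pos = 1
    # .*? is frugal, so its first choice is to consume nothing.  Because the
    # rewrite never backtracks into it, that first choice is also its last.
    if s.startswith(")", pos):
        return s[: pos + 1]  # the matched text
    # No closing token here and no way to retry .*?: this models <FAILGOAL>.
    raise ValueError("couldn't find final ')'")
```

Under this model match_committed("()") returns "()", while
match_committed("(foo)") raises the "couldn't find final ')'" error, which
mirrors the behaviour in the report.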

At one time I tried changing the definition of ~ so that it
could allow backtracking into the expression

    '(' [ <expression> ')' || <FAILGOAL> ]

but ISTR that I ran into some other issues there and gave up for the
time being.
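
For comparison, here is a hedged Python sketch of what that alternative
form would do if backtracking into the expression were allowed.  Again this
is only a model under assumptions (the name match_backtracking and the
exception used for <FAILGOAL> are made up for the example), not anything
Rakudo implements.

```python
def match_backtracking(s):
    """Model of  '(' [ .*? ')' || <FAILGOAL> ]  anchored at the start of s."""
    if not s.startswith("("):
        return None  # opening token not found: ordinary failure
    # A frugal .*? tries zero characters first, then one more character each
    # time the closing token fails to match and the engine backtracks.
    for n in range(len(s)):
        end = 1 + n
        if s.startswith(")", end):
            return s[: end + 1]  # shortest text that reaches a ')'
    # Every length was tried without finding ')': this models <FAILGOAL>.
    raise ValueError("couldn't find final ')'")
```

With this model match_backtracking("(foo)") returns "(foo)", and only a
string with no closing paren at all, such as "(foo", reaches the failure
branch.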

So, the short answer is that I think Rakudo is correctly following the
specification here, but we may need to tweak the specification a bit.

> The bug was introduced between June 30 and today, cause similar 
> code is used in http://github.com/krunen/xml/tree/master, last updated 
> June 30

AFAIK none of the related code has been changed between June 30 and today,
so I'm guessing something else must be happening there.  Looking at the
grammar that is given at that address now, I see

    token comment { '<!--' ~ '-->' <content> }
    token pi { '<?' ~ '?>' <content> }
    token content { .*? }

Given that these are all "token" (no backtracking), the frugal .*? in
the <content> subrule will only ever match an empty string.
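
The effect on the comment token can be sketched the same way in Python.
This is a model under assumptions rather than the grammar's real
behaviour in any given Rakudo build; the function names token_content and
token_comment are invented for the example.

```python
def token_content(s, pos):
    """Model of  token content { .*? } : frugal and ratcheted."""
    # The frugal quantifier first matches zero characters; with ratcheting
    # that choice is final, so the end position equals the start position.
    return pos


def token_comment(s):
    """Model of  token comment { '<!--' ~ '-->' <content> } ."""
    opener, closer = "<!--", "-->"
    if not s.startswith(opener):
        return None  # not a comment at all
    pos = token_content(s, len(opener))
    if s.startswith(closer, pos):
        return s[: pos + len(closer)]
    # <content> cannot be asked for a longer match, so the goal fails.
    raise ValueError("couldn't find final '-->'")
```

In this model only the empty comment "<!---->" matches; something like
"<!-- hi -->" fails to find its closing token.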

Thanks!

Pm
