Re: Suggestion for perl 6 regex syntax

Ken Fox Sat, 07 Sep 2002 14:06:34 -0700

Mr. Nobody wrote:
> /^([+-]?)(?=\d|\.\d)\d*(\.\d*)?([Ee]([+-]?\d+))?$/
> 
> would actually become longer:
> 
> /^(<[+-]>?)<before \d|\.\d>\d*(\.\d*)?(<[Ee]>(<[+-]>?\d+))?$/


Your first expression uses capturing parens, but the captures
don't bind anything useful, so you should probably compare
non-capturing versions of the regex:

/^[+-]?(?=\d|\.\d)\d*(?:\.\d*)?(?:[Ee][+-]?\d+)?$/

vs

/^<[+-]>?<before \d|\.\d>\d*[\.\d*]?[<[Ee]><[+-]>?\d+]?$/

The <[Ee]> isn't the way I'd write it in Perl 6 -- I'd shift
into case-insensitive mode temporarily because those
hand-written [Cc][Aa][Ss][Ee] insensitive matches are hard
to read.

/^<[+-]>?<before \d|\.\d>\d*[\.\d*]?[:i e<[+-]>?\d+]?$/

Now Perl 6 is just 5 characters longer. That's a horrible
pattern to read though. Can Perl 6 "fix" that? I think so.
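
For anyone who wants to check that count, here is a minimal sketch in Python (picked only because it is easy to run; the variable names are mine, not part of either Perl). It treats the two non-capturing patterns above as plain text, delimiters included, and compares their lengths.

```python
# Both patterns are copied verbatim from the message, slashes included.
# They are handled as plain text here; nothing is compiled or matched.
perl5 = r"/^[+-]?(?=\d|\.\d)\d*(?:\.\d*)?(?:[Ee][+-]?\d+)?$/"
perl6 = r"/^<[+-]>?<before \d|\.\d>\d*[\.\d*]?[:i e<[+-]>?\d+]?$/"

# Positive means the Perl 6 spelling is the longer one.
difference = len(perl6) - len(perl5)
print(len(perl5), len(perl6), difference)
```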

I'd change the <[+-]> fragments to use a sub-rule because
repeated constants make things harder to read. (Not so bad
in this case, but it's a good general rule -- and you're making
generalizations about regex syntax.)

/^<sign>?<before \d|\.\d>\d*[\.\d*]?[:i e<sign>?\d+]?$/

I'd put in some white space to clarify the different logical
pieces of the rule:

/^ <sign>? <before \d | \.\d>
    \d* [\.\d*]?
    [:i e <sign>? \d+]? $/

Now it's pretty obvious that the :i can be moved outside the
rule without screwing anything up. I'd rather have modifiers
affect the whole rule rather than remembering where they begin
and end inside it.

:i /^ <sign>? <before \d | \.\d>
       \d* [\.\d*]?
       [e <sign>? \d+]? $/

That's how I'd write your Perl 5 regex in Perl 6. (Well,
actually it's probably just /^ <number> $/, but would you
call that cheating? ;)
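
The same three steps (name the repeated fragment, add white space, hoist the case-insensitivity out of the pattern) can be tried today on the Perl 5 form of the regex. Below is a rough sketch using Python's re module, which accepts the Perl 5 syntax as-is. SIGN, NUMBER and is_number are hypothetical names of mine, and string interpolation only approximates what a real <sign> sub-rule would do.

```python
import re

# Hypothetical named fragment standing in for the <sign> sub-rule.
SIGN = r"[+-]"

# re.VERBOSE lets the pattern carry layout white space and comments;
# re.IGNORECASE plays the role of a rule-wide :i modifier.
NUMBER = re.compile(
    rf"""
    ^ {SIGN}? (?= \d | \.\d )   # optional sign; next is a digit or dot-digit
      \d* (?: \. \d* )?         # integer part and optional fraction
      (?: e {SIGN}? \d+ )? $    # optional exponent
    """,
    re.VERBOSE | re.IGNORECASE,
)


def is_number(text: str) -> bool:
    """Report whether text matches the number pattern from the thread."""
    return NUMBER.match(text) is not None
```

With the flags outside the pattern, a reader never has to track where the case-insensitive stretch starts and stops, which is the point made above about :i.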

It does have more characters than the Perl 5 regex. Looking
at it another way, it has fewer symbols. It's faster to read.
How many times are you going to write it? How many times are
you going to read it?

When I was reading A5, I was concerned about character
classes too, but mostly because of the regex style that I
learned from the Friedl book:

   opening normal* ( special normal* )* closing

which can be used to match quoted strings for example:

   /"[^"\\]*(\\.[^"\\]*)*"/

The direct Perl 6 equivalent is not very pretty:

   /" <-["\\]>* [ \\. <-["\\]>* ]* "/

It's hard to come up with a good name for the character
class used there. not_a_quote_or_backslash? special_char_in_quote?
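
To make the naming question concrete, here is a small sketch of the Friedl pattern assembled from named pieces. It uses Python's re module, which takes the Perl 5 syntax unchanged. NORMAL, SPECIAL and QUOTED are placeholder names of mine; nothing in A5 suggests them.

```python
import re

# Placeholder names; the naming question is left open in the message.
NORMAL = r'[^"\\]'    # anything except a quote or a backslash
SPECIAL = r'\\.'      # a backslash followed by any one character

# opening normal* ( special normal* )* closing
QUOTED = re.compile(rf'"{NORMAL}*(?:{SPECIAL}{NORMAL}*)*"')

print(QUOTED.fullmatch(r'"say \"hi\""') is not None)
```

Because NORMAL and SPECIAL cannot start with the same character, the pattern never has two ways to consume a backslash, which is what makes the unrolled form fast.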

I'm not concerned about it anymore because I think the
Perl 6 style will be:

   opening ( special :: | . )*? closing

The non-greedy match makes so many things easier to write,
and the backtracking control prevents a special sequence
from accidentally being re-matched by the normal alternative.

I'd write the string match in Perl 6 like this:

   /" [ \\. :: | . ] *? "/

The only possible problem with this is that non-greedy
iteration is slower. It doesn't have to be though -- and
the optimizations needed to get Perl 6 rules to match
full grammars should fix this.

If the pattern is rewritten as a grammar, we can
talk about first and follow sets.

   <quoted_string>: " <quoted_char_seq> "
 <quoted_char_seq>: <null>
                  | <quoted_char> <quoted_char_seq>
     <quoted_char>: \ <any>
                  | <any>

Non-greedy matching is slow because the
rule <quoted_char_seq> can be empty, i.e. it always
matches at the current spot. However, the follow set of
<quoted_char_seq> is the quote. That means the *only*
thing that can follow <quoted_char_seq> is a quote.
There's no point in returning (taking the <null> route)
unless the rule is looking at a quote. This reduces
backtracking tremendously.
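
The follow-set argument can be sketched as a hand-written predictive matcher. This is only an illustration in Python of what an optimizer could derive from the grammar, not how any Perl 6 implementation works; match_quoted_string is a hypothetical name.

```python
from typing import Optional


def match_quoted_string(text: str, start: int = 0) -> Optional[int]:
    """Return the index just past the closing quote, or None on failure."""
    if start >= len(text) or text[start] != '"':
        return None
    pos = start + 1
    while pos < len(text):
        ch = text[pos]
        if ch == '"':
            # Follow set of <quoted_char_seq>: the <null> route is taken
            # here, and only here.
            return pos + 1
        if ch == "\\":
            # Backslash alternative of <quoted_char>: consume two characters.
            if pos + 1 >= len(text):
                return None      # dangling backslash, no <any> left
            pos += 2
        else:
            pos += 1             # plain <any>
    return None                  # input ended before a closing quote
```

The loop looks at one character to decide what to do and never revisits a position, so the cost is linear in the length of the string.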

The other problem would normally be the conflict
between the first sets of the two <quoted_char> alternatives.
The backslash character is also an <any> character, so if the
backslash alternative is taken, the system has to prepare to
backtrack to the <any> alternative. The :: backtracking
control eliminates the backtracking point, so it's
impossible for an escape sequence to be re-parsed
as two separate characters.
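
The re-parsing problem is easy to reproduce with a Perl 5 style engine. The sketch below uses Python's re module, which has no :: operator, so the second pattern only emulates its effect by making the two alternatives disjoint. LOOSE and STRICT are names of mine.

```python
import re

# Overlapping alternatives: "." can also match a lone backslash.
LOOSE = re.compile(r'"(?:\\.|.)*?"')

# Disjoint alternatives: emulates the effect of "::" (not the operator
# itself) by never letting the second branch match a backslash.
STRICT = re.compile(r'"(?:\\.|[^\\])*?"')

# The final quote is escaped, so this string has no closing quote.
unterminated = r'"abc\"'

print(LOOSE.fullmatch(unterminated) is not None)
print(STRICT.fullmatch(unterminated) is not None)
```

LOOSE accepts the unterminated string because, after failing to find a closing quote, the engine backs up and re-reads the escape as a lone backslash followed by a quote. STRICT rejects it, which is the behaviour the :: version is meant to give.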

Damian wrote several good examples of Perl 5 -> Perl 6
conversions. Take a look at E5 and experiment some
more. The built-in named rules may simplify a lot of
things too -- we're going to have a much richer library
than just \d, \w, etc.

- Ken
