Re: Semantics for regexes

Patrick R. Michaud Wed, 01 Sep 2004 13:55:18 -0700

On Wed, Sep 01, 2004 at 01:07:49PM -0700, Larry Wall wrote:
> On Wed, Sep 01, 2004 at 01:57:32PM -0400, Dan Sugalski wrote:
> : I promised Patrick this a while back but never got it, so here it is.
> : 
> : This is a list of the semantics that I see as needed for a regex 
> : engine. When we have 'em, we'll map them to string ops, and may well 
> : add in some special-case code for faster access.
> : 
> : *) extract substring
> 
> Mostly you don't want to go to the trouble of extracting substrings
> until you're forced to, because you're continually creating and
> destroying backrefs into your string, and you don't want to be
> copying characters around for that.  As long as there's some kind
> of COWish semantics to keep around an original copy of the string
> being searched, that's probably all the regex engine itself wants.


Indeed, what I've been working towards in my head is that a rule
would consist of a set of "components" which can nest and can refer to
other rules, and each component would simply keep track of its
current start and end positions of what it matches in the string,
as well as what to do if that component needs to backtrack/continue 
because a successive component is unable to achieve its part of the match
(or if we're looking for an C<:exhaustive> set of matches).  As Larry
mentioned in a previous conversation, the rules engine will need to
need to remember its state without keeping a large stack (fortunately
there appear to be a number of possible optimizations here).

So, what a rule component really wants to do is to be able to perform
comparisons and tests of substrings in-place, as opposed to extracting
them to perform the match.  Opcodes to support that would be most
helpful (they may already be there--I just haven't looked at that part
yet).

And yes, my current state of thinking is that each rule component
is a Parrot object of some sort that knows where it fits in the
rule and can pass information and control flow to/from its neighbors 
in trying to get the rule(s) to match the target string.

> : *) find first character of class X in string
> : *) find first character not of class X in string
> We're gonna run into the "what's a character?" issue here, especially
> at higher Unicode levels where what the user things of as a single
> character is really a sequence of codepoints.  
> : [...]
> : *) create new class X
> : *) add or subtract character to class X
> : *) create union|intersection|difference of two classes
> Not sure you really need opcodes for those if character classes are
> just real objects with a particular interface.

Indeed, I was thinking that character classes would just be more
instances of rule component objects, and these would have methods
for building unions/intersections/differences.  Or, the rules compiler
itself can do much of this when it constructs the character class 
components, rather than try to build them dynamically in Parrot.

> (Also, the Perl 6 parser may be based on the notion of a set of hash keys
> being treated a bit like a lex grammar, 

Yes, I think this is likely.

> (Also, minor nit, I'm not sure "find" is the right verb here and
> elsewhere.  Mostly the regex engine just wants to check the "current"
> location.  The rest is control flow.)

Yes, I'm (pardon the pun) finding that advanced "string find" operations
may not be of great importance until we start looking at some special-case
optimizations.  I'd say to wait for those opcodes until we find (sorry!)
we need them.

> I see one other potential gotcha with respect to backtracking and
> closures.  In P6, a closure can declare a hypothetical variable
> that is restored only if the closure exits "unsuccessfully".  [...]

Indeed, but I'm gonna cross that bridge when I get to it.

Pm

Re: Semantics for regexes

Reply via email to