Re: More character matching bits

Bryan C. Warnock  Fri, 15 Jun 2001 17:19:38 -0700
On Friday 15 June 2001 06:58 pm, Dan Sugalski wrote:
> > > >module Locale::Hawaiian;
> > > >use re 'class (\w => [aeiouâêîôûhklmnpw`])';
> > > >...
> > >
> > > Sure. I expect Damian will write us something that lets you specify
> > > them upside-down in Klingon or something by the time this is done. :)
> >
> >This is handy, but this means the regexp engine needs to be *VERY*
> > dynamic at runtime.
>
> Yep. The trick is to figure out how to do this without it being expensive.
> Handy trick, that one.

use re 'cheap';  (Actually, I'm trying to grab hold of some clueage with the 
p5regexen.  Currently, it's got a hold of me.  ;-)

Here's a couple of assumptions we should probably make to draw the line 
*somewhere*.  (Anything further, I think, will cause more Harm than Good.)
Some of these are fairly obvious, and the rest are, well, probably even more 
so.

- Locales and, er, regex parameters(?) affect compilation only.  With qr//, 
for example, the regex is run under the rules it was compiled under, even if 
the actual match happens in a block where different rules are in effect.

{
    use locale 'New Haven';
    $re1 = qr/yale/i;  # Looking for a "Yale".
}

{
    use locale 'Mobile';
    $re2 = qr/yale/i;  # a-lookin' fer a "yale"
    print "yell" =~ $re1;  # No match.  $re1 runs under Eli rules
    print "yell" =~ $re2;  # A match under 'Baman rules.
}

- Regexes are oblivious to the locale of any bits interpolated in, or of the 
string it's comparing against.  (I don't know if we were even planning on 
toting around the locale a string was constructed under, but if we were, the 
regex wouldn't care.)  

- No dynamically changing the rules mid-match.  The parsing rules you start 
with are the parsing rules you finish with.  

- Limit the creation or overriding of control structures.  Ideally, all this 
overriding and creation is fundamentally nothing more than running a regex 
against the regex (like custom regexen now, only *after* interpolation), 
except we let the regexen handle the lexing and parsing, and the hooks 
create the actual opcode branch(es), instead of another string to be parsed. 
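
The first assumption above (rules are fixed at compile time) already holds in miniature in Perl 5, where a qr// object keeps the flags it was compiled with.  The following is only a sketch of that existing behavior, not of the proposed locale machinery; the variable names are made up for illustration, and /i stands in for "the rules in effect".

```perl
use strict;
use warnings;

my ($re_ci, $re_cs);
{
    # "Rules" in effect at compile time: case-insensitive.
    $re_ci = qr/yale/i;
}
{
    # Different rules at compile time: case-sensitive.
    $re_cs = qr/yale/;
}

# Match somewhere else entirely.  Each compiled piece keeps the flags it
# was built with, whatever the enclosing pattern says.
my $ci_hit = "YALE" =~ /^$re_ci$/  ? 1 : 0;   # outer pattern has no /i
my $cs_hit = "YALE" =~ /^$re_cs$/i ? 1 : 0;   # outer /i does not leak in

print "compiled with /i:    $ci_hit\n";
print "compiled without /i: $cs_hit\n";
```

The first match should succeed and the second should fail: the outer /i applies only to the parts of the pattern written at that point, not to the interpolated qr// object.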

The primary concern behind my original table, and possibly consolidating 
some of the non-standard escape sequences (\[QUL]..\E -> switch status) was 
so the lexing and parsing rules wouldn't change.  I.e., [^e-m] would always be 
CHARACTER CLASS, NEGATION, RANGE(e,m).  But instead of the regexen itself 
determining what 'e-m' means and how it is negated, it allows user-space to 
define what 'e-m' is, and how it is negated.  The regexen uses that info to 
then set up the character class.  That's all.  (With the possible exception 
of literals, which I'm nominally treating as a character class of one.)  No 
mutating a character class into a negative lookbehind assertion, for 
instance.
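
A minimal sketch of that split, assuming a hook table with `range` and `negate` entries (both names are hypothetical, as is `compile_class`).  The parse of the class is hard-wired; only the meaning of the range and of the negation comes from user-space.  The Hawaiian letter order used here is an illustrative assumption, not a real locale definition.

```perl
use strict;
use warnings;

# Fixed grammar: '[', optional '^', LO '-' HI, ']'.  Only the meaning of
# the range and of negation is delegated to the hooks.
sub compile_class {
    my ($src, $hooks) = @_;
    my ($neg, $lo, $hi) = $src =~ /\A\[(\^?)(.)-(.)\]\z/
        or die "unsupported class: $src\n";
    my %set = map { ($_ => 1) } $hooks->{range}->($lo, $hi);
    my $in  = sub { exists $set{ $_[0] } ? 1 : 0 };
    return $neg ? $hooks->{negate}->($in) : $in;
}

# Default hooks: codepoint order, plain complement.
my %default = (
    range  => sub { my ($lo, $hi) = @_; return map { chr($_) } ord($lo) .. ord($hi) },
    negate => sub { my ($in) = @_; return sub { $in->($_[0]) ? 0 : 1 } },
);

# Override: ranges follow an illustrative Hawaiian letter order
# instead of codepoint order.  Negation is inherited unchanged.
my @haw = qw(a e i o u h k l m n p w);
my %pos = map { ($haw[$_] => $_) } 0 .. $#haw;
my %hawaiian = (
    %default,
    range => sub { my ($lo, $hi) = @_; return @haw[ $pos{$lo} .. $pos{$hi} ] },
);

my $by_codepoint = compile_class('[^e-m]', \%default);
my $by_hawaiian  = compile_class('[^e-m]', \%hawaiian);

for my $ch (qw(f u z)) {
    printf "%s: default=%d hawaiian=%d\n",
        $ch, $by_codepoint->($ch), $by_hawaiian->($ch);
}
```

Both calls go through the same parse of `[^e-m]`; they differ only in which letters the hooks put between 'e' and 'm'.  Under the default hooks 'f' is inside the range, so the negated class rejects it, while under the Hawaiian ordering 'u' is inside the range and 'f' is not.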

After all, you can *still* just run a regex against it if you wanted more 
complicated behavior.  I think of it as the regex equivalent of hard versus 
symbolic references.  Regardless of whether it's me or the regexen that's 
aiming, it's still the same foot I'm shooting.  
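
For comparison, here is roughly what "just run a regex against it" looks like in Perl 5 today: rewrite the pattern source as a string, then compile the result.  The replacement class is borrowed from the quoted Locale::Hawaiian example (minus the accented vowels, to keep the sketch ASCII-only); the variable names are assumptions for illustration.

```perl
use strict;
use warnings;

# Hypothetical Hawaiian meaning for \w.
my $haw_w = '[aeiouhklmnpw`]';

# Pattern source, before compilation.
my $src = '^\w+$';

# A regex run against the regex: swap every \w for the custom class.
(my $rewritten = $src) =~ s/\\w/$haw_w/g;
my $re = qr/$rewritten/;

for my $word (qw(aloha mahalo test)) {
    printf "%-7s %s\n", $word, ($word =~ $re ? "match" : "no match");
}
```

This is the "symbolic" route: the output is another string that has to be parsed again, and a substitution this naive would also mangle an escaped backslash followed by a 'w'.  That fragility is the argument for letting the hooks build the class directly.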

With those limitations in place, it's not really much more dynamic than now, 
unless you start getting (??{ crazy, which is just naturally hairy, anyways. 

I'm sure this is a gross oversimplification, but am I close? 

-- 
Bryan C. Warnock
[EMAIL PROTECTED]