Implementation of :w in regexes and other regex questions

David Romano Mon, 13 Feb 2006 16:45:03 -0800

Hello everyone,
This is my first post to the actual mailing list and not to Google Groups
(yeah, took me a bit to figure out they're not the same). I have a few
questions about the rules in Perl 6, and hopefully I'm not repeating stuff
that's already been brought up before. (I searched through the archive a bit,
but didn't see anything.)


==Question 1==
macro rxmodinternal:<x> { ... } # define your own /:x() stuff/
macro rxmodexternal:<x> { ... } # define your own m:x()/stuff/
With this, I can make my own adverbs then? Like :without, or :skip, and
describe what each does? If so, then maybe the rest of my major questions have
a very simple answer: make it yourself. If that is the case, I'll try to
figure out how to do it with pugs, if possible.

==Question 2==
I finished reading E05 and A05, and I really like the idea of the :w modifier
being able to essentially skip over certain parts of the text. Right now A05
states:

> <?ws> can't decide what to do until it sees the data. It still does the
> right thing. If not, define your own <?ws> and :w will use that.

So is :w invoking a rule that just skips whatever it matches? What I'm
wondering about is how I can create a mechanism that acts like :w, but can be
combined for nested rules. For instance, say I'm trying to pull out dates from
(HTML) text:
...
Jan had a great birthday on <B>F e b</B> 5, 2<B>00</B>3.
Her older sister, May, turned 23 on <B>Ma r</B> 5, 19<b>98</b>
Their younger sister, June, will be going home on <B >Apr</B> 5,
2<B>006</B>
April is their mother, and she's buying a car on <B>Feb< / B > 7,
2<B>0</B>06
I don't know when Roger, their father, is going to buy his guitar.
...

The grammar becomes messy when I have to account for things that the rules
don't allow me to just easily skip:
grammar Date {
    rule tag_B_beg:w:i { \<B\> }
    rule tag_B_end:w:i { \<\/B\> }
    rule tag_B:w:i { <tag_B_beg>|<tag_B_end> }

    rule month_english:w:i {
          J<sp>*a<sp>*n       | F<sp>*e<sp>*b | M<sp>*a<sp>*r
        | A<sp>*p<sp>*r       | M<sp>*a<sp>*y | J<sp>*u<sp>*n<sp>*e
        | J<sp>*u<sp>*l<sp>*y | A<sp>*u<sp>*g | S<sp>*e<sp>*p
        | O<sp>*c<sp>*t       | N<sp>*o<sp>*v | D<sp>*e<sp>*c
    }

    rule year:w:i { (\d<tag_B>?\d<tag_B>?\d<tag_B>?\d) }
    rule month:w:i { <after <tag_B_beg> > (<month_english>) <before <tag_B> > }

    rule day {
        <after <month> >
        ( <after <[1..2]> >? <[1..9]> | 3<[0..1]> ) <sp>+
        <before <year> >
    }

    rule date { <month> <day> <year> }

}

I don't want to just skip <B> tags wholly, because they do serve a purpose,
but only in a particular context. (Can <?ws> be changed back to a "default" if
changed to include html tags?) I was thinking about maybe using a closure
at the beginning of the rule (to change the string about to be processed) and
then a closure at the end of the rule (to change it back to its pre-processed
form) to make it work:
grammar Date {
    rule tag_B_beg:w:i { \<B\> }
    rule tag_B_end:w:i { \<\/B\> }

    rule month_english:w:i {
        { $/ ~~ s/<sp>// }
        [  Jan | Feb  | Mar  | Apr
         | May | June | July | Aug
         | Sep | Oct  | Nov  | Dec
        ]
        { $/ ~~ $/.pretext }
    }

    rule year:w:i {
        { $/ ~~ s/<tag_B_beg>|<tag_B_end>// }
        (\d{4})
        { $/ ~~ $/.pretext }
    }

    rule month:w:i { <after <tag_B_beg> > (<month_english>) <before <tag_B> > }

    rule day {
        <after <month> >
        ( <after <[1..2]> >? <[1..9]> | 3<[0..1]> ) <sp>+
        <before <year> >
    }

    rule date { <month> <day> <year> }

}

That's okay to do, right? It looks a lot cleaner to me, but I'm wondering if
there's a better way to have one rule skip whatever another rule matches (say,
another adverb like :skip, with :w being a built-in shorthand for
:skip(<?ws>)). Or am I making this more complex than it really is? Any
pointers on how to do stuff like this more simply?
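
To make the :skip idea concrete, here is a rough sketch in Python rather than
Perl 6, since the rule syntax above is still in flux. It is only an analogy:
the names SKIP_ATOM, SKIP, with_skip and find_dates are made up for
illustration, the optional comma between day and year is an assumption taken
from the sample text, and Python's re module stands in for real rules. The
idea is that a "skip" pattern is allowed between every pair of atoms, which is
roughly what :w does with <?ws>.

```python
import re

# Hypothetical stand-in for the skippable material: one whitespace character
# or one <B>/</B> tag, tolerating stray spaces inside the tag.
SKIP_ATOM = r"(?:\s|<\s*/?\s*B\s*>)"
SKIP = SKIP_ATOM + "*"

def with_skip(atoms, skip=SKIP):
    """Join regex fragments so that `skip` may match between neighbours."""
    return skip.join(atoms)

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "June",
          "July", "Aug", "Sep", "Oct", "Nov", "Dec"]

# Spell each month letter by letter so the skip pattern fits between letters.
MONTH = "(?P<month>" + "|".join(with_skip(list(m)) for m in MONTHS) + ")"
DAY = r"(?P<day>3[01]|[12]\d|[1-9])"
YEAR = "(?P<year>" + with_skip([r"\d"] * 4) + ")"

# Assumption: an optional comma separates day and year, as in the sample text.
DATE = re.compile(with_skip([MONTH, DAY, ",?", YEAR]), re.IGNORECASE)
CLEAN = re.compile(SKIP_ATOM + "+", re.IGNORECASE)

def find_dates(text):
    """Return (month, day, year) tuples with the skipped material removed."""
    return [(CLEAN.sub("", m.group("month")),
             m.group("day"),
             CLEAN.sub("", m.group("year")))
            for m in DATE.finditer(text)]

SAMPLE = """\
Jan had a great birthday on <B>F e b</B> 5, 2<B>00</B>3.
Her older sister, May, turned 23 on <B>Ma r</B> 5, 19<b>98</b>
Their younger sister, June, will be going home on <B >Apr</B> 5,
2<B>006</B>
April is their mother, and she's buying a car on <B>Feb< / B > 7,
2<B>0</B>06
I don't know when Roger, their father, is going to buy his guitar.
"""

if __name__ == "__main__":
    for found in find_dates(SAMPLE):
        print(found)
```

Because the skip pattern is only interleaved where with_skip is called, it
stays local to the rules that ask for it, and the <B> tags remain visible to
any other pattern that needs them.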

==Question 3==
I'm also curious about exclusions. Right now, to do a general exclusion, I'm
thinking I would probably do something like:
rule text_no_date {
    {$/ !~ /<date>/ }
    ^ [.*] $

}

Would something like below be easier to decode for a human reader?
rule text:without(<date>) {
    ^ [.*] $

}

If that adverb were available, then I could write a rule that rejects only
text matching two other rules at once:
rule line:without(<date>&&<name>) {
    ^^ [.*] $$

}

The rule above would match a line with a <date> or <name>, but not a line with
both. Like I said before, I don't know if this is the best way to do stuff
like this, or if I'm thinking about these problems the wrong way, so *any*
help would be great.
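
As a rough illustration of the exclusion semantics I have in mind, here is a
sketch in Python rather than Perl 6. Everything in it is an assumption made
for the example: DATE and NAME are simplified stand-ins for the <date> and
<name> rules, and lines_without and NO_DATE_LINE are hypothetical names. A
line is rejected only when every excluded pattern matches it, so a line with
neither a date nor a name is also kept.

```python
import re

# Simplified, hypothetical stand-ins for the <date> and <name> rules.
DATE = re.compile(r"\b(?:Jan|Feb|Mar|Apr|May|June|July|Aug|Sep|Oct|Nov|Dec)"
                  r"\s+\d{1,2},\s+\d{4}\b")
NAME = re.compile(r"\b(?:Jan|May|June|April|Roger)\b")

def lines_without(text, *excluded):
    """Keep each line unless every pattern in `excluded` matches it."""
    return [line for line in text.splitlines()
            if not all(p.search(line) for p in excluded)]

# The single-exclusion case can also live inside one regex as a negative
# lookahead, which is close in spirit to the closure check { $/ !~ /<date>/ }.
NO_DATE_LINE = re.compile(r"^(?!.*" + DATE.pattern + r").*$", re.MULTILINE)

TEXT = "\n".join([
    "Jan was born on Feb 5, 2003",   # name and date
    "May went home",                 # name only
    "see you on Mar 5, 1998",        # date only
    "nothing here",                  # neither
])

if __name__ == "__main__":
    print(lines_without(TEXT, DATE))         # like :without(<date>)
    print(lines_without(TEXT, DATE, NAME))   # like :without(<date>&&<name>)
    print(NO_DATE_LINE.findall(TEXT))        # lookahead form of the first call
```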

Thanks,
David
