Re: Distributing traits / Rule-matching group properties

Larry Wall Sat, 28 Feb 2004 12:34:36 -0800

On Sat, Feb 28, 2004 at 02:42:47AM -0500, Austin Hastings wrote:
: Another hypothetical:
: 
: Suppose you have a browser (which understands "language" traits)
: or a word processor (which stores "style" and/or "font" information)
: that is storing some not-text-only string-like things via scalar
: strings+ or objectrefs.


Okay, I supposin'.  But I'd rather not call them traits, since that
already means two other things right now.  Properties is more like...

: You want to do something like "search for all occurrences of the word
: 'From:' in a heading style" or "Find all letters 'l' in french text".
: 
: How do you write, and how do you code, the rule(s) for that?

Depends on how you think of the embedded objects.

: I think it could be a rule junction, as C< /all(<french>, 'l')/ >
: but that's not entirely satisfying since I don't imagine that rule
: junctions are going to be the most efficient constructs around. (But
: would a rulejunction be a valid way of searching?)

Not written like that.  At minimum you'd have to put a colon on the
front of that to make it :all.  Except that :all is already taken...

I did get a request once for & to do the opposite of | though.
And one could make an argument that we should reserve :all, :any,
:one and :none for junctional utterances.  In which case what we
currently call :all should probably be :every or :exhaustive or
something else guaranteed to be confused with :e.  :-)

: Alternatively, there would need to be some way of inquiring about
: distributed traits. That is, a trait that wasn't actually applied to
: every single member of a list, but which was "inferred" by some magic
: accessor. (IOW, the "string" object defines a special version of the
: trait accessor method (.AUTOTRAIT anyone?) that knows how to query to
: see if there is a <french>...</french> tagset surrounding this text,
: or whatever.)
: 
: With that, you could define a rule called "<french>" and called
: "</french>" that cleverly look like XML but invoke the rules. Something
: like:
: 
:   m« <french>l</french>  »
: 
: This has the twin virtues of (1) looking cool; and (2) being really
: self-explanatory. But, how would you code a rule pair like french
: and /french?

That...makes my head hurt.  And will probably make Perl's head hurt too.

: If that's not doable, is there some other way, especially some
: variable way, of checking for "traits" at the same time you're matching
: patterns? (I.e., $language instead of <french>)

If embedded objects are just considered strange characters, and
characters are just considered strange objects, then the most
straightforward way to get object/character properties with set
operations is through the mechanisms that are already there.
For example, to find a french word using character property sets:

    /<<alpha> & <french>>+/;
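
To make the "characters are just strange objects" idea concrete, here is a
rough Python sketch rather than Perl 6. It assumes a hypothetical model in
which every character carries a set of property names; the helpers tagged()
and runs_with() are invented for illustration and are not part of any real
API. It only mimics what a pattern like the one above would select.

```python
# Hypothetical model: each character carries a set of property names.
# Nothing here is a real Perl 6 API; it only mimics <<alpha> & <french>>+.

def tagged(text, *props):
    """Attach the same property set to every character of text."""
    return [(ch, frozenset(props)) for ch in text]

def runs_with(chars, *required):
    """Yield maximal runs of alphabetic characters carrying all required properties."""
    needed = set(required)
    run = []
    for ch, props in chars:
        if ch.isalpha() and needed <= props:
            run.append(ch)          # character is in both sets: extend the run
        elif run:
            yield "".join(run)      # run ended: emit it
            run = []
    if run:
        yield "".join(run)

doc = (tagged("hello ", "english")
       + tagged("le monde", "french")
       + tagged(" again", "english"))
print(list(runs_with(doc, "french")))
```

The intersection is taken per character, which is why the space inside
"le monde" splits the french text into two runs.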

Your specific example is a little more complicated.  Though of course,
since "l" is one letter, one could in this particular case write:

    /<[l] & <french>>+/;


The general solution, however, is:

    /(From\:) <( $1 ~~ /^<headingchar>+$/ )>/

Which seems a bit suboptimal.  The proposed & counterpart to | could
help here:

    / From\: & <headingchar>+ /

the point of & being that all its subpatterns have to start and stop
at the same spot, or it's not a match.  In the way it was originally
posed to me, it was a bioinformatics problem where you want to say
something like:

    /$startseq [ $seqA & $seqB ] $finalseq/

except that that's implying some scanning that the regex engine wouldn't
do by default.  You'd have to say something like:

    /$startseq [ .*? $seqA .*? & .*? $seqB .*? ] $finalseq/

And now you can see how it would be very easy to abuse & badly in terms
of performance.  The above could easily be O(n**4) unless the optimizer
was extremely cagey in factoring out the wildcards into something like:

    /$startseq .*? [
            [$seqA .*? & .*? $seqB ] |
            [$seqB .*? & .*? $seqA ]
        ] .*? $finalseq/

That's still gonna stress the regex engine though.  The efficient way
to solve this particular problem, assuming that $finalseq doesn't
match everywhere, is this:

    /$startseq (.*?) $finalseq <( $1 ~~ /$seqA/ && $1 ~~ /$seqB/ )>/

It's like ordering your expensive tests after your cheap tests in

    if foo() and baz() and bar()
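
The same ordering can be sketched in Python, as an illustration only. The
function name find_region and the use of literal strings in place of the
$startseq, $finalseq, $seqA and $seqB patterns are assumptions made to keep
the example short. The cheap delimiters are located first, and the two
containment tests run only on the captured middle; if they fail, the search
moves on to a later end delimiter, roughly the way a failed assertion would
force .*? to extend.

```python
def find_region(text, start, final, seq_a, seq_b):
    """Return the first start...final span whose middle contains both sequences."""
    s = text.find(start)
    while s != -1:
        mid_begin = s + len(start)
        f = text.find(final, mid_begin)
        while f != -1:
            middle = text[mid_begin:f]
            # Expensive tests last, and only on the captured middle.
            if seq_a in middle and seq_b in middle:
                return text[s:f + len(final)]
            # Emulate backtracking: let the lazy gap grow to the next 'final'.
            f = text.find(final, f + 1)
        s = text.find(start, s + 1)
    return None

dna = "xxATGaaaTAGbbbCCCdddGGGeeeTAGxx"
print(find_region(dna, "ATG", "TAG", "CCC", "GGG"))
```

Here the first TAG is rejected because the middle "aaa" contains neither
sequence, so the search extends to the second TAG.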

I suppose the regex compiler could guess that a pattern like

    A [ B & C ] D

should be tested

    if A and D and [ B & C ]

But that gets blown to smithereens if D relies on a backref to B or C.
So does any implementation that tries to turn [ B & C ] into a one-pass
state machine.

Still, just because a feature can be abused doesn't mean that it
shouldn't go in.  There's a lot to be said for being able to write
things like:

    [ <ident> & <ascii>+ ]
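
For readers who want to experiment with the "start and stop at the same
spot" requirement, here is a small Python emulation. Python's re module has
no & operator, so the helper conj_match and the IDENT and ASCII patterns
below are stand-ins invented for this sketch, not a rendering of how a rule
engine would implement it. It tries every candidate end point, longest
first, and accepts one only if all subpatterns match exactly that span.

```python
import re

IDENT = r"[^\W\d]\w*"      # assumed stand-in for <ident>
ASCII = r"[\x00-\x7f]+"    # assumed stand-in for <ascii>+

def conj_match(text, pos, *patterns):
    """Return the longest end offset where every pattern matches text[pos:end] exactly."""
    compiled = [re.compile(p) for p in patterns]
    for end in range(len(text), pos - 1, -1):
        # All subpatterns must cover the same span, start to stop.
        if all(c.fullmatch(text, pos, end) for c in compiled):
            return end
    return None

print(conj_match("count = 1", 0, IDENT, ASCII))
print(conj_match("naïve", 0, IDENT, ASCII))
print(conj_match("1abc", 0, IDENT, ASCII))
```

The brute-force loop over end points is deliberately naive; it also gives a
feel for why an unoptimized conjunction can get expensive.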

Now I'm supposing that & binds tighter than | as usual, so the
brackets wouldn't always be necessary:

    <ident> & <french>+
    | 
    <ident> & <swahili>+

Larry
