On Sat, Feb 28, 2004 at 02:42:47AM -0500, Austin Hastings wrote:
: Another hypothetical:
:
: Suppose you have a browser (which understands "language" traits)
: or a word processor (which stores "style" and/or "font" information)
: that is storing some not-text-only string-like things via scalar
: strings+ or objectrefs.

Okay, I supposin'.  But I'd rather not call them traits, since that
already means two other things right now.  Properties is more like...

: You want to do something like "search for all occurrences of the word
: 'From:' in a heading style" or "Find all letters 'l' in french text".
:
: How do you write, and how do you code, the rule(s) for that?

Depends on how you think of the embedded objects.

: I think it could be a rule junction, as C< /all(<french>, 'l')/ >
: but that's not entirely satisfying since I don't imagine that rule
: junctions are going to be the most efficient constructs around. (But
: would a rulejunction be a valid way of searching?)

Not written like that.  At minimum you'd have to put a colon on the
front of that to make it :all.  Except that :all is already taken...

I did get a request once for & to do the opposite of | though.  And
one could make an argument that we should reserve :all, :any, :one and
:none for junctional utterances.  In which case what we currently call
:all should probably be :every or :exhaustive or something else
guaranteed to be confused with :e.  :-)

: Alternatively, there would need to be some way of inquiring about
: distributed traits. That is, a trait that wasn't actually applied to
: every single member of a list, but which was "inferred" by some magic
: accessor. (IOW, the "string" object defines a special version of the
: trait accessor method (.AUTOTRAIT anyone?) that knows how to query to
: see if there is a <french>...</french> tagset surrounding this text,
: or whatever.)
:
: With that, you could define a rule called "<french>" and one called
: "</french>" that cleverly look like XML but invoke the rules. Something
: like:
:
:     m« <french>l</french> »
:
: This has the twin virtues of (1) looking cool; and (2) being really
: self-explanatory. But, how would you code a rule pair like french
: and /french?

That...makes my head hurt.  And will probably make Perl's head hurt too.

: If that's not doable, is there some other way, especially some
: variable way, of checking for "traits" at the same time you're matching
: patterns? (I.e., $language instead of <french>)

If embedded objects are just considered strange characters, and
characters are just considered strange objects, then the most
straightforward way to get object/character properties with set
operations is through the mechanisms that are already there.  For
example, to find a french word using character property sets:

    /<<alpha> & <french>>+/;

Your specific example is a little more complicated.  Though of course,
since "l" is one letter, one could in this particular case write:

    /<[l] & <french>>+/;

The general solution, however, is:

    /(From\:) <( $1 ~~ /^<headingchar>+$/ )>/

Which seems a bit suboptimal.  The proposed & counterpart to | could
help here:

    / From\: & <headingchar>+ /

the point of & being that all its subpatterns have to start and stop
at the same spot, or it's not a match.  In the way it was originally
posed to me, it was a bioinformatics problem where you want to say
something like:

    /$startseq [ $seqA & $seqB ] $finalseq/

except that that's implying some scanning that the regex engine wouldn't
do by default.  You'd have to say something like:

    /$startseq [ .*? $seqA .*? & .*? $seqB .*? ] $finalseq/

And now you can see how it would be very easy to abuse & badly in terms
of performance.  The above could easily be O(n**4) unless the optimizer
was extremely cagey in factoring out the wildcards into something like:

    /$startseq .*? [ [$seqA .*? & .*? $seqB ] | [$seqB .*? & .*? $seqA ] ] .*? $finalseq/

That's still gonna stress the regex engine though.  The efficient way
to solve this particular problem, assuming that $finalseq doesn't match
everywhere, is this:

    /$startseq (.*?) $finalseq <( $1 ~~ /$seqA/ && $1 ~~ /$seqB/ )>/

It's like ordering your expensive tests after your cheap tests in

    if foo() and baz() and bar()

I suppose the regex compiler could guess that a pattern like

    A [ B & C ] D

should be tested

    if A and D and [ B & C ]

But that gets blown to smithereens if D relies on a backref to B or C.
So does any implementation that tries to turn [ B & C ] into a one-pass
state machine.

Still, just because a feature can be abused doesn't mean that it
shouldn't go in.  There's a lot to be said for being able to write
things like:

    [ <ident> & <ascii>+ ]

Now I'm supposing that & binds tighter than | as usual, so the brackets
wouldn't always be necessary:

    <ident> & <french>+ | <ident> & <swahili>+

Larry
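
For anyone who wants to poke at the semantics discussed above, here is a rough emulation in Python. It is not Perl 6 and not how any regex engine would implement &; it is only a sketch under stated assumptions. The names conj_match and find_framed are made up for illustration, IDENT and ASCII are stand-ins for <ident> and <ascii>+, and the sequences in find_framed are treated as literal strings rather than patterns. The first function shows the "start and stop at the same spot" rule by brute force, which also makes the cost visible; the second shows the cheap-frame-first approach of matching $startseq (.*?) $finalseq and only then testing the capture.

```python
import re

# Hypothetical stand-ins for <ident> and <ascii>+ (assumptions, not from the post).
IDENT = r"[^\W\d]\w*"
ASCII = r"[\x00-\x7f]+"


def conj_match(patterns, text):
    """Emulate the proposed `A & B`: all subpatterns must match the SAME span.

    Brute force over every (start, end) pair, leftmost start first and
    longest end first. Returns (start, end) or None.
    """
    compiled = [re.compile(p) for p in patterns]
    n = len(text)
    for i in range(n + 1):
        for j in range(n, i - 1, -1):
            # fullmatch(text, i, j) succeeds only if the pattern covers text[i:j] exactly
            if all(c.fullmatch(text, i, j) for c in compiled):
                return (i, j)
    return None


def find_framed(text, start_seq, final_seq, seq_a, seq_b):
    """Emulate /$startseq (.*?) $finalseq <( $1 has seqA and seqB )>/.

    Cheap tests first: locate the frame, then check the captured middle.
    If the check fails, try a later final_seq (a longer capture), then a
    later start_seq, roughly as backtracking would. Literal strings only.
    """
    s = text.find(start_seq)
    while s != -1:
        inner_begin = s + len(start_seq)
        f = text.find(final_seq, inner_begin)
        while f != -1:
            inner = text[inner_begin:f]
            if seq_a in inner and seq_b in inner:
                return (s, f + len(final_seq))
            f = text.find(final_seq, f + 1)
        s = text.find(start_seq, s + 1)
    return None


if __name__ == "__main__":
    print(conj_match([IDENT, ASCII], "abc1 def"))
    print(find_framed("GGATGccTATAggCGCGttTAAcc", "ATG", "TAA", "TATA", "CGCG"))
```

The nested loops in conj_match are quadratic in the number of candidate spans before any subpattern even runs, which is the kind of cost the O(n**4) remark above is worried about; find_framed does the cheap literal scans first and only then looks inside the capture.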