Re: exegesis 5 question: matching negative, multi-byte strings

esp5 Tue, 01 Oct 2002 18:23:57 -0700

On Tue, Oct 01, 2002 at 06:32:07PM -0400, Mike Lambert wrote:
> > guaranteeing that the subsqls have all text up to, but not including the string
> > "union".
> >
> > I suppose I could say:
> >
> >     rule nonunion { (.*) :: { fail if ($1 =~ m"union$"); } }
> 
> What's wrong with: ?
> 
> rule getstuffbeforeunion { (.*?) union | (.*) }
> 
> "a union" => "a "
> "b" => "b"
> 
> Am I missing something here?
> 
> Mike Lambert
>


Hmm... well, it works, but it's not very efficient. It basically
scans the whole string to the end to see if there is a "union" string, and
then backtracks to take the alternative. Hence, it's not very scalable.
It also doesn't 'complexify' very well.
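
For readers who want to try the quoted pattern, here is a rough sketch in
Python. Python's re module stands in for Perl 6 rules, and the function name is
made up for illustration. It reproduces the two examples from the quoted
message and also shows the match stopping inside a quoted string, which is the
kind of case a bare non-greedy match cannot tell apart.

```python
import re

# Python analogue of: rule getstuffbeforeunion { (.*?) union | (.*) }
# re.S lets "." cross newlines; that flag is an assumption, not from the thread.
LAZY = re.compile(r"(.*?)union|(.*)", re.S)

def get_stuff_before_union(text):
    """Return the text before the first "union", or all of it if none exists."""
    m = LAZY.match(text)  # always matches: the second branch accepts anything
    # group(1) is None when the first branch failed and (.*) took over.
    return m.group(1) if m.group(1) is not None else m.group(2)

print(repr(get_stuff_before_union("a union")))
print(repr(get_stuff_before_union("b")))
# The pattern knows nothing about quoting, so it stops inside the string:
print(repr(get_stuff_before_union("select 'union all' from t")))
```

When the input has no "union", the first branch is tried at every possible
length before the second branch gets a turn, which is the extra scan described
above.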

Suppose you had a long string of text, and you wanted to 'harden' your regex
against the substring "union" appearing in double-quoted strings, single-quoted
strings, etc., without writing a SQL parser. I just don't see how to do this
with ?. I would do something like this (taking a page from Mr. Friedl's book):

rule regex_matching_sql 
{
        [
                <-[u()"']>+     : |
                <parens>        : |
                <double_string> : |
                <single_string> : |
                <non_union>
        ]*
}

rule parens
{
        \(
                [
                        <-["'()]>+      : |
                        <double_string> : |
                        <single_string> : |
                        <self> 
                ]*
        \)
}

rule single_string
{
        \' [ <-[\'\\]>+ : | \\ . ]* \'
}

rule double_string
{
        \" [ <-[\"\\]>+ : | \\ . ]* \"
}

rule non_union {  [ u <-['"()n]> | un ... | uni ... | unio ... | u$ ]* }
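
The same idea can be sketched in Python, with some liberties. Python's re has
no recursive rules, so this version swaps the grammar for a small hand-written
scanner: quoted strings are skipped with Friedl-style "unrolled loop" patterns,
and a depth counter stands in for the recursive parens rule. The names QUOTED
and text_before_union are hypothetical, and the handling of unterminated quotes
and stray parentheses is an assumption, since the rules above do not say what
should happen there.

```python
import re

# Quoted strings: a backslash escapes whatever character follows it.
QUOTED = {
    "'": re.compile(r"'[^'\\]*(?:\\.[^'\\]*)*'", re.S),
    '"': re.compile(r'"[^"\\]*(?:\\.[^"\\]*)*"', re.S),
}

def text_before_union(sql):
    """Return the text before the first "union" outside quotes and parens."""
    i, depth = 0, 0
    while i < len(sql):
        ch = sql[i]
        if ch in QUOTED:
            m = QUOTED[ch].match(sql, i)
            if m is None:
                # Assumption: an unterminated quote hides the rest of the input.
                return sql
            i = m.end()  # jump past the whole quoted string
        elif ch == "(":
            depth += 1
            i += 1
        elif ch == ")":
            depth = max(depth - 1, 0)  # assumption: tolerate a stray ")"
            i += 1
        elif depth == 0 and sql.startswith("union", i):
            return sql[:i]  # first top-level "union": stop here
        else:
            i += 1
    return sql  # no top-level "union" found

print(repr(text_before_union("select 'union' from t union select 1")))
print(repr(text_before_union("select (select 1 union select 2) union select 3")))
```

Like the rules above, the sketch does not check word boundaries, so an
identifier such as "reunion" would also end the match.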

Of course I could also be missing something, but I just don't see how to do this
with .*?. 

Ed

(ps:
As for:

        /(.*) <commit> <!{ $1 =~ rx{union} }>/

I'm not sure how that works, or whether it's very 'complexifiable'
(as per above). If it does a match against every single substring (take all
characters, look for "union", if it exists, roll back a character, do
the same thing, etc.), then this isn't good enough. The non_union
rule listed above is about as efficient as it can get; it does no backtracking,
and it keeps the common matches up front so they match first without
alternation.
)
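
As a rough illustration of the non_union idea, here is a Python sketch. It is
not a translation of the rule: instead of spelling out each prefix ("un",
"uni", "unio"), it uses a negative lookahead, and it ignores the quote and
parenthesis exclusions in the original character class. The names NON_UNION
and non_union_prefix are made up for this example.

```python
import re

# Either a run of characters that cannot start "union",
# or a single "u" that is not followed by "nion".
NON_UNION = re.compile(r"(?:[^u]+|u(?!nion))*", re.S)

def non_union_prefix(text):
    """Return the longest leading stretch of text that contains no "union"."""
    # Nothing follows the loop in the pattern, so the match never has to
    # give characters back once it has consumed them.
    return NON_UNION.match(text).group(0)

print(repr(non_union_prefix("select 1 union select 2")))
print(repr(non_union_prefix("unit uno unio")))
```

The common case (a run of non-"u" characters) is listed first, so most of the
input is consumed by one branch without trying the other.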
