On Thu, 11 Dec 2003 08:42:16 -0800, "Gary Funck" <[EMAIL PROTECTED]> writes:

> > 
> > > One implementation might be to convert the rewrite rules into an
> > > equivalent flex description, and let flex generate the automaton in
> > > C. Compile the C, and build a Perl binding to it.
> 
> Scott replied:
> > I considered that and did a prototype (which was useful for
> > performance estimates and automata-size estimates) before deciding
> > that it won't work well. In particular, this can't handle overlapping
> > matches or cases where more than one rule matches.
> >
> 
> OK. In your earlier example it looked as if your goal was to focus on
> rewrite rules which took a stream in and made simple transformations
> to produce a new (unobfuscated) output stream. But from above it seems
> that you would like to rewrite existing SA scoring patterns into some
> sort of automaton that will traverse the input stream in more or less a
> single pass, outputting a list of rules that were matched?

Actually, there are several things I'm doing. The first is a parallel
match compiler that can match many rules in parallel by first building
an automaton. Ignoring overlapping patterns, this can be done in one
pass (with a small amount of backtracking for ambiguous patterns; I
don't expect that issue to be a concern at all in practice). For
overlapping patterns, I would need to restart the automaton at all
offsets. I could also bound the maximum length of a match to 100 or
something.
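To make the parallel-match idea concrete, here is a hand-rolled sketch in Python's backtracking engine (not the automaton compiler itself; the rule names and patterns are invented for illustration): OR the rules into one alternation with named groups, then make a single scan and report which rule fired at each offset.

```python
import re

# Hypothetical rule set -- names and patterns invented for illustration.
rules = {
    "R_FOO": r"foo",
    "R_BARNUM": r"bar\d+",
    "R_QUX": r"qu+x",
}

# One combined pattern: the engine effectively tries every rule at each
# offset, which is what a real automaton compiler would bake into one DFA.
combined = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in rules.items()))

def match_all(text):
    """Return (rule_name, offset) for every hit in one left-to-right scan."""
    return [(m.lastgroup, m.start()) for m in combined.finditer(text)]

hits = match_all("xxfoo yy bar42 quuux")
# hits == [("R_FOO", 2), ("R_BARNUM", 9), ("R_QUX", 15)]
```

Note that finditer resumes after the end of each hit, so truly overlapping matches would still require restarting at every offset, as described above.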

A second algorithm could be based around Aho-Corasick, which should be
able to handle overlapping substring matching in a single pass with no
backtracking! This would allow matching an arbitrary number of
SUBSTRING.{,12}SUBSTRING patterns. Those are *very* frequent in SA.
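For reference, the classic Aho-Corasick construction fits in a few dozen lines. This Python sketch (not SA code, just the textbook algorithm) reports overlapping substring matches in a single left-to-right pass:

```python
from collections import deque

def build_ac(words):
    """Build an Aho-Corasick automaton: goto trie, failure links, outputs."""
    goto = [{}]        # goto[state][char] -> next state
    fail = [0]         # failure link per state
    out = [set()]      # words recognized on entering a state
    for w in words:
        s = 0
        for ch in w:
            if ch not in goto[s]:
                goto.append({}); fail.append(0); out.append(set())
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].add(w)
    # BFS to set failure links; root's children fail back to the root.
    q = deque(goto[0].values())
    while q:
        s = q.popleft()
        for ch, t in goto[s].items():
            q.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] |= out[fail[t]]   # inherit words ending at the fail state
    return goto, fail, out

def search(text, ac):
    """One pass, no backtracking; reports every (start, word), overlaps included."""
    goto, fail, out = ac
    s, hits = 0, []
    for i, ch in enumerate(text):
        while s and ch not in goto[s]:
            s = fail[s]
        s = goto[s].get(ch, 0)
        for w in out[s]:
            hits.append((i - len(w) + 1, w))
    return hits

ac = build_ac(["he", "she", "his", "hers"])
# sorted(search("ushers", ac)) == [(1, "she"), (2, "he"), (2, "hers")]
```

The overlapping "she"/"he"/"hers" hits come out of the same single scan, which is exactly the property the SUBSTRING patterns need.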

A third algorithm would be a rule transform that, given a 'base'
pattern, would construct variants. These would then be fed into one of
the two matching algorithms. Obfuscated phrase rules would be
transformed into re's and matched with the first algorithm
directly. For greater performance, an alternative would be to
canonicalize the text, then do a parallel single-pass Aho-Corasick ---
if I get around to implementing it. I won't implement this
until/unless the re automata compiler fails to scale, which from
prototypes shouldn't happen until at least 10,000 or so re's.
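A toy version of such a rule transform might look like the following; the substitution table here is invented for illustration, and a real transform would presumably be driven by the rule set rather than hard-coded:

```python
import re

# Hypothetical obfuscation table -- invented for illustration only.
SUBS = {"a": "a@4", "e": "e3", "i": "i1!", "o": "o0", "s": "s$5"}

def variant_re(base):
    """Turn a base word into a regex matching its common obfuscations."""
    parts = []
    for ch in base.lower():
        alts = SUBS.get(ch, ch)
        # one character class per letter, e.g. 'o' -> [o0]
        parts.append("[" + re.escape(alts) + "]" if len(alts) > 1 else re.escape(alts))
    return "".join(parts)

pat = re.compile(variant_re("viagra"), re.IGNORECASE)
# pat now matches "v1agr@", "V!AGRA", etc., but not "hello".
```

The generated re feeds straight into the first (automaton) matcher; the canonicalization alternative would instead normalize '@' back to 'a' and so on, then run plain Aho-Corasick.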

> Note that if you want to handle cases where more than one rule matches,
> you'll likely have to implement backtracking.

Yes. That's pretty much a given. Because almost all of the time no
rule will match, the automaton will backtrack to the default 'gobble one
character' rule and then try the whole thing again. Thus, explicitly
trying all offsets won't be much worse than this already-existing
typical case. As a practical matter, this isn't a concern.

> > It also blows up with an idiom used all over SA 'foo.{,20}bar' This
> > can be worked around by two rules:
> > 
> > foo   { mark current position as a match for foo }
> > bar   { check if current position is within 20 of marked position 
> > for foo and record a rule match if so.}
> > 
> 
> A problem with that approach is the automaton might keep scanning
> all the way to the end of the input stream rather than bounding
> the search. This leads to a lot of extra work.

Yes. A deeper problem is that .{...} is a bad idea in general.
Accepting it as a shorthand convenience syntax for patterns where
the counter is small is OK, but in general, it's dangerous.

r* can be bad from an automata-generation perspective, if carefully
abused. But from a matching perspective, it can change the matching
time from typically linear to quadratic. Fortunately, it is possible
to bound the match length (and I plan on doing so).
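The two-rule workaround quoted above, with the bound applied as an explicit position check, might be sketched like this (hypothetical helper, not SA code):

```python
import re

MAXGAP = 20  # the ',20' from the foo.{,20}bar idiom

def gapped_match(text, a="foo", b="bar", maxgap=MAXGAP):
    """True if some 'a' is followed by a 'b' within maxgap characters.

    One scan over the text: remember where the last 'a' ended, and when a
    'b' turns up, check the distance instead of encoding it in the pattern.
    """
    last_a_end = None
    for m in re.finditer(f"({re.escape(a)})|({re.escape(b)})", text):
        if m.group(1):                     # saw 'a': mark where it ended
            last_a_end = m.end()
        elif last_a_end is not None and m.start() - last_a_end <= maxgap:
            return True                    # 'b' within the bound
    return False
```

This returns as soon as a bounded pair is found, which addresses the scan-to-end-of-stream concern for the common case where a rule does fire; when no rule fires, any single-pass matcher has to read the whole stream anyway.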

> I don't know how the regex package implements the {m,n} construct,

Nondeterministic backtracking. See my post a day or two ago outlining
how the Perl RE engine works.

> but one way is to simply enumerate the various interim wild card
> characters:
>    a{1,5}b -> (a | aa | aaa | aaaa | aaaaa)b

This is how an automata compiler does it. And this is why flex blows
up on rules like 'bang.{,20}bar', which is why my prototype rewrote
them to use two re's.
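The enumeration quoted above can be mechanized; this toy helper shows the expanded pattern text growing with the counter. (The blowup flex actually hits is in DFA states, which grows much faster than the pattern text, because '.' overlaps the characters of the trailing literal.)

```python
import re

def enumerate_gap(a, b, n):
    """Expand a.{0,n}b into an explicit alternation: a(|.|..|...)b."""
    alts = "|".join("." * i for i in range(n + 1))
    return f"{a}({alts}){b}"

pat = enumerate_gap("foo", "bar", 3)     # "foo(|.|..|...)bar"
ok = re.fullmatch(pat, "fooXXbar")       # gap of 2 <= 3: matches
bad = re.fullmatch(pat, "fooXXXXbar")    # gap of 4 > 3: no match
```

Even at n=20 the pattern text is only a few hundred characters; it is the subset construction over it, not the text itself, that makes flex blow up.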

Automata can't do every pattern. On the other hand, they could do
almost all of the SA content rules, plus thousands more in parallel,
and do it orders of magnitude faster.

Scott


_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk
