On Thu, 11 Dec 2003 08:42:16 -0800, "Gary Funck" <[EMAIL PROTECTED]> writes:
> > > One implementation might be to convert the rewrite rules into an
> > > equivalent flex description, and let flex generate the automaton in
> > > C. Compile the C, and build a Perl binding to it.
>
> Scott replied:
> > I considered that and did a prototype (which was useful for
> > performance estimates and automata-size estimates) before deciding
> > that it won't work well. In particular, this can't handle overlapping
> > matches or cases where more than one rule matches.
>
> OK. In your earlier example it looked as if your goal was to focus on
> rewrite rules which took a stream in and made simple transformations
> to produce a new (unobfuscated) output stream. But from the above it
> seems that you would like to rewrite existing SA scoring patterns into
> some sort of automaton that will traverse the input stream in more or
> less a single pass, outputting a list of rules that were matched?

Actually, there are several things I'm doing.

The first is a parallel match compiler that can match many rules in
parallel by first building an automaton. Ignoring overlapping patterns,
this can be done in one pass (with a small amount of backtracking for
ambiguous patterns; I don't expect that issue to be a concern at all in
practice). For overlapping patterns, I would need to restart the
automaton at all offsets. I could also bound the maximum length of a
match to 100 characters or so.

A second algorithm could be based on Aho-Corasick, which should be able
to handle overlapping substring matching in a single pass with no
backtracking! This would allow matching an arbitrary number of
SUBSTRING.{,12}SUBSTRING patterns. Those are *very* frequent in SA.

A third algorithm would be a rule transform that, given a 'base'
pattern, would construct variants. These would then be fed into one of
the two matching algorithms. Obfuscated phrase rules would be
transformed into re's and matched with the first algorithm directly.
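A rough sketch of the Aho-Corasick idea, including the proximity check
that handles SUBSTRING.{,12}SUBSTRING rules in one pass (this is an
illustration only, not the actual SA or prototype code; the class and
the `near` helper are hypothetical names):

```python
from collections import deque

class AhoCorasick:
    """Minimal Aho-Corasick automaton: finds every occurrence of a set
    of substrings in a single pass, overlapping matches included."""

    def __init__(self, patterns):
        self.patterns = patterns
        self.goto = [{}]   # trie edges; node 0 is the root
        self.fail = [0]    # failure links
        self.out = [[]]    # pattern indices ending at each node
        for idx, pat in enumerate(patterns):
            node = 0
            for ch in pat:
                if ch not in self.goto[node]:
                    self.goto.append({})
                    self.fail.append(0)
                    self.out.append([])
                    self.goto[node][ch] = len(self.goto) - 1
                node = self.goto[node][ch]
            self.out[node].append(idx)
        # breadth-first pass to compute failure links
        q = deque(self.goto[0].values())
        while q:
            node = q.popleft()
            for ch, nxt in self.goto[node].items():
                q.append(nxt)
                f = self.fail[node]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[nxt] = self.goto[f].get(ch, 0)
                # a match ending here also ends every fail-link suffix
                self.out[nxt] += self.out[self.fail[nxt]]

    def find(self, text):
        """Yield (end_index, pattern_index) for every match."""
        node = 0
        for i, ch in enumerate(text):
            while node and ch not in self.goto[node]:
                node = self.fail[node]
            node = self.goto[node].get(ch, 0)
            for idx in self.out[node]:
                yield (i, idx)

def near(hits, i, j, patterns, maxgap):
    """Two-rule style proximity check: does pattern j start within
    maxgap characters after some occurrence of pattern i ends?"""
    ends_i = [e for e, idx in hits if idx == i]
    for e, idx in hits:
        if idx == j:
            start = e - len(patterns[j]) + 1
            if any(0 <= start - (ei + 1) <= maxgap for ei in ends_i):
                return True
    return False

pats = ["free", "viagra"]
hits = list(AhoCorasick(pats).find("get free  viagra now"))
assert near(hits, 0, 1, pats, 12)   # 'free.{,12}viagra' matches
```

Note that overlapping matches fall out for free: scanning "ababa" for
"aba" reports both occurrences, something a plain flex automaton can't
do.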
For greater performance, an alternative would be to canonicalize the
text, then do a parallel single-pass Aho-Corasick --- if I get around to
implementing it. I won't implement this until/unless the re automata
compiler fails to scale, which from prototypes should be at least
10,000 or so re's.

> Note that if you want to handle cases where more than one rule matches,
> you'll likely have to implement backtracking.

Yes, that's pretty much a given. Because almost all of the time every
rule won't match, the automaton will backtrack to the default 'gobble
one character' rule and then try the whole thing again. Thus, explicitly
trying all offsets won't be much worse than this already-existing
typical case. As a practical matter, this isn't a concern.

> > It also blows up with an idiom used all over SA: 'foo.{,20}bar'. This
> > can be worked around by two rules:
> >
> >   foo { mark current position as a match for foo }
> >   bar { check if current position is within 20 of marked position
> >         for foo, and record a rule match if so. }
>
> A problem with that approach is the automaton might keep scanning
> all the way to the end of the input stream rather than bounding
> the search. This leads to a lot of extra work.

Yes. A deeper problem is that .{...} is a bad idea in general. Accepting
it as a shorthand convenience syntax for patterns where the counter is
small is OK, but in general, it's dangerous. r* can be bad from an
automata-generation perspective if carefully abused, but from a matching
perspective, it can change the matching time from typically linear to
quadratic. Fortunately, it is possible to bound the match length (and I
plan on doing so).

> I don't know how the regex package implements the {m,n} construct,

Nondeterministic backtracking. See my post a day or two ago outlining
how the Perl RE engine works.

> but one way is to simply enumerate the various interim wild card
> characters:
>   a{1,5}b -> (a | aa | aaa | aaaa | aaaaa)b

This is how an automata compiler does it.
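The enumeration above can be sketched in a few lines (a toy
illustration; `expand_counted` is a hypothetical helper, not part of
any real compiler):

```python
import re

def expand_counted(atom, lo, hi):
    """Expand atom{lo,hi} into an explicit alternation, the way an
    automata compiler unrolls a bounded counted repeat."""
    return "(" + "|".join(atom * n for n in range(lo, hi + 1)) + ")"

pat = expand_counted("a", 1, 5) + "b"
assert pat == "(a|aa|aaa|aaaa|aaaaa)b"      # equivalent to a{1,5}b
assert re.fullmatch(pat, "aaab")
assert re.fullmatch("a{1,5}b", "aaab")
assert not re.fullmatch(pat, "aaaaaab")     # six a's: too many
```

The unrolled form grows linearly in the counter for a single-character
atom, which is why small counters are tolerable and large ones (or a
`.` atom, which overlaps with everything else in the rule set) blow up
the generated state tables.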
And this is why flex blows up on rules like 'bang.{,20}bar', which is
why my prototype rewrote them to use two re's.

Automata can't do every pattern. On the other hand, they can do almost
all of the SA content rules, plus thousands more in parallel, and do it
orders of magnitude faster.

Scott

_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk