On Tue, 25 Nov 2003 23:04:03 -0500, Pedro Sam <[EMAIL PROTECTED]> writes:
> On November 25, 2003 10:31 pm, Alexander Litvinov wrote: > > Heh, it seems it would be nice to make SA scan messages fatser. If I > > undersand your idea correctly, you want not to run regexp one by one, but > > write the state machine for all regepes and walk on this states by the > > mail, but... I undersand how this may be faster (liner time of the message > > size) if SA had one rexep for detecting spam. Now SA have muliplt rules > > that can be fired simultaneously. For this situation I can't imaging the > > way to write the state machine. > > hehe, we can give a state to each possible combination of HITS for > the rules. So if rules 1, 3, 5, 7 hit, we give that a state, and if > 2, 4, 6, 8 hit, we give it another state, and so on... Replace 1,2,3,4,5,6,7 with carefully chosen characters, taken from the set of all regexps to be matched in parallel. > I think they call it subset construction or something... Yes. > Speaking of which, can't we use lex or something for this? Not directly. lex/flex cannot handle overlapping patterns. Also, it does a longest match rule, so two patterns where one matches a prefix of another will only give you the longer one. I did propose flex about a year ago. Rerunning the automata at all offsets can solve the first problem. (and as the automata will usually terminate with no match, this won't be much of a degradation in performance.) I believe I have a construction that can remove the second problem. > Just out of curiosity (as I'm not a perl expert), how ARE things > done currently that is different from the DFA/NFA approach that > Scott is thinking about? Perl compiles the DFA into a sort of bytecode, and then 'interprets' it over the input string, trying all possible offsets. If it gets a failed match, it backtracks and tries an alternative. It includes optimizations to detect literal substrings and other techniques so simple regexps like: (a|aa)*[XY] won't take exponential time on something like: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaO In practice, the SA regexp's aren't the sort that will cause the perl regexp engine to be exercised except under deliberate attack. Individually, for the patterns SA uses, on a pattern-by-pattern basis, perl is very fast, but since SA uses several hundred patterns, it is several hundred times slower. Perl uses this engine because it lets them recognize irregular languages like /(['"])foo\1/ Scott ------------------------------------------------------- This SF.net email is sponsored by: SF.net Giveback Program. Does SourceForge.net help you be more productive? Does it help you create better code? SHARE THE LOVE, and help us help YOU! Click Here: http://sourceforge.net/donate/ _______________________________________________ Spamassassin-talk mailing list [EMAIL PROTECTED] https://lists.sourceforge.net/lists/listinfo/spamassassin-talk