On Tue, 25 Nov 2003 23:04:03 -0500, Pedro Sam <[EMAIL PROTECTED]> writes:

> On November 25, 2003 10:31 pm, Alexander Litvinov wrote:
> > Heh, it seems it would be nice to make SA scan messages fatser. If I
> > undersand 
 your idea correctly, you want not to run regexp one by one, but
> > write the state machine for all regepes and walk on this states by the
> > mail, but... I undersand how this may be faster (liner time of the message
> > size) if SA had one rexep for detecting spam. Now SA have muliplt rules
> > that can be fired simultaneously. For this situation I can't imaging the
> > way to write the state machine.
> 
> hehe, we can give a state to each possible combination of HITS for
> the rules.  So if rules 1, 3, 5, 7 hit, we give that a state, and if
> 2, 4, 6, 8 hit, we give it another state, and so on... 

Replace 1,2,3,4,5,6,7 with carefully chosen characters, taken from the
set of all regexps to be matched in parallel.

> I think they call it subset construction or something...

Yes.

> Speaking of which, can't we use lex or something for this?

Not directly. lex/flex cannot handle overlapping patterns. Also, it
does a longest match rule, so two patterns where one matches a prefix
of another will only give you the longer one. I did propose flex about
a year ago. 

Rerunning the automata at all offsets can solve the first
problem. (and as the automata will usually terminate with no match,
this won't be much of a degradation in performance.) 

I believe I have a construction that can remove the second problem. 

> Just out of curiosity (as I'm not a perl expert), how ARE things
> done currently that is different from the DFA/NFA approach that
> Scott is thinking about?

Perl compiles the DFA into a sort of bytecode, and then 'interprets'
it over the input string, trying all possible offsets. If it gets a
failed match, it backtracks and tries an alternative. It includes
optimizations to detect literal substrings and other techniques so
simple regexps like:

  (a|aa)*[XY]  won't take exponential time on something like:
  aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaO

In practice, the SA regexp's aren't the sort that will cause the perl
regexp engine to be exercised except under deliberate attack.
Individually, for the patterns SA uses, on a pattern-by-pattern basis,
perl is very fast, but since SA uses several hundred patterns, it is
several hundred times slower. Perl uses this engine because it lets
them recognize irregular languages like /(['"])foo\1/

Scott


-------------------------------------------------------
This SF.net email is sponsored by: SF.net Giveback Program.
Does SourceForge.net help you be more productive?  Does it
help you create better code?  SHARE THE LOVE, and help us help
YOU!  Click Here: http://sourceforge.net/donate/
_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk

Reply via email to