On Fri, 7 Nov 2003, Robert Menschel wrote:

> So you'd be suggesting something like:
> 
> body      T_SAMPLE  /(?:word1|word2|word3|word4|word5)/i
> describe  T_SAMPLE  Message has medical words frequently used in spam
> score     T_SAMPLE  0.5
> accum     T_SAMPLEA ( T_SAMPLE > 5 )
> score     T_SAMPLEA 2.0

I hadn't really worked out details, but I was thinking more along the
lines of

body    T_SAMPLE        /(?:word1|word2|word3|word4|word5)/gi

If the "g" flag is on a regex, Perl returns all the matching substrings
in list context, making it trivial to count the number of hits.
There is currently no reason in SA to use the "g" flag, so there's no
conflict; any rule can be made an accumulating rule by adding "g", and if
it's never tested in a meta-rule it simply acts as a less efficient
true/false for its own score.

Then something like

meta    T_SAMPLEA       ( T_SAMPLE > 10 )
score   T_SAMPLEA       2.0

I don't really see any good reason to accumulate half a point per hit and
then compare to 5; just count the number of hits and compare to 10.
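
To make the counting idea concrete, here is a rough sketch in Python rather than Perl. re.findall plays the role of a "g" match in list context. The names count_hits and t_samplea are made up for illustration and word1..word5 are the same placeholders as in the rule; none of this is SA code.

```python
import re

# Hypothetical stand-in for the T_SAMPLE pattern; word1..word5 are the
# placeholders from the rule above, not real SpamAssassin data.
T_SAMPLE = re.compile(r"(?:word1|word2|word3|word4|word5)", re.IGNORECASE)

def count_hits(body: str) -> int:
    """Return the number of non-overlapping matches in the whole body."""
    return len(T_SAMPLE.findall(body))

def t_samplea(body: str, threshold: int = 10) -> bool:
    """Mimic the meta rule ( T_SAMPLE > 10 ): fire only above the threshold."""
    return count_hits(body) > threshold
```

The whole body is scanned in one call and the count is compared once, which is the "count the hits and compare to 10" approach rather than summing half-points.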

Note, though, that

body    T_SAMPLE        /(?:word1|word2|word3)/gi
meta    T_SAMPLEA       ( T_SAMPLE > 2 )
score   T_SAMPLEA       2.0

would *not* be equivalent to

body    T_SAMPLE1       /word1/i
body    T_SAMPLE2       /word2/i
body    T_SAMPLE3       /word3/i
meta    T_SAMPLEA       (( T_SAMPLE1 + T_SAMPLE2 + T_SAMPLE3 ) > 2)
score   T_SAMPLEA       2.0

because the "g" version fires on 3 repeats of "word1" alone, whereas the
explicit addition fires only if every word appears at least once.
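
The difference between the two rule sets can be sketched as follows, again in Python as an analogue and not as SA code. The function names are hypothetical, and the sketch assumes a threshold of "more than 2" so that three hits are enough to fire.

```python
import re

# Placeholder words from the rules above.
WORDS = ("word1", "word2", "word3")

def counting_rule(body: str) -> bool:
    """'g'-style rule: fires when total hits across all words exceed 2."""
    pattern = re.compile("|".join(WORDS), re.IGNORECASE)
    return len(pattern.findall(body)) > 2

def summed_rule(body: str) -> bool:
    """Sum of three true/false rules: each word contributes at most 1."""
    hits = sum(1 for w in WORDS if re.search(w, body, re.IGNORECASE))
    return hits > 2
```

With the body "word1 word1 word1", counting_rule fires but summed_rule does not; with "word1 word2 word3" both fire.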

On Fri, 7 Nov 2003, Robert Menschel wrote:

> DBF> A slight modification of the above idea, rather than 'max=2.5' have
> DBF> 'maxhits=5'. IE that particular rule fires no more than 5 times and
> DBF> then the matching engine can drop it and move on to the next rule.
> 
> I like your idea -- improves efficiency by providing a "stop" point,
> while maintaining the ability to reasonably accumulate hits. Thanks.

While a good idea in theory, I rather suspect that the way the Perl regex
engine is employed would mean that it's actually MORE expensive for SA to
try to "stop" after a certain number of substrings are matched.  Except in
the case of the header fields, SA does not break the message up into
chunks for scanning such that it could stop after the first N chunks; the
entire message body is handed as one string to each regex match.

_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk