On Fri, 7 Nov 2003, Robert Menschel wrote:

> So you'd be suggesting something like:
>
> body      T_SAMPLE   /(?:word1|word2|word3|word4|word5)/i
> describe  T_SAMPLE   Message has medical words frequently used in spam
> score     T_SAMPLE   0.5
> accum     T_SAMPLEA  ( T_SAMPLE > 5 )
> score     T_SAMPLEA  2.0

I hadn't really worked out the details, but I was thinking more along the
lines of

    body   T_SAMPLE   /(?:word1|word2|word3|word4|word5)/gi

If the "g" flag is on a regex, then in list context perl returns all the
matching substrings, making it trivial to count the number of hits. There
is currently no reason in SA to use the "g" flag, so there's no conflict;
any rule can be made an accumulating rule by adding "g", and if it's never
tested in a meta-rule it simply acts as a less efficient true/false for
its own score. Then something like

    meta   T_SAMPLEA  ( T_SAMPLE > 10 )
    score  T_SAMPLEA  2.0

I don't really see any good reason to accumulate half a point per hit and
then compare to 5; just count the number of hits and compare to 10.

Note, though, that

    body   T_SAMPLE   /(?:word1|word2|word3)/gi
    meta   T_SAMPLEA  ( T_SAMPLE > 2 )
    score  T_SAMPLEA  2.0

would *not* be equivalent to

    body   T_SAMPLE1  /word1/i
    body   T_SAMPLE2  /word2/i
    body   T_SAMPLE3  /word3/i
    meta   T_SAMPLEA  (( T_SAMPLE1 + T_SAMPLE2 + T_SAMPLE3 ) > 2)
    score  T_SAMPLEA  2.0

because the "g" version scores for 3 repeats of "word1", whereas the
explicit addition scores only if every word appears at least once.

On Fri, 7 Nov 2003, Robert Menschel wrote:

> DBF> A slight modification of the above idea, rather than 'max=2.5' have
> DBF> 'maxhits=5'. IE that particular rule fires no more than 5 times and
> DBF> then the matching engine can drop it and move on to the next rule.
>
> I like your idea -- improves efficiency by providing a "stop" point,
> while maintaining the ability to reasonably accumulate hits. Thanks.

While a good idea in theory, I rather suspect that the way the perl regex
engine is employed would make it actually MORE expensive for SA to try to
"stop" after a certain number of substrings are matched. Except in the
case of the header fields, SA does not break the message up into chunks
for scanning, so it cannot simply stop after the first N chunks; the
entire message body is handed as one string to each regex match.
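
For anyone who wants to try the "g" flag counting described above outside
of SA, here is a minimal sketch in plain Perl. It is not SpamAssassin code
and says nothing about how such a rule would actually be implemented; the
sample body text and the variable names ($body, $hits, $sum) are made up
for illustration. It only shows that a "g" match in list context yields
one element per hit, and that the count differs from a sum of separate
true/false tests when one word repeats.

```perl
use strict;
use warnings;

# Hypothetical message body; "word1" etc. are placeholders as in the rules.
my $body = "word1 foo word1 bar WORD1 baz";

# With /g in list context the match returns every matching substring.
# Assigning through an empty list in scalar context gives the hit count.
my $hits = () = $body =~ /(?:word1|word2|word3)/gi;

# Separate true/false tests: each one adds at most 1 to the sum,
# no matter how often its word occurs.
my $sum = 0;
$sum += ($body =~ /word1/i) ? 1 : 0;
$sum += ($body =~ /word2/i) ? 1 : 0;
$sum += ($body =~ /word3/i) ? 1 : 0;

# Three repeats of one word: the count is 3, the boolean sum is 1.
printf "g-flag hits: %d, boolean sum: %d\n", $hits, $sum;
```

So a threshold that the counted version passes on repeats of a single word
is only reached by the summed version when several different words appear.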
_______________________________________________
Spamassassin-talk mailing list
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk