Re: Suggestion to developers

Matt Kettler Thu, 13 Sep 2007 10:10:49 -0700

Crocomoth wrote:
> Matt Kettler-3 wrote:
>   
>>> 1. Using this method, admin must understand that the fate of every
>>> message
>>> (for all users) will depend from the single rule.
>>>       
>> Not if you set it up properly..  You can have multiple rules run with a
>> very early priority (low number), then have another one run with a
>> semi-early priority which does shortcircuiting. All of the "very early"
>> rules will be involved in the decision to shortcircuit or not.
>>
>>     
>
> Yes, but low-numbered rules may not generate any points and the decision may
> depend on one rule anyway. This does not change anything. And what is
> more (see (2) with which you have agreed), in default configuration, this
> will be bayes which generates only 3.5 points (not taking into account
> white/black lists because they will not be set up properly in most cases).
> And, I think, number of persons not wishing to reorder standard rules will
> be much more than "semi-professional" admins.
>


True, but your automated method of sorting rules by "weight" would
pretty much grind SpamAssassin to a screeching halt: it forces multiple
passes through the message, which increases the average scan time.
Not to mention the false positive problems if negative-scoring rules end
up being considered "heavy" and don't get run.

Your idea essentially ruins any benefits of memory caching that
SpamAssassin currently exploits. Right now, rules are run in groups
based on what part of the message they need. This lends speed to
SpamAssassin by allowing that portion of the message to already be in
cache for all but the first rule in the group.

If you start jumping around all over the message for different rules,
the processor's memory cache quickly fills up and evicts parts that
you're going to be looking at again. If you keep going back and forth
(header, body, header, body, header, body...) you wind up going out to
RAM quite often, and that's painfully slow. (I don't care what
high-speed dual-channel DDR2 memory setup you have, it's abysmally slow
from the processor's perspective, generally 20 times slower than cache.)
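
A toy sketch of that access pattern, in Python rather than anything taken
from the SpamAssassin source. The rule names, scores, and the
count_part_switches helper are all invented for illustration, and counting
how often consecutive rules read a different part of the message is only a
crude stand-in for cache misses:

```python
from itertools import groupby

# Hypothetical rules: (name, message part the rule reads, score).
RULES = [
    ("H_FORGED", "header", 3.0),
    ("B_PILLS", "body", 2.5),
    ("H_BAD_DATE", "header", 2.0),
    ("B_LOTTERY", "body", 1.5),
    ("H_WHITELIST", "header", -5.0),
    ("B_SIGNATURE", "body", -0.5),
]


def count_part_switches(rules):
    """Count how often consecutive rules read a different message part."""
    parts = (part for _name, part, _score in rules)
    runs = sum(1 for _ in groupby(parts))  # consecutive same-part runs
    return max(runs - 1, 0)


# Grouped by message part: each part is visited once.
grouped = sorted(RULES, key=lambda r: r[1])

# Sorted by "weight" (score, heaviest first): parts end up interleaved.
by_weight = sorted(RULES, key=lambda r: r[2], reverse=True)

print("grouped by part:", count_part_switches(grouped))
print("sorted by weight:", count_part_switches(by_weight))
```

With these made-up rules the grouped order changes part once, while the
weight-sorted order changes part four times for the same six rules.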

Sure, some messages will bail out faster, but most messages will take
much longer to scan. How is that better?

I don't debate that the basic idea of having SA do this "automagically"
would be a great thing. However, the reality of doing it efficiently is
much trickier than you think.

At one point, one idea was to run all the negative scoring rules, and
then run the positive scoring ones, and bail out if the score went over
the spam threshold during the positive phase.

The end result of that test was abysmally slow, due to having to scan
the message in two passes (negative header, negative body, positive
header, positive body).
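
Roughly what that ordering looks like, as a Python sketch under stated
assumptions: the rules, scores, threshold of 5.0, and the helper names
(by_part, two_phase, passes, scan_two_phase) are hypothetical and are not
the code that was actually tested. It only shows why the negative-first
order doubles the number of passes even when the bail-out triggers:

```python
from itertools import groupby

PART_ORDER = {"header": 0, "body": 1}

# Hypothetical rules: (name, message part, score).
RULES = [
    ("H_WHITELIST", "header", -5.0),
    ("B_SIGNATURE", "body", -0.5),
    ("H_FORGED", "header", 3.0),
    ("H_BAD_DATE", "header", 2.0),
    ("B_PILLS", "body", 2.5),
    ("B_LOTTERY", "body", 1.5),
]


def by_part(rules):
    """Group rules by message part (header first, then body)."""
    return sorted(rules, key=lambda r: PART_ORDER[r[1]])


def two_phase(rules):
    """All negative rules first, then all positive ones, each grouped by part."""
    negative = [r for r in rules if r[2] < 0]
    positive = [r for r in rules if r[2] >= 0]
    return by_part(negative) + by_part(positive)


def passes(rules):
    """Number of consecutive same-part runs, i.e. passes over message parts."""
    return sum(1 for _ in groupby(r[1] for r in rules))


def scan_two_phase(ordered, hits, threshold=5.0):
    """Run rules in two-phase order; bail out in the positive phase.

    `ordered` must come from two_phase(), so every negative rule has
    already run by the time a positive rule is reached.
    Returns (score, number of rules run).
    """
    score, run = 0.0, 0
    for name, _part, value in ordered:
        run += 1
        if name in hits:
            score += value
        if value >= 0 and score >= threshold:
            break
    return score, run


print(passes(by_part(RULES)))    # single grouping
print(passes(two_phase(RULES)))  # negative phase + positive phase
print(scan_two_phase(two_phase(RULES), {"H_FORGED", "H_BAD_DATE", "B_PILLS"}))
```

Here the plain grouping makes 2 passes and the two-phase order makes 4. The
bail-out skips the last two body rules in this example, but only after the
header and body have each been visited once already for the negative rules.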

> Sort order may be: negative rules, sorted positive common rules. Any
> user-defined rules should be checked after negative ones and before
> positives, if exists. Of course, sorting should be performed once upon load
> procedure.
Tested, as mentioned above. Resulted in horrible performance due to
over-sorting.

> Or, such a cut-off may work without any sorting; this is optional. Standard
> priorities could be enough, if they set up.
I'd agree there. SA could exploit priorities better in the default
config, but this kind of thing needs to be done very carefully to avoid
thrashing the processor cache. Any simple "sort by..." is going to result
in terrible performance.
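
For what a priority-based cut-off might look like, here is a simplified
Python sketch. It is an assumption-laden model, not SA's scheduler or its
shortcircuit mechanism: the rules, priority numbers, threshold, and the
scan_by_priority function are invented. Rules run one priority group at a
time, each group is still ordered by message part, and the bail-out decision
is only made once a whole group has finished, so every rule in the early
group (negative ones included) contributes to it:

```python
from itertools import groupby

# Hypothetical rules: (name, priority, message part, score).
# Lower priority numbers run earlier.
RULES = [
    ("H_WHITELIST", -100, "header", -5.0),
    ("H_BLACKLIST", -100, "header", 6.0),
    ("B_KNOWN_SPAM", -100, "body", 4.0),
    ("H_FORGED", 0, "header", 3.0),
    ("B_PILLS", 0, "body", 2.5),
]


def scan_by_priority(rules, hits, threshold=5.0):
    """Run rules one priority group at a time.

    Within a group, rules are ordered by message part so each part is
    visited once per group. The score is checked only between groups.
    Returns (score, names of the rules that ran).
    """
    score, ran = 0.0, []
    ordered = sorted(rules, key=lambda r: (r[1], r[2]))  # priority, then part
    for _priority, group in groupby(ordered, key=lambda r: r[1]):
        for name, _prio, _part, value in group:
            ran.append(name)
            if name in hits:
                score += value
        if score >= threshold:
            break  # cut off: later priority groups never run
    return score, ran


# Obvious spam: the early group alone crosses the threshold.
print(scan_by_priority(RULES, {"H_BLACKLIST", "B_KNOWN_SPAM"}))

# Whitelisted sender: the negative rule keeps the early score low,
# so the remaining rules still run.
print(scan_by_priority(RULES, {"H_BLACKLIST", "H_WHITELIST"}))
```

The cost is bounded by the number of priority groups: each extra group adds
at most one more visit to each message part, which is why a handful of
hand-picked priorities behaves very differently from a full sort.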
