On Wed, 24 Apr 2013, Andrew Talbot wrote:
Hi again, John -
It's a good idea to add the realtime rules to the beginning of the filter. I
didn't realize that would have such an impact.
It will have *some* impact. How much will depend on how many alternates
and how complex they are. It might also be a good idea to put the
more-complex (which is not necessarily longer) alternates at the end of
the list, in the hopes something simpler will hit first.
And the (?=x) tip is a good one too; thank you for that.
That can be extended to two-letter combos as well if a single-letter
grouping doesn't reduce the rule sizes enough.
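If the word lists are generated by a script anyway, the grouping can be done mechanically. Here is a rough sketch in Python (not the Perl engine SpamAssassin uses), assuming the alternatives are plain words; the helper name `group_by_prefix` and the sample words are made up for illustration:

```python
import re
from collections import defaultdict

def group_by_prefix(words, prefix_len=1):
    """Bucket words by their first prefix_len letters and build one
    lookahead-guarded alternation per bucket (hypothetical helper)."""
    buckets = defaultdict(list)
    for word in words:
        buckets[word[:prefix_len].lower()].append(word)
    patterns = {}
    for prefix, members in sorted(buckets.items()):
        alternatives = "|".join(re.escape(m) for m in members)
        patterns[prefix] = r"\b(?=%s)(%s)" % (re.escape(prefix), alternatives)
    return patterns

words = ["garden", "gadget", "gift", "hose", "hotel"]
one_letter = group_by_prefix(words, 1)   # buckets "g" and "h"
two_letter = group_by_prefix(words, 2)   # buckets "ga", "gi" and "ho"
```

Each resulting pattern would become its own rule, so moving from one-letter to two-letter buckets trades fewer alternatives per rule for more rules.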
As far as Bayes, don't get me started! :) I work for an Email Service
Provider and about 2 million messages go through our servers every day, so
we have Bayes turned off because it would be too computationally expensive.
I could see not using Bayes because a widely varied userbase would make it
hard to train effectively, but computationally expensive? Versus long,
fairly complex REs being dynamically updated in "real time"? I suspect
that's not a valid concern; but if anyone has actual stats for how much
load Bayes adds to message processing it would be interesting to see.
How often are you updating the REs and restarting spamd? RE
recompile/reparse and SA restart on a several-times-a-day schedule may be
as much as or more load than Bayes would add.
Also: don't think solely about SA - there are lighter-weight tools if your
primary focus is poison-pill text in the message subject (which all your
examples so far have been). For example, regular expression based Subject
checks in your MTA to cause SMTP-time rejects may be more efficient
overall.
-----Original Message-----
From: John Hardin [mailto:jhar...@impsec.org]
Sent: Wednesday, April 24, 2013 1:53 PM
To: users@spamassassin.apache.org
Subject: RE: More longer rules or fewer shorter ones?
On Wed, 24 Apr 2013, Andrew Talbot wrote:
John,
Thanks for your prompt response!
A lot of the rules are big jumbles of rules we are generating in real
time and adding to as things come in. Like I said in my original
question, we have them separated into separate cf files by category,
and within those cf files they are separated by score. So we have just
absolutely gargantuan rules for (for instance) sex words that we assign a
5 to automatically.
There are also lists of specific words and phrases that we see in
real-time spam (like the *$#ing garden hose spam).
We are just tacking new rules on to the end to make them easier to
read. Our rules work properly with (this|that|theother): a rule hits if
the subject contains any one of the words.
Should we maybe have separate rules for all the phrases, since they're
longer strings? There are rules in there that are like
RULE Subject =~ /you.have.(new|waiting|blah|blah).*(ecard|message|calendar.invite|blah|blah))|(garden|new|stretchy|bendy|whatever).*(hose|vacuum|other.thing) . . .
Etc. It goes on. My syntax is terrible and obviously those aren't
the actual rules, but the point is that it's a bunch of "Or" for some
really long strings. Should I separate them out and have those long
(this|that|theother) rules be only for single words?
Simple alternations on phrases are equivalent to simple alternations on
single words with respect to the performance concerns. Performance is more
governed by the number of alternations and the presence of repetition and
.* than their raw length. You might want to limit the total number of
alternations per rule.
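If the rules come out of a generator script, capping the alternations per rule is easy to automate. A minimal sketch in Python, assuming the phrases are already valid regex fragments; the function name, the `LOCAL_SUBJ` rule-name prefix, and the score of 5 are all placeholders, not anything from the actual setup:

```python
def chunk_rules(phrases, max_alternatives, name_prefix="LOCAL_SUBJ", score=5):
    """Split phrases into rules of at most max_alternatives each and
    return the rule text as a list of lines (hypothetical naming scheme)."""
    lines = []
    for i in range(0, len(phrases), max_alternatives):
        chunk = phrases[i:i + max_alternatives]
        name = "%s_%d" % (name_prefix, i // max_alternatives + 1)
        lines.append("header %s Subject =~ /\\b(%s)/i" % (name, "|".join(chunk)))
        lines.append("score %s %d" % (name, score))
        lines.append("describe %s Subject matches local phrase list" % name)
    return lines

phrases = ["garden.hose", "stretchy.hose", "new.ecard",
           "waiting.message", "calendar.invite"]
rules = chunk_rules(phrases, 2)   # three rules: 2 + 2 + 1 alternatives
```

Keep in mind that a subject hitting phrases in two different chunks would then collect the score twice.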
Another performance optimization would be to ensure all of the alternations
in a given rule start with the same letter, and put (?=x) before the list of
alternatives, e.g. /\b(?=x)(x1|x2|x3|x4)/, so that the engine can skip more
easily.
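The lookahead does not change what the rule matches; it only gives the engine a cheap first check. A small sketch using Python's `re` module (not Perl, so it says nothing about actual SpamAssassin timings) with made-up words and subjects, just to show the two forms hit the same subjects:

```python
import re

# Same alternatives, with and without the single-letter lookahead guard.
plain = re.compile(r"\b(sale|special|stretchy)", re.I)
guarded = re.compile(r"\b(?=s)(sale|special|stretchy)", re.I)

subjects = ["New Stretchy hose", "Stop by on Sunday", "garden hose"]

plain_hits = [bool(plain.search(s)) for s in subjects]
guarded_hits = [bool(guarded.search(s)) for s in subjects]
```

Whether the guarded form is actually faster depends on the regex engine and the data, so it is worth timing on real traffic before converting everything.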
If they are simple alternations, it also depends on how you want to score
them.
For "poison pill" words or phrases, sure, a long alternation with a high
score will be pretty efficient. I'd suggest tacking new hits onto the
*front* of the list of alternatives, though, as it's reasonable to assume a
spam run will use the same phrasing for a while, then change.
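For a generated list, putting new hits first is a one-line change. A tiny sketch in Python; `add_hit` is a hypothetical helper and the phrases are examples only:

```python
def add_hit(alternatives, new_phrase):
    """Return a new list with new_phrase at the front, dropping any
    earlier copy so the list does not grow duplicates."""
    return [new_phrase] + [a for a in alternatives if a != new_phrase]

current = ["garden.hose", "new.ecard"]
current = add_hit(current, "stretchy.hose")
current = add_hit(current, "new.ecard")   # already present: moves to front
pattern = r"\b(%s)" % "|".join(current)
```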
Alternately, should I separate out the rules with embedded pipes in
them (like in the example above)?
Yeah, avoiding nested alternatives where possible will help.
Is Bayes not catching things like this?
-----Original Message-----
From: John Hardin [mailto:jhar...@impsec.org]
Sent: Wednesday, April 24, 2013 12:58 PM
To: users@spamassassin.apache.org
Subject: Re: More longer rules or fewer shorter ones?
On Wed, 24 Apr 2013, Andrew Talbot wrote:
Hey, all -
I have my customized deployment split up into a bunch of separate CF
files (by category) and I have those further split up into rules
based on score.
So, I have a bunch of stuff like:
header RULE_1 Subject =~ /\b(this|that|theother|blah|blah)/i
score RULE_1 1
describe RULE_1 Rule 1
header RULE_2 Subject =~ /\b(foo|bar|etc)/i
score RULE_2 2
describe RULE_2 Rule 2
They are WAY longer than that (and some of them include further
nesting of the pipe), but that's the general idea.
My question is: is it better performance-wise to have the rules set
up like this, or to have each separate thing have its own separate rule?
For performance, with simple lists of variant values having no
repetition across the list e.g. (x|y|z){n,m}, if the most-likely
variants are listed first a "big" rule will (generally-speaking)
process less than a set of individual rules for each variant. The big
rule will stop trying as soon as a match for one variant is found,
whereas all of the individual rules must be tried regardless of what
other rules may have hit. RULE_1 won't try matching "that", "theother",
"blah", etc. if "this" matches.
Ignoring performance, the alternatives are *not* semantically equivalent.
Absent "tflags multiple", RULE_1 would hit only once on a subject
containing both "this" and "that" and "theother", but if you split it
up into separate rules *each* would hit. This likely would affect scoring.
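The scoring difference can be shown with a toy model. This Python sketch is not how SpamAssassin computes scores internally; it only assumes each rule that hits adds its score once, and the split rule names (RULE_1A etc.) are invented:

```python
import re

subject = "this and that and theother"

# One combined rule: contributes its score once, however many words appear.
combined = {"RULE_1": (re.compile(r"\b(this|that|theother)", re.I), 1.0)}

# Split rules: each matching rule contributes its own score.
split = {
    "RULE_1A": (re.compile(r"\bthis", re.I), 1.0),
    "RULE_1B": (re.compile(r"\bthat", re.I), 1.0),
    "RULE_1C": (re.compile(r"\btheother", re.I), 1.0),
}

def total_score(rules, text):
    """Sum the scores of every rule whose pattern is found in text."""
    return sum(score for pattern, score in rules.values() if pattern.search(text))

combined_total = total_score(combined, subject)   # one hit
split_total = total_score(split, subject)         # three hits
```

With the same per-rule score, the split version triples the contribution for this subject, so scores would need to be re-tuned after splitting.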
--
John Hardin KA7OHZ http://www.impsec.org/~jhardin/
jhar...@impsec.org FALaholic #11174 pgpk -a jhar...@impsec.org
key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
It is not the business of government to make men virtuous or
religious, or to preserve the fool from the consequences of his own
folly. -- Henry George
-----------------------------------------------------------------------
328 days since the first successful private support mission to ISS (SpaceX)