Hi, Martin -
Thank you for your response. I like your point about the portmanteau rules (and I award you two points for using one of my favorite words in a new, yet appropriate, manner!). I never thought about scoring each rule at 0.001 or some other very low value and then tying them all together with meta-rules. It's been a while since I separated everything out, but I believe I have around 1,000 different checks (most of them portmanteau'd), so it seems like those meta-rules would just get ... messy. But it's a good idea, and I think I can especially make use of it in my specific-word list.

It's interesting that you avoid Bayes for the opposite reason that we do: we skip it because of high volume, you skip it because of low volume. Go figure.

Keeping the rules under 1-2 MB is a good rule of thumb to follow. Luckily we're nowhere near that point yet. Can I ask how many rules you have, and how many of those are meta-rules?

-----Original Message-----
From: Martin Gregorie [mailto:mar...@gregorie.org]
Sent: Wednesday, April 24, 2013 3:03 PM
To: users@spamassassin.apache.org
Subject: Re: More longer rules or fewer shorter ones?

On Wed, 2013-04-24 at 12:32 -0400, Andrew Talbot wrote:
> I have my customized deployment split up into a bunch of separate CF
> files (by category) and I have those further split up into rules based
> on score.

I also use very long rules, mainly because of spamiferous mailing lists: the headers are pretty much the same (apart from sender names), so about all you're left with for spam recognition is the body content.

I found a problem with very long rules, where for me 'very long' means "rules longer than the width of my editor's screen". I refer to these as 'portmanteau rules' (private slang).
As I hate editing anything that's longer than my editor's text line, and find it particularly annoying to deal with such a line containing a regex consisting of a lot of alternates, I wrote a portmanteau rule generator to make their maintenance a bit easier. It is a gawk script that assembles an arbitrarily long rule from a file containing rule fragments (regexes, etc.) that are each placed on a separate line. Since it sounds as though you may have a similar problem, you may also find it useful. You can find it and its documentation here:

http://www.libelle-systems.com/free/portmanteau/portmanteau.tgz

I find it particularly helpful to make the portmanteau rules fairly low scoring and to combine them into higher-scoring meta-rules. For example, if I'm trapping sales spiel I'll have a portmanteau rule listing selling phrases, one containing monetary terms, and another containing product terms and names, all scored at 0.001. I'll also have a meta-rule that ANDs these three rules together and scores around 5. This approach is much better at distinguishing spam from ham than a series of higher-scoring non-meta rules, and it has the additional benefit of recognising sales-related text in previously unseen combinations of elements in the three rules.

BTW, I don't use Bayes because my mail volume is small, I have difficulty collecting decent training corpora, and I find my current setup easier to manage.

> They are WAY longer than that (and some of them include further
> nesting of the pipe), but that's the general idea.
>
> My question is: is it better performance-wise to have the rules set up
> like this, or to have each separate thing have its own separate rule?

What JH said. When I was thinking of setting up this approach, I asked about performance and limits on the size of the generated rules and was told that I shouldn't worry about rule size until it exceeded a megabyte or two.
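[Editor's note: as a sketch of the low-score-plus-meta-rule scheme Martin describes, the three component rules and the meta-rule that ANDs them might look like this in a .cf file. Rule names and patterns here are invented for illustration, not his actual rules.]

```
body   LOCAL_SALES_PHRASES  /(?:act now|limited time|don't miss out)/i
score  LOCAL_SALES_PHRASES  0.001

body   LOCAL_MONEY_TERMS    /(?:\$\d+|percent off|discount|refund)/i
score  LOCAL_MONEY_TERMS    0.001

body   LOCAL_PRODUCT_TERMS  /(?:pills|watches|software licen[cs]es?)/i
score  LOCAL_PRODUCT_TERMS  0.001

# Fires only when all three component rules match.
meta   LOCAL_SALES_COMBO    (LOCAL_SALES_PHRASES && LOCAL_MONEY_TERMS && LOCAL_PRODUCT_TERMS)
score  LOCAL_SALES_COMBO    5.0
```

The 0.001 scores keep the component rules from swinging the verdict on their own while still recording their hits in the headers, which helps when debugging why the meta-rule did or didn't fire.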
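[Editor's note: a minimal sketch of the fragment-joining idea described above, not Martin's actual gawk script. The fragment file, the phrases in it, and the rule name MY_SALES_TERMS are all invented for illustration.]

```shell
# Each rule fragment lives on its own line in the fragment file.
printf '%s\n' 'best price' 'special offer' 'order now' > fragments.txt

# Join the fragments with '|' into one alternation and emit a
# complete SpamAssassin body rule on a single line.
awk '
    NR == 1 { body = $0; next }   # first fragment starts the list
    { body = body "|" $0 }        # later fragments become alternates
    END { printf "body MY_SALES_TERMS /(?:%s)/i\n", body }
' fragments.txt
# -> body MY_SALES_TERMS /(?:best price|special offer|order now)/i
```

Keeping one fragment per line is what makes the generated rule easy to maintain: you edit short lines, and the script worries about the pipes.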
Currently my longest rule is just over 9 KB, with the averages being just under 1 KB and 51 alternates per rule.

Martin