Hi, Martin -
Thank you for your response. I like your point about the portmanteau rules (and I award you two points for using one of my favorite words in a new, yet appropriate, manner!). I never thought about scoring each rule at 0.001 or some other very low value and then tying them all together with meta-rules. It's been a while since I separated everything out, but I believe I have around 1,000 different checks (most of them portmanteau'd), so it seems like those meta-rules would just get ... messy. But it's a good idea, and I think I can especially make use of it in my specific-word list.

It's interesting that you avoid Bayes for the opposite reason that we do: we skip it because of high volume, you skip it because of low volume. Go figure.

Keeping the rules under 1-2 MB is a good rule of thumb to follow. Luckily we're nowhere near that point yet. Can I ask how many rules you have, and how many of those are meta-rules?

-----Original Message-----
From: Martin Gregorie [mailto:mar...@gregorie.org]
Sent: Wednesday, April 24, 2013 3:03 PM
To: users@spamassassin.apache.org
Subject: Re: More longer rules or fewer shorter ones?

On Wed, 2013-04-24 at 12:32 -0400, Andrew Talbot wrote:
> I have my customized deployment split up into a bunch of separate CF
> files (by category) and I have those further split up into rules based
> on score.

I also use very long rules, mainly because of spamiferous mailing lists: the headers are pretty much the same (apart from sender names), so about all you're left with for spam recognition is the body content.

I found a problem with very long rules, where for me 'very long' means "rules longer than the width of my editor's screen". I refer to these as 'portmanteau rules' (private slang).
As I hate editing anything that's longer than my editor's text line, and find it particularly annoying to deal with such a line containing a regex consisting of a lot of alternates, I wrote a portmanteau rule generator to make their maintenance a bit easier. It is a gawk script that assembles an arbitrarily long rule from a file containing rule fragments (regexes, etc.) that are each placed on a separate line. Since it sounds as though you may have a similar problem, you may also find it useful. You can find it and its documentation here:

http://www.libelle-systems.com/free/portmanteau/portmanteau.tgz

I find it particularly helpful to make the portmanteau rules fairly low scoring and to combine them into higher-scoring meta-rules. For example, if I'm trapping sales spiel I'll have a portmanteau rule listing selling phrases, one containing monetary terms, and another containing product terms and names, all scored at 0.001. I'll also have a meta-rule that ANDs these three rules together and scores around 5. This approach is much better at distinguishing spam from ham than a series of higher-scoring non-meta rules, and it has the additional benefit of recognising sales-related text in previously unseen combinations of elements in the three rules.

BTW, I don't use Bayes because my mail volume is small, I have difficulty collecting decent training corpora, and I find my current setup easier to manage.

> They are WAY longer than that (and some of them include further
> nesting of the pipe), but that's the general idea.
>
> My question is: is it better performance-wise to have the rules set up
> like this, or to have each separate thing have its own separate rule?

What JH said. When I was thinking of setting up this approach, I asked about performance and limits on the size of the generated rules and was told that I shouldn't worry about rule size until it exceeded a megabyte or two.
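[Editor's note: as a sketch of the low-score-plus-meta-rule scheme Martin describes, the three component rules and the meta-rule that ANDs them might look like this in a .cf file. Rule names and patterns here are invented for illustration, not his actual rules.]

```
body   LOCAL_SALES_PHRASES  /(?:act now|limited time|don't miss out)/i
score  LOCAL_SALES_PHRASES  0.001

body   LOCAL_MONEY_TERMS    /(?:\$\d+|percent off|discount|refund)/i
score  LOCAL_MONEY_TERMS    0.001

body   LOCAL_PRODUCT_TERMS  /(?:pills|watches|software licen[cs]es?)/i
score  LOCAL_PRODUCT_TERMS  0.001

# Fires only when all three component rules match.
meta   LOCAL_SALES_COMBO    (LOCAL_SALES_PHRASES && LOCAL_MONEY_TERMS && LOCAL_PRODUCT_TERMS)
score  LOCAL_SALES_COMBO    5.0
```

The 0.001 scores keep the component rules from swinging the verdict on their own while still recording their hits in the headers, which helps when debugging why the meta-rule did or didn't fire.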
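[Editor's note: a minimal sketch of the fragment-joining idea described above, not Martin's actual gawk script. The fragment file, the phrases in it, and the rule name MY_SALES_TERMS are all invented for illustration.]

```shell
# Each rule fragment lives on its own line in the fragment file.
printf '%s\n' 'best price' 'special offer' 'order now' > fragments.txt

# Join the fragments with '|' into one alternation and emit a
# complete SpamAssassin body rule on a single line.
awk '
    NR == 1 { body = $0; next }   # first fragment starts the list
    { body = body "|" $0 }        # later fragments become alternates
    END { printf "body MY_SALES_TERMS /(?:%s)/i\n", body }
' fragments.txt
# -> body MY_SALES_TERMS /(?:best price|special offer|order now)/i
```

Keeping one fragment per line is what makes the generated rule easy to maintain: you edit short lines, and the script worries about the pipes.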
Currently my longest rule is just over 9 KB, with the averages being just under 1 KB and 51 alternates per rule.

Martin