Ryan Kather wrote:
I'll answer some parts...
Yes, from a purely testing perspective. I don't have that liberty, since I am testing on live production. I suppose I could run all received messages for all users through all filters and then only deliver to those users who have opted into the various test breakdowns. I'll look into this, as I think it would give a better picture of the accuracy than a subset of my users would.
One thing you could also do is test a combination of SA+dspam:
- dspam per user (with or without initial training); no groups.
- SA (either with a site-wide Bayes or with no Bayes).
- SA is used to train dspam when the SA score is "sure" (< 0 or > 8, for
example). This would implement autolearn for dspam based on SA.
- Additionally, you can skip SA if the dspam confidence is > 0.7, for
example (or if the user dictionary is mature).
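As a sketch, the policy above could look like the following. The function names (classify_dspam, score_sa, train_dspam) are stand-ins for calls to the real filters, not their actual APIs; the thresholds mirror the ones suggested (dspam confidence 0.7, SA "sure" scores < 0 or > 8).

```python
# Hypothetical glue logic for the SA+dspam combination described above.
SA_SURE_HAM = 0.0      # below this, SA is "sure" the mail is ham
SA_SURE_SPAM = 8.0     # above this, SA is "sure" the mail is spam
DSPAM_CONFIDENT = 0.7  # above this, trust dspam and skip SA

def filter_message(msg, classify_dspam, score_sa, train_dspam):
    """Return 'spam' or 'ham', training dspam from SA when SA is sure."""
    verdict, confidence = classify_dspam(msg)   # e.g. ('spam', 0.92)
    if confidence > DSPAM_CONFIDENT:
        return verdict                          # skip SA entirely
    score = score_sa(msg)
    if score < SA_SURE_HAM:
        train_dspam(msg, 'ham')                 # autolearn ham from SA
        return 'ham'
    if score > SA_SURE_SPAM:
        train_dspam(msg, 'spam')                # autolearn spam from SA
        return 'spam'
    # neither filter is sure: fall back to dspam's best guess
    return verdict
```

This keeps the expensive SA pass off mail that dspam is already confident about, while letting SA's sure verdicts train the per-user dspam dictionary.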
My experience is that most spam is detected by both, but:
- sometimes one of the filters helps detect an FP;
- sometimes one of the filters detects spam that the other doesn't
(dspam can't do URIBL lookups, for example).
Interesting. I hadn't thought to choose delivery with LDAP instead of transports. I suppose this would eliminate my need for multiple Postfix instances. Good suggestion, I'll have to look into this.
No. LDAP is a lookup method, not a Postfix replacement. You still need
multiple instances because you need different transport configurations,
whether you use mysql, ldap, pgsql, hash, etc.
FILTER doesn't work because of multi-recipient mail: only one filter is
applied per message, not per recipient.
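A sketch of what the multi-instance setup might look like (paths, the second config directory, and the filter endpoints are illustrative): each instance carries its own content_filter in its own main.cf, so each user population gets a different filter chain even for multi-recipient mail.

```
# Instance 1 -- /etc/postfix/main.cf (illustrative): dspam-only users
content_filter = dspam:unix:/var/run/dspam.sock

# Instance 2 -- /etc/postfix-sa/main.cf (illustrative): SA+dspam users
content_filter = smtp-amavis:[127.0.0.1]:10024
```

Each instance would also need its own queue_directory and its own listening address or port, and the second instance would be managed with "postfix -c /etc/postfix-sa".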
Trap accounts are great, but I always worry that they get different spam than real accounts and pollute the Bayesian database. Has anyone experienced this? Also, how does SpamAssassin deal with the Bayes-poisoning attempts seen recently (spam emails with garbage text in them)?
I can't say; I don't entirely trust traps either. But in my experiments, they
seem to improve accuracy for "immature" users. Spammers could attack
traps (by posting "ham text"), but I haven't seen that yet.
I could pretty much trust a small subset of users to be fairly
regular in their training. There is a somewhat larger portion of [...]
They might be telling less trustworthy users how to take part in the training
process and then pollute your Bayes DB. Those less careful users should
be managed with amavisd-new LDAP profiling.
LDAP profiling... I haven't seen examples of that yet. I will definitely research it.
I personally think that no user should train anything but his own DB.
This is why I like the idea of a site-wide + per-user setup: the
site-wide DB doesn't rely on people. I have tried implementing dspam
groups, but couldn't find a safe way to do it.
The individual DBs sound painful, but I'm not sure how consistent our users are. I guess I will have to watch the Bayesian accuracy as it's built and make a decision later.
This is why I like having a combination of a site-wide filter and a per-user
filter: for "lazy" users, the site-wide filter is enough; for "active" users,
the per-user filter will help them increase filtering accuracy.
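The fallback logic described here could be sketched as follows. The maturity threshold and the classifier interface are assumptions for illustration, not dspam's actual API:

```python
# Hypothetical site-wide + per-user combination: an "active" user's
# mature personal database takes over; a "lazy" user with an immature
# personal database falls back to the site-wide classifier.
MATURITY_TOKENS = 2500  # assumed cutoff for a "mature" user dictionary

def classify(msg, user_db, site_db):
    """Return (verdict, source), preferring a mature per-user db."""
    if user_db.token_count >= MATURITY_TOKENS:
        return user_db.classify(msg), 'per-user'
    return site_db.classify(msg), 'site-wide'
```

The point of the design is that the site-wide db keeps working with zero user effort, while a user who trains diligently eventually gets a filter tuned to his own mail.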
It seems as if most of the recommendations advise against trying to force-feed your Bayesian database. I suppose there are no shortcuts if you want it to be accurate.
The problem is one of symmetry. Spam is sent "randomly", so you can use
public corpora (this is not completely true, but it is an "acceptable
hypothesis" for an address that gets a lot of spam). Ham, on the other
hand, depends closely on the recipient. I use multiple addresses, and I
often get the same spam at almost all of them, but, with few exceptions,
all the ham is different. (This is one of the reasons I abandoned the
idea of dspam groups.)
I am beginning to think I won't be able to select new rulesets until the system
is online and I have a current metric to go by.
You might want to grab the SARE *0.cf rules. Except for a very few rules
that generate FPs, they catch a lot of spam.