Ryan Kather wrote:
I'll answer some parts...

Yes, from a purely testing perspective. I don't have the liberty of this since I am live production testing. I suppose I could move all received messages for all users through all filters and then only deliver to those users who have opted into the various test breakdowns. I'll look into this, as I think it would give a better picture of the accuracy than a subset of my users would.

one thing you could do is also test a combination of SA+dspam:

- dspam per user (with or without initial training). no groups.
- SA (either with a site bayes or with no bayes)
- SA is used to train dspam if the SA score is "sure" (< 0 or > 8, for example). This would implement autolearn for dspam based on SA.
- additionally, you can skip SA if dspam confidence is > 0.7, for example (or if the user dictionary is mature).
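The decision flow above can be sketched in a few lines. This is only an illustration of the thresholds mentioned in the post (SA < 0 / > 8 as "sure", dspam confidence > 0.7); the `dspam` and `spamassassin` objects and their methods are hypothetical stand-ins, not real APIs.

```python
def classify(message, dspam, spamassassin):
    """Return 'spam' or 'ham', training DSPAM from confident SA verdicts."""
    verdict, confidence = dspam.classify(message)  # e.g. ("spam", 0.92)

    # Skip SpamAssassin entirely when DSPAM is confident enough.
    if confidence > 0.7:
        return verdict

    score = spamassassin.score(message)

    # Autolearn for DSPAM based on "sure" SA scores.
    if score > 8:
        dspam.train(message, as_spam=True)
        return "spam"
    if score < 0:
        dspam.train(message, as_spam=False)
        return "ham"

    # Ambiguous zone: fall back to DSPAM's verdict without training.
    return verdict
```

In practice the glue would live in your delivery pipeline (e.g. between two postfix instances), but the control flow is the same.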


My experience is that most spam is detected by both, but:
- sometimes, one of the filters helps catch an FP the other would have made
- sometimes one of the filters detects spam that the other doesn't (dspam can't do URIBL lookups, ...).



Interesting. I hadn't thought to choose delivery with LDAP instead of transports. I suppose this would get rid of my need for multiple postfix instances. Good suggestion, I'll have to look into this.

No. LDAP is a lookup method, not a postfix replacement. You need multiple instances because you need different transport configurations, whether you use mysql, ldap, pgsql, hash, etc.
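For what it's worth, postfix can resolve per-recipient transports from LDAP via `transport_maps`. A minimal sketch, assuming your directory stores the transport in an attribute (the attribute name `mailTransport` and the hostnames here are site-specific assumptions, not fixed names):

```
# main.cf
transport_maps = ldap:/etc/postfix/ldap-transport.cf

# /etc/postfix/ldap-transport.cf
server_host = ldap.example.com
search_base = ou=people,dc=example,dc=com
query_filter = (mail=%s)
result_attribute = mailTransport
```

This only moves the lookup into LDAP; as said above, it doesn't remove the need for separate instances when the instances themselves must be configured differently.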


FILTER doesn't work because of multi-recipient mail (only one filter is used per message).


Trap accounts are great, but I always worry they get different spam than real accounts and pollute the Bayesian database. Has anyone experienced this? Also, how does SpamAssassin deal with the Bayesian pollution attempts seen recently (spam emails with garbage text in them)?

I can't say, and I too don't trust traps. But in my experiments, they seem to improve accuracy for "immature" users. Spammers could attack traps (by posting "ham text"), but I haven't seen that yet.

I could pretty much trust a small subset of users to be fairly regular in their training. There is a somewhat larger portion of

They might be telling less trustworthy users how to take part in the training process, and then break up your Bayes DB. Those less-savvy users should be managed with amavisd-new LDAP profiling.

LDAP profiling.. Haven't seen examples of that yet. I will definitely research.

I personally think that no user should train anything but his own db. This is why I like the idea of a site-wide + per-user setup: the site-wide db doesn't rely on people. I have tried implementing dspam groups, but couldn't find a safe way.

The individual DBs sound painful, but I'm not sure how consistent our users are. I guess I will have to watch the Bayesian accuracy as it's built and make a decision later.

This is why I like having a combination of a site-wide filter + a per-user filter. For "lazy" users, the site-wide filter is enough; for "active" users, the per-user filter will help them increase filtering accuracy.
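The site-wide + per-user idea amounts to a simple fallback rule. A minimal sketch, assuming a per-user db is only trusted once it has seen enough tokens; the threshold and function names are hypothetical, chosen for illustration:

```python
MATURITY_THRESHOLD = 2500  # assumed token count; tune per site

def pick_database(user_token_count, per_user_db, site_wide_db):
    """Use the per-user Bayes db only once it is mature; otherwise
    fall back to the shared site-wide db, which doesn't rely on any
    individual user's training habits."""
    if user_token_count >= MATURITY_THRESHOLD:
        return per_user_db
    return site_wide_db
```

A "lazy" user's count never crosses the threshold, so they stay on the site-wide filter indefinitely, which is exactly the intended behavior.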


It seems as if most of the recommendations advise against trying to force-feed your Bayesian database. I suppose there are no shortcuts if you want it to be accurate.

the problem is one of asymmetry. Spam is sent "randomly", so you can use public corpuses (this is not completely true, but it is an acceptable hypothesis for an address that gets a lot of spam). Ham, on the other hand, depends closely on the recipient. I use multiple addresses, and I often get the same spam at almost all of them, but with few exceptions, all ham is different. (This is one of the reasons I abandoned the idea of dspam groups.)


I am beginning to think I won't be able to select new rulesets until the system is online and I have a current metric on it to go by.


you might want to grab the *0.cf SARE rules. Except for a very few rules which generate FPs, they catch a lot of spam.

