On 1/17/2023 7:33 AM, David Bürgin wrote:
I have heard it said many times on this list that auto-learning is
discouraged, so I decided to finally look into disabling it.
But then I realised that I do have a use for auto-learning: In my setup,
I use a milter to reject certain spam (score > 10.0). Now, if I turn off
auto-learning I lose something. Because, as far as I understand, the
default spam auto-learning threshold of 12.0 causes incoming
high-probability spam to be learned as spam, even though the message is
then rejected and not available locally later.
Is my understanding correct? Auto-learning of spam can be useful if spam
is rejected during the SMTP conversation but after it has been seen
– and learned – by SpamAssassin?
The problem I've seen with auto-learning is that miscategorization
errors slowly spiral: every message learned under the wrong label makes
the next mistake more likely. The technical term is that it reinforces a
bias. A Bayes database should be carefully maintained. It's not really a
fire-and-forget technology.
And if you let users control it, it becomes a question of "what is
spam?" For example, users might get a very legit mail BUT they are
tired of seeing it in their inbox, so they want to train it as spam.
If you have per-user implementations, that can be good BUT you need a
few hundred samples each of good email and bad email (SpamAssassin's
default minimum is 200 of each) to activate Bayes.
In short, I don't have a good solution for training Bayes that isn't a
lot of work, but auto-learning is usually a bad solution.
One case where it might be good is if you had a system set up to receive
a feed of emails that were already classified. Auto-learning could then
work from that known-good feed, which would also give you a way of
learning without using the command line.
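
To make the threshold interplay from the original question concrete,
here is a minimal sketch, not SpamAssassin's actual logic. It assumes the
milter rejects above 10.0 and that auto-learning uses thresholds of 12.0
for spam and 0.1 for ham. The names handle, Outcome and the constants are
made up for illustration. Real SpamAssassin computes a separate
auto-learn score and applies extra conditions, which this ignores.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical names; values follow the thresholds discussed in this thread.
REJECT_THRESHOLD = 10.0          # milter rejects when score > 10.0
AUTOLEARN_SPAM_THRESHOLD = 12.0  # assumed spam auto-learn threshold
AUTOLEARN_HAM_THRESHOLD = 0.1    # assumed ham auto-learn threshold


@dataclass
class Outcome:
    rejected: bool             # refused during the SMTP conversation
    learned_as: Optional[str]  # "spam", "ham", or None if not learned


def handle(score: float) -> Outcome:
    """Simplified decision for one scanned message (illustration only)."""
    if score >= AUTOLEARN_SPAM_THRESHOLD:
        learned = "spam"
    elif score < AUTOLEARN_HAM_THRESHOLD:
        learned = "ham"
    else:
        learned = None  # the middle band is never auto-learned
    return Outcome(rejected=score > REJECT_THRESHOLD, learned_as=learned)


if __name__ == "__main__":
    for s in (-1.0, 5.0, 11.0, 15.0):
        print(s, handle(s))
```

Under these assumptions, a message scoring 11.0 is rejected but never
learned, while one scoring 15.0 is both rejected and learned as spam.
With auto-learning off, neither would reach Bayes unless it is fed in
some other way.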
Regards,
KAM
--
Kevin A. McGrail
kmcgr...@apache.org
Member, Apache Software Foundation
Chair Emeritus Apache SpamAssassin Project
https://www.linkedin.com/in/kmcgrail - 703.798.0171