On 1/17/2023 7:33 AM, David Bürgin wrote:
I have heard it said many times on this list that auto-learning is
discouraged, so I decided to finally look into disabling it.

But then I realised that I do have a use for auto-learning: In my setup,
I use a milter to reject certain spam (score > 10.0). Now, if I turn off
auto-learning I lose something. Because, as far as I understand, the
default spam auto-learning threshold of 12.0 causes incoming
high-probability spam to be learned as spam, even though the message is
then rejected and not available locally later.

Is my understanding correct? Auto-learning of spam can be useful if spam
is rejected during the SMTP conversation but after it has been seen
– and learned – by SpamAssassin?
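
To make the numbers in the question concrete, here is a rough sketch of how the two thresholds interact. It is a simplified model, not SpamAssassin's actual logic: the real auto-learn decision uses a separate score that leaves out the Bayes rules (among others) and also requires minimum header and body points before learning as spam. The reject score of 10.0 is the milter policy described above; 12.0 and 0.1 are the default values of bayes_auto_learn_threshold_spam and bayes_auto_learn_threshold_nonspam. The function name `outcome` is made up for illustration.

```python
from typing import Optional, Tuple

# Simplified model only; see the caveats in the paragraph above.
REJECT_SCORE = 10.0    # milter policy from the message: reject if score > 10.0
AUTOLEARN_SPAM = 12.0  # default bayes_auto_learn_threshold_spam
AUTOLEARN_HAM = 0.1    # default bayes_auto_learn_threshold_nonspam


def outcome(score: float) -> Tuple[bool, Optional[str]]:
    """Return (rejected_by_milter, auto_learned_as) for one message score."""
    rejected = score > REJECT_SCORE
    if score >= AUTOLEARN_SPAM:
        learned = "spam"
    elif score < AUTOLEARN_HAM:
        learned = "ham"
    else:
        learned = None  # between the thresholds: nothing is learned
    return rejected, learned


if __name__ == "__main__":
    for s in (-1.0, 5.0, 11.0, 15.0):
        print(s, outcome(s))
```

Under this model a message scoring 11.0 is rejected but never learned, while one scoring 15.0 is both rejected and learned as spam, which is the case the question is about.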

The problem I've seen with auto-learning is that miscategorization errors slowly spiral: once a few messages are learned under the wrong label, later mistakes of the same kind become more likely.  The technical term is that it reinforces a bias.  A Bayes database should be carefully maintained.  It is not a fire-and-forget technology.

And letting users control it raises the question of "what is spam?"  For example, users might get a perfectly legitimate mail BUT be tired of seeing it in their inbox, so they want to train it as spam.  Per-user implementations can be good, BUT you need a few hundred samples of good email and bad email to activate Bayes.

In short, I don't have a good solution for training Bayes that isn't a lot of work, but auto-learning is usually a bad solution.

One case where it might be good is if you had a system set up that you fed emails that were already classified.  It could then use that trusted feed for the auto-learning, which adds a way of learning without using the command line.
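
As one possible reading of that idea, a small script could take messages that have already been classified and hand them to sa-learn, so nobody has to type the command by hand. This is only a sketch under assumptions: the helper names `sa_learn_command` and `train` are hypothetical, the labels are assumed to come from a feed you trust, and it simply shells out to sa-learn with its `--spam` or `--ham` option rather than using SpamAssassin's own auto-learning.

```python
import subprocess
from pathlib import Path
from typing import List


def sa_learn_command(label: str, path: Path) -> List[str]:
    """Build the sa-learn invocation for one already-classified message.

    `label` must be "spam" or "ham"; anything else is refused so that an
    unclassified message can never reach the Bayes database by accident.
    """
    if label not in ("spam", "ham"):
        raise ValueError("label must be 'spam' or 'ham', got %r" % (label,))
    return ["sa-learn", "--" + label, str(path)]


def train(label: str, path: Path) -> None:
    """Run sa-learn on the message; raises CalledProcessError on failure."""
    subprocess.run(sa_learn_command(label, path), check=True)
```

For example, `train("spam", Path("/path/to/message.eml"))` would learn one message as spam. The quality of the Bayes database then depends entirely on how reliable the classified feed is.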

Regards,
KAM

--
Kevin A. McGrail
kmcgr...@apache.org

Member, Apache Software Foundation
Chair Emeritus Apache SpamAssassin Project
https://www.linkedin.com/in/kmcgrail - 703.798.0171
