On 2024-09-17 at 16:29:52 UTC-0400 (Tue, 17 Sep 2024 16:29:52 -0400)
Alex <mysqlstud...@gmail.com>
is rumored to have said:
It is up to the user, i.e. you, what is and what is not spam.
Well, yes, and no.
Of course it's my own system and I can define these terms however I
wish.
I'm also familiar with the need to investigate every message - perhaps I
should have made that clear initially.
It's only these few types of messages that are very subjective and
experience from the broader open source community would be
appreciated.
The debate over the specific definition of "spam" is an old and diverse
conversation. It has damaged friendships and careers.
If it has a legitimate unsubscribe link, does that make it ham?
No.
What criteria do you use to determine "spamminess/haminess of EVERY
message"?
The Official Lumber Cartel acronym for spam is UBE:
Unsolicited: the sender has no sound reason to believe that the target
requested this particular email (or narrowly defined class of email).
Bulk: the sender appears to have sent substantially the same message to
many different people without meaningful targeting. This can be inferred
from generic content directed at the widest audience, e.g. commercial or
political advertising.
Email: obvious.
Judging that requires some knowledge of the target. I can't tell you
whether your borderline email is spam. Neither can SA, but Bayes is one
way it tries to guess.
Is the goal to have every message one of either BAYES_00 or BAYES_99 or
is it okay that newsletters (for example) are BAYES_50, and let other
rules, like network checks, determine the score?
The logical model of Naive Bayesian classification is for strictly
binary classes. A message is either ham or spam. Identical messages can
be ham in one mailbox and spam in another, so I suppose one could more
accurately see the classification as being of the combined email and its
envelope of metadata.
Bayesian classification does NOT provide a degree of "spamminess" in
email; it provides a probability of mail being spam. That is a subtle
but important distinction. A 50% Bayes score doesn't mean a message is
semi-spam; it means Bayes cannot tell whether the message is spam. So
yes, it is *OK* that Bayes can't tell whether a newsletter that has
spam-like content but has an unsub link going to a usually-good ESP is
spam or ham. A lot of email is that way: its insane HTML and/or
hype-filled wording smells like spam but since the target wants it, it's
ham.
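
To make the "probability, not degree" point concrete, here is a minimal
sketch of a textbook naive Bayes combination. It is NOT SA's actual Bayes
code, which combines token evidence differently; the function name, the
clamping limits, the equal-priors assumption, and the per-token
probabilities are all made up for illustration.

```python
import math

def spam_probability(token_probs):
    """Combine per-token P(spam | token) values, assuming equal priors
    and independent tokens (the "naive" part of naive Bayes)."""
    log_odds = 0.0  # equal priors: prior odds of 1, so log-odds of 0
    for p in token_probs:
        # Clamp so no single token can force absolute certainty.
        p = min(max(p, 0.01), 0.99)
        log_odds += math.log(p / (1.0 - p))
    return 1.0 / (1.0 + math.exp(-log_odds))

# Hypothetical token probabilities, not taken from any real database.
spammy = spam_probability([0.95, 0.90, 0.85])      # evidence all points one way
hammy = spam_probability([0.05, 0.10, 0.15])       # evidence all points the other way
mixed = spam_probability([0.95, 0.05, 0.90, 0.10]) # evidence cancels out

print(f"spammy={spammy:.3f} hammy={hammy:.3f} mixed={mixed:.3f}")
```

The mixed case lands at about 0.5 because strong spam evidence and strong
ham evidence cancel, not because the message is "half spam". That is the
situation of a wanted newsletter with spam-like wording.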
This is a core design principle in SA: there's no perfect objective test
for spam. That's why we have hundreds of scored rules and sub-rules and
multiple shared reputation tests. A single test (such as Bayes) being
wrong is not a flaw; it is an inescapable attribute of SA's design.
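
As a rough sketch of that principle: each rule that hits contributes a
score, and only the total is compared with the threshold. The 5.0
threshold is SA's default required_score. Everything else here is an
assumption for illustration: the scores are invented rather than SA's
shipped values, the three non-Bayes rule names are hypothetical, and the
"at or above the threshold" comparison is my reading of how the total is
judged.

```python
REQUIRED_SCORE = 5.0  # SA's default required_score

# Illustrative scores only; the last three rule names are hypothetical.
ILLUSTRATIVE_SCORES = {
    "BAYES_50": 0.8,
    "BAYES_99": 3.5,
    "HYPOTHETICAL_HTML_HEAVY": 1.2,
    "HYPOTHETICAL_BLOCKLIST_HIT": 3.5,
    "HYPOTHETICAL_TRUSTED_SENDER": -4.0,
}

def classify(hits):
    """Sum the scores of the rules that hit; return (total, is_spam)."""
    total = sum(ILLUSTRATIVE_SCORES[name] for name in hits)
    return total, total >= REQUIRED_SCORE

# Bayes is undecided and nothing else is damning: delivered as ham.
newsletter = classify(["BAYES_50", "HYPOTHETICAL_HTML_HEAVY"])

# Same undecided Bayes result, but a network check pushes it over.
blocklisted = classify(
    ["BAYES_50", "HYPOTHETICAL_HTML_HEAVY", "HYPOTHETICAL_BLOCKLIST_HIT"]
)

# Bayes is confidently wrong, and another rule outvotes it.
bayes_wrong = classify(["BAYES_99", "HYPOTHETICAL_TRUSTED_SENDER"])

print(newsletter, blocklisted, bayes_wrong)
```

In the last case a single test being wrong costs nothing, which is the
point of spreading the decision across many rules.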
--
Bill Cole
b...@scconsult.com or billc...@apache.org
(AKA @grumpybozo@toad.social and many *@billmail.scconsult.com
addresses)
Not Currently Available For Hire