Re: Tips on training bayes?

Bill Cole Thu, 19 Sep 2024 06:06:57 -0700

On 2024-09-17 at 16:29:52 UTC-0400 (Tue, 17 Sep 2024 16:29:52 -0400)
Alex <mysqlstud...@gmail.com>
is rumored to have said:

It is up to the user, ie you, what is and what is not spam.
Well, yes, and no.
Of course it's my own system and I can define these terms however I wish. I'm also familiar with the need to investigate every message - perhaps I
should have made that clear initially.

It's only these few types of messages that are very subjective and
experience from the broader open source community would be appreciated.

The debate over the specific definition of "spam" is an old and diverse conversation. It has damaged friendships and careers.

If it has a legitimate unsubscribe link, does that make it ham?

No.

What criteria do you use to determine "spamminess/haminess of EVERY
message"?


The Official Lumber Cartel acronym for spam is UBE:

Unsolicited: the sender has no sound reason to believe that the target requested this particular email (or narrowly defined class of email.)

Bulk: the sender appears to have sent substantially the same message to many different people without meaningful targeting. This can be inferred from generic content directed at the widest audience, e.g. commercial or political advertising.


Email: obvious.

Judging that requires some knowledge of the target. I can't tell you whether your borderline email is spam. Neither can SA, but Bayes is one way it tries to guess.

Is the goal to have every message one of either BAYES_00 or BAYES_99 or is it okay that newsletters (for example) are BAYES_50, and let other rules,
like network checks, determine the score?

The logical model of Naive Bayesian classification is for strictly binary classes. A message is either ham or spam. Identical messages can be ham in one mailbox and spam in another, so I suppose one could more accurately see the classification as being of the combined email and its envelope of metadata.

Bayesian classification does NOT provide a degree of "spamminess" in email, it provides a probability of mail being spam. That is a subtle but important distinction. A 50% Bayes score doesn't mean a message is semi-spam, it means Bayes cannot tell whether the message is spam. So yes, it is *OK* that Bayes can't tell whether a newsletter that has spam-like content but has an unsub link going to a usually-good ESP is spam or ham. A lot of email is that way: its insane HTML and/or hype-filled wording smells like spam but since the target wants it, it's ham.
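
To make the "probability, not degree" point concrete, here is a minimal sketch of a textbook naive Bayes combination in Python. It is an illustration only and is not SA's actual combining code; the function name and the per-token probabilities are made up for the example. Conflicting evidence cancels out and lands near 0.5, which reads as "cannot tell", not "half spam".

```python
import math

def spam_probability(token_probs):
    """Combine per-token P(spam | token) values under the naive
    independence assumption with equal priors, working in log-odds.
    Textbook illustration, not SpamAssassin's implementation."""
    log_odds = sum(math.log(p) - math.log(1.0 - p) for p in token_probs)
    return 1.0 / (1.0 + math.exp(-log_odds))

# Hypothetical newsletter: hype-filled wording pulls toward spam,
# while tokens seen in wanted mail pull equally hard toward ham.
newsletter = [0.9, 0.8, 0.1, 0.2]

# Hypothetical obvious spam: every token points the same way.
obvious_spam = [0.9, 0.95, 0.99]

print(round(spam_probability(newsletter), 3))    # evidence cancels: about 0.5
print(round(spam_probability(obvious_spam), 3))  # close to 1
```

The newsletter result sits in the middle because the evidence disagrees with itself, which is exactly the situation where other rules have to decide.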

This is a core design principle in SA: there's no perfect objective test for spam. That's why we have hundreds of scored rules and sub-rules and multiple shared reputation tests. A single test (such as Bayes) being wrong is not a flaw, it is an inescapable attribute of SA's design.
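
As a rough sketch of why one wrong test is survivable: the final verdict comes from summing the scores of every rule that hit and comparing the total to a threshold (required_score, 5.0 by default). The Python below is only a model of that idea. The scores are invented, and the two HYPOTHETICAL_* rule names do not exist in SA.

```python
def classify(hits, scores, required_score=5.0):
    """Sum the scores of the rules that hit and compare the total to the
    threshold. A simplified model of score-based filtering."""
    total = sum(scores[name] for name in hits)
    return total, total >= required_score

# Made-up scores for illustration; real deployments use their own values.
scores = {
    "BAYES_50": 0.8,
    "BAYES_99": 3.5,
    "HYPOTHETICAL_DNSBL_HIT": 2.5,     # hypothetical network check
    "HYPOTHETICAL_TRUSTED_ESP": -1.0,  # hypothetical reputation credit
}

# Bayes cannot tell; the other rules settle it as ham.
print(classify(["BAYES_50", "HYPOTHETICAL_TRUSTED_ESP"], scores))

# Bayes is confidently wrong about a wanted newsletter; still ham.
print(classify(["BAYES_99", "HYPOTHETICAL_TRUSTED_ESP"], scores))

# Bayes and a network check agree; the total crosses the threshold.
print(classify(["BAYES_99", "HYPOTHETICAL_DNSBL_HIT"], scores))
```

In the second case Bayes alone would have called the message spam, but no single rule in this model carries enough weight to cross the threshold by itself.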



--
Bill Cole
b...@scconsult.com or billc...@apache.org

(AKA @grumpybozo@toad.social and many *@billmail.scconsult.com addresses)

Not Currently Available For Hire
