Re: Can your bayes do this?

Dave Warren Sun, 24 Jan 2016 15:45:06 -0800

On 2016-01-20 22:21, Marc Perkel wrote:

Here is a list of 3494938 words and phrases used in the subject line of SPAM and never seen in the subject line of HAM
http://www.junkemailfilter.com/data/subject-spam.txt

I thought I'd take you up on this using a combination of my corpus and the other mail I have indexed and trivially searchable, which is not necessarily corpus quality but which I can review casually. So I looked through your list of "words and phrases... never seen in the subject line of HAM" for entries that I thought I might find in my collection of ham, and here we go:


"alert you have"
"almost done!"
"almost go"
"application declined"
"application support"
"at any time dave" <-- Found one in my own mailbox! Woot!
"audible app" <-- Audible themselves used this in 2014.

"audio with" <-- Are you kidding? A bunch of hits from my mailbox, I see a bunch from OpenBSD's mailing lists, ffmpeg.org, and other places.

My ham indexes are tokenized with punctuation stripped. I found over a hundred hits for "almost done" and manually reviewed them; I found at least two "almost done!" in the first dozen and got bored. A ton of mail is already excluded for various reasons. For results with a small number of hits, I manually reviewed to rate spamminess; for larger numbers of hits, I got bored once I found a few strong hits.

I'm looking for substring matches, not necessarily anchored to the start or end of the subject, but a good chunk of these comprise the entire subject line ("almost done!", "application support", "application declined"), so even if you're not looking at substrings, it's still a sloppy mess.
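To make the matching concrete, here is a rough sketch of the two kinds of match I mean. This is not my actual indexing code; the function names, the tokenizer regex, and the sample subjects are all made up for illustration.

```python
import re

def tokenize(subject):
    """Lowercase a subject and split it into word tokens, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", subject.lower())

def contains_phrase(subject, phrase):
    """True if the phrase's tokens appear consecutively anywhere in the subject."""
    words = tokenize(subject)
    target = tokenize(phrase)
    n = len(target)
    if n == 0:
        return False
    # Slide a window of n tokens across the subject (unanchored match).
    return any(words[i:i + n] == target for i in range(len(words) - n + 1))

def is_entire_subject(subject, phrase):
    """True if the phrase accounts for the whole subject line after tokenizing."""
    return tokenize(subject) == tokenize(phrase)

# Hypothetical subjects, not taken from any real corpus.
print(contains_phrase("Re: Almost done - review pending", "almost done!"))
print(is_entire_subject("Almost done!", "almost done!"))
print(contains_phrase("Audiobook with extras", "audio with"))
```

Matching on tokens rather than raw characters means "audio with" does not hit inside "audiobook with", which is roughly what a tokenized index gives you.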

This is only on a few million messages that comprise a very narrow slice of the mail flow on the internet, and only from those customers where I can query their mail trivially.

Hope you understand it now. Not Bayesian!!!!

Perhaps not, but it seems like it's a natural precursor to a Bayesian implementation. As RW said further down the thread:

the only difference between

   "ambulatory care" -> only in ham
   "aall cards"      -> only in spam

and

    "ambulatory care"  occurs 16 times in ham and 0 times in spam
    "aall cards"       occurs  0 times in ham and 3 times in spam

is that you have discarded the count information.

And count information is important in determining the likely trustworthiness of a result. What would your system do with a phrase that appears in thousands of ham messages, and 2 spam messages? Ignore it completely?
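As a rough illustration of why the counts matter, here is a minimal sketch of one common way to turn per-phrase counts into a smoothed spam probability, in the style of Gary Robinson's formula. The corpus sizes, the 5000-ham example, and the `strength` and `prior` defaults are assumptions picked for the example, not numbers from anyone's real system.

```python
def spam_probability(spam_hits, ham_hits, n_spam, n_ham, strength=1.0, prior=0.5):
    """Smoothed estimate that a message containing the phrase is spam.

    spam_hits / ham_hits: messages containing the phrase in each corpus.
    n_spam / n_ham: total messages in each corpus (hypothetical sizes below).
    strength / prior: how strongly to pull low-count phrases toward the prior.
    """
    spam_rate = spam_hits / n_spam
    ham_rate = ham_hits / n_ham
    if spam_rate + ham_rate == 0:
        return prior  # never seen anywhere: no evidence either way
    raw = spam_rate / (spam_rate + ham_rate)
    n = spam_hits + ham_hits
    # Few observations stay near the prior; many observations dominate it.
    return (strength * prior + n * raw) / (strength + n)

# Assumed corpus sizes, purely for illustration.
N_SPAM = N_HAM = 100_000

print(spam_probability(3, 0, N_SPAM, N_HAM))     # "aall cards": 3 spam, 0 ham
print(spam_probability(0, 16, N_SPAM, N_HAM))    # "ambulatory care": 0 spam, 16 ham
print(spam_probability(2, 5000, N_SPAM, N_HAM))  # thousands of ham, 2 spam
```

With these assumed numbers, the three-hit spam-only phrase comes out at 0.875 rather than a flat 1.0, and the phrase seen in thousands of ham and 2 spam stays a strong ham indicator instead of dropping off both lists.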


--
Dave Warren
http://www.hireahit.com/
http://ca.linkedin.com/in/davejwarren
