On 2016-01-20 22:21, Marc Perkel wrote:
Here is a list of 3494938 words and phrases used in the subject line of SPAM and never seen in the subject line of HAM

http://www.junkemailfilter.com/data/subject-spam.txt

I thought I'd take you up on this using a combination of my corpus and the other mail I have indexed and trivially searchable. The latter is not necessarily corpus quality, but I can review it casually. I looked through your list of "words and phrases... never seen in the subject line of HAM" for entries that I thought I might find in my collection of ham, and here we go:

"alert you have"
"almost done!"
"almost go"
"application declined"
"application support"
"at any time dave" <-- Found one in my own mailbox! Woot!
"audible app" <-- Audible themselves used this in 2014.
"audio with" <-- Are you kidding? A bunch of hits from my mailbox, I see a bunch from OpenBSD's mailing lists, ffmpeg.org, and other places.

My ham indexes are tokenized with punctuation stripped. I found over a hundred hits for "almost done" and reviewed them manually; I found at least two "almost done!" in the first dozen and got bored. A ton of mail is already excluded for various reasons. For phrases with a small number of hits, I manually reviewed each one to rate spamminess; for phrases with larger numbers of hits, I stopped once I found a few strong hits.

I'm looking for substring matches, not necessarily anchored to the start or end of the subject, but a good chunk of these comprise the entire subject line ("almost done!", "application support", "application declined"), so even if you're not looking at substrings, it's still a sloppy mess.
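
For anyone who wants to repeat this kind of check against their own ham, here is a rough sketch of the matching described above: tokenize with punctuation stripped, then look for the phrase anywhere in the subject. The function names and the sample subjects are made up for illustration; this is not my actual index or tooling, just one way it could be done.

```python
import re

def tokenize(subject):
    """Lowercase a subject and keep only runs of letters, digits and apostrophes."""
    return re.findall(r"[a-z0-9']+", subject.lower())

def contains_phrase(subject, phrase):
    """True if the phrase's tokens appear consecutively anywhere in the subject."""
    words = tokenize(subject)
    target = tokenize(phrase)
    n = len(target)
    if n == 0:
        return False
    return any(words[i:i + n] == target for i in range(len(words) - n + 1))

def is_whole_subject(subject, phrase):
    """True if the phrase makes up the entire subject once punctuation is stripped."""
    return tokenize(subject) == tokenize(phrase)

# Hypothetical ham subjects, invented for this example only.
ham_subjects = [
    "Almost done! Please confirm your email address",
    "Re: syncing audio with video in ffmpeg",
    "Application support",
]

# A few entries taken from the "never seen in ham" list.
spam_only_phrases = ["almost done!", "audio with", "application support", "aall cards"]

# Map each supposedly spam-only phrase to the ham subjects that contain it.
ham_hits = {
    phrase: [s for s in ham_subjects if contains_phrase(s, phrase)]
    for phrase in spam_only_phrases
}

for phrase, hits in ham_hits.items():
    print(f"{phrase!r}: {len(hits)} ham hit(s)")
```

Because punctuation is stripped before matching, "almost done!" and "almost done" are the same phrase here, which is why hits found this way still need a manual look before they count.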

This is only on a few million messages that comprise a very narrow slice of the mail flow on the internet, and only from those customers where I can query their mail trivially.


Hope you understand it now. Not Bayesian!!!!

Perhaps not, but it seems like a natural precursor to a Bayesian implementation. As RW said further down the thread:

the only difference between

   "ambulatory care" -> only in ham
   "aall cards"      -> only in spam

and

    "ambulatory care"  occurs 16 times in ham and 0 times in spam
    "aall cards"       occurs  0 times in ham and 3 times in spam

is that you have discarded the count information.

And count information is important in determining the likely trustworthiness of a result. What would your system do with a phrase that appears in thousands of ham messages and two spam messages? Ignore it completely?
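
To make the point concrete, here is a minimal sketch comparing the two approaches. The first two counts are RW's from above; the third phrase and its 5000/2 split are hypothetical, standing in for "thousands of ham, 2 spam". The smoothing used (add one to each side) is just a simple stand-in I picked for illustration, not what any particular Bayesian filter does, and it ignores the relative sizes of the ham and spam corpora.

```python
def binary_label(ham, spam):
    """Set-membership view: a phrase only counts if one side is exactly zero."""
    if ham == 0 and spam > 0:
        return "only in spam"
    if spam == 0 and ham > 0:
        return "only in ham"
    return "discarded"  # seen on both sides (or not at all), so on neither list

def smoothed_spam_score(ham, spam):
    """Add-one smoothed fraction of sightings that were spam (0 = hammy, 1 = spammy)."""
    return (spam + 1) / (ham + spam + 2)

# (ham count, spam count) per phrase.
counts = {
    "ambulatory care": (16, 0),               # from RW's example
    "aall cards": (0, 3),                     # from RW's example
    "hypothetical common phrase": (5000, 2),  # assumed numbers, for illustration
}

report = {
    phrase: (binary_label(ham, spam), smoothed_spam_score(ham, spam))
    for phrase, (ham, spam) in counts.items()
}

for phrase, (label, score) in report.items():
    print(f"{phrase:28} {label:14} spam score {score:.4f}")
```

With counts kept, "aall cards" scores 0.8 on the strength of three sightings, which is suggestive but hardly conclusive, while the hypothetical phrase comes out strongly hammy instead of vanishing from both lists.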

--
Dave Warren
http://www.hireahit.com/
http://ca.linkedin.com/in/davejwarren

