On 2016-01-20 22:21, Marc Perkel wrote:
> Here is a list of 3494938 words and phrases used in the subject line
> of SPAM and never seen in the subject line of HAM
> http://www.junkemailfilter.com/data/subject-spam.txt
I thought I'd take you up on this, using a combination of my corpus and
the other mail I have indexed and trivially searchable. That mail is not
necessarily corpus quality, but I can review it casually. I looked
through your list of "words and phrases... never seen in the subject
line of HAM" for entries that I thought I might find in my collection of
ham, and here we go:
"alert you have"
"almost done!"
"almost go"
"application declined"
"application support"
"at any time dave" <-- Found one in my own mailbox! Woot!
"audible app" <-- Audible themselves used this in 2014.
"audio with" <-- Are you kidding? A bunch of hits in my own mailbox, plus
more from OpenBSD's mailing lists, ffmpeg.org, and other places.
My ham indexes are tokenized with punctuation stripped. I found over a
hundred hits for "almost done" and manually reviewed them; I found at
least two "almost done!" in the first dozen and got bored. A ton of mail
is already excluded for various reasons. For results with a small number
of hits, I manually reviewed them to rate spamminess; for larger numbers
of hits, I stopped once I had found a few strong hits.
I'm looking for substring matches, not necessarily anchored to the start
or end of the subject, but a good chunk of these comprise the entire
subject line ("almost done!", "application support", "application
declined"), so even if you're not looking at substrings, it's still a
sloppy mess.
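
To make the distinction between a substring hit and a whole-subject hit
concrete, here is a minimal Python sketch. It is not my actual index or
query code; the function names and the sample subjects are made up for
illustration, and it assumes "substring" means a contiguous run of word
tokens after lowercasing and stripping punctuation.

```python
import re
from typing import List, Optional


def normalize(subject: str) -> List[str]:
    """Lowercase a subject and split it into word tokens, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", subject.lower())


def phrase_match(phrase: str, subject: str) -> Optional[str]:
    """Classify how a listed phrase matches a subject line.

    Returns "whole" if the phrase is the entire subject, "substring" if its
    tokens appear as a contiguous run inside the subject, and None otherwise.
    """
    p = normalize(phrase)
    s = normalize(subject)
    if not p:
        return None
    if p == s:
        return "whole"
    n = len(p)
    for i in range(len(s) - n + 1):
        if s[i:i + n] == p:
            return "substring"
    return None


# Hypothetical subjects, not taken from any real mailbox.
print(phrase_match("almost done!", "Almost done!"))
print(phrase_match("audio with", "Re: problems syncing audio with ffmpeg"))
print(phrase_match("application declined", "Your application was declined"))
```

Because punctuation is dropped before comparing, "almost done" and
"almost done!" are indistinguishable to this kind of index, which is why
those hits needed a manual look.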
This is only on a few million messages that comprise a very narrow slice
of the mail flow on the internet, and only from those customers where I
can query their mail trivially.
> Hope you understand it now. Not Bayesian!!!!
Perhaps not, but it seems like it's a natural precursor to a Bayesian
implementation. As RW said further down the thread:
> the only difference between
>
> "ambulatory care" -> only in ham
> "aall cards" -> only in spam
>
> and
>
> "ambulatory care" occurs 16 times in ham and 0 times in spam
> "aall cards" occurs 0 times in ham and 3 times in spam
>
> is that you have discarded the count information.
And count information is important in determining the likely
trustworthiness of a result. What would your system do with a phrase
that appears in thousands of ham messages, and 2 spam messages? Ignore
it completely?
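
As a rough illustration of why the counts matter, here is a small Python
sketch that turns ham/spam counts into a smoothed spam probability. This
is not Marc's system or RW's proposal; the function name, the corpus
sizes, the prior, and the smoothing strength are all assumptions chosen
for the example. The 16/0 and 0/3 counts come from RW's example above,
and the "thousands of ham, 2 spam" case uses 5000 as a stand-in.

```python
def spam_probability(ham_count: int, spam_count: int,
                     total_ham: int, total_spam: int,
                     prior: float = 0.5, strength: float = 1.0) -> float:
    """Estimate how spammy a phrase is, pulled toward `prior` when evidence is thin.

    The raw estimate compares the phrase's rate in spam with its rate in ham;
    the smoothing step weights that estimate by the number of observations.
    """
    n = ham_count + spam_count
    if n == 0:
        return prior  # never seen: no evidence either way
    ham_rate = ham_count / total_ham
    spam_rate = spam_count / total_spam
    raw = spam_rate / (spam_rate + ham_rate)
    return (strength * prior + n * raw) / (strength + n)


# Hypothetical corpus sizes, assumed equal for simplicity.
TOTAL_HAM = TOTAL_SPAM = 100_000

examples = {
    "aall cards": (0, 3),          # counts from RW's example
    "ambulatory care": (16, 0),    # counts from RW's example
    "mostly-ham phrase": (5000, 2) # stand-in for "thousands of ham, 2 spam"
}
for phrase, (ham, spam) in examples.items():
    p = spam_probability(ham, spam, TOTAL_HAM, TOTAL_SPAM)
    print(f"{phrase!r}: ham={ham} spam={spam} p(spam)={p:.4f}")
```

With counts kept, a phrase seen 3 times only in spam scores as likely
spam but not certain, while one seen thousands of times in ham and twice
in spam still scores as strongly hammy instead of being thrown away.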
--
Dave Warren
http://www.hireahit.com/
http://ca.linkedin.com/in/davejwarren