On Thursday 21 January 2016 at 13:11:15, RW wrote:

> On Wed, 20 Jan 2016 22:21:49 -0800 Marc Perkel wrote:
> > OK - Just to show you this isn't Bayesian - see if you can do this.
> >
> > Here is a list of 5505874 words and phrases used in the subject line
> > of HAM and never seen in the subject line of SPAM
> >
> > http://www.junkemailfilter.com/data/subject-ham.txt
> >
> > Here is a list of 3494938 words and phrases used in the subject line
> > of SPAM and never seen in the subject line of HAM
> >
> > http://www.junkemailfilter.com/data/subject-spam.txt
> >
> > Hope you understand it now. Not Bayesian!!!!
>
> the only difference between
>
> "ambulatory care" -> only in ham
> "aall cards" -> only in spam
>
> and
>
> "ambulatory care" occurs 16 times in ham and 0 times in spam
> "aall cards" occurs 0 times in ham and 3 times in spam
>
> is that you have discarded the count information.

Plus, the "never in ham" and "never in spam" lists omit any mention of words and phrases which exist in differing proportions in both - Bayes includes that, and I would expect that a spam identifier which takes account of as many known characteristics of spam/ham as possible is going to do the best job.

Antony.

-- 
Software development can be quick, high quality, or low cost.
The customer gets to pick any two out of three.

Please reply to the list; please *don't* CC me.
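
To make the two points in this thread concrete, here is a minimal Python sketch. It shows that the "only in ham" / "only in spam" lists can be derived from per-phrase counts by throwing the counts away, and that a phrase seen in both classes drops out of both lists even though a count-based estimate still scores it. The counts for "ambulatory care" and "aall cards" are the ones quoted above; the phrase "special offer", its counts, the corpus sizes `N_HAM` and `N_SPAM`, and both function names are hypothetical and exist only for illustration. The probability is a simplified per-token estimate with add-one smoothing and equal priors, not the formula used by any particular filter.

```python
# Counts for the first two phrases come from the thread. "special offer",
# its counts, and the corpus sizes are hypothetical illustration values.
ham_counts = {"ambulatory care": 16, "aall cards": 0, "special offer": 5}
spam_counts = {"ambulatory care": 0, "aall cards": 3, "special offer": 45}
N_HAM = 100   # hypothetical number of ham messages
N_SPAM = 100  # hypothetical number of spam messages


def exclusive_lists(ham, spam):
    """Return (ham_only, spam_only) sets, discarding the counts."""
    tokens = set(ham) | set(spam)
    ham_only = {t for t in tokens if ham.get(t, 0) > 0 and spam.get(t, 0) == 0}
    spam_only = {t for t in tokens if spam.get(t, 0) > 0 and ham.get(t, 0) == 0}
    return ham_only, spam_only


def spam_probability(token, ham, spam, n_ham, n_spam, alpha=1.0):
    """Estimate P(spam | token) with add-alpha smoothing and equal priors."""
    p_token_given_spam = (spam.get(token, 0) + alpha) / (n_spam + 2 * alpha)
    p_token_given_ham = (ham.get(token, 0) + alpha) / (n_ham + 2 * alpha)
    return p_token_given_spam / (p_token_given_spam + p_token_given_ham)


ham_only, spam_only = exclusive_lists(ham_counts, spam_counts)
print("only in ham: ", sorted(ham_only))
print("only in spam:", sorted(spam_only))

for phrase in sorted(ham_counts):
    p = spam_probability(phrase, ham_counts, spam_counts, N_HAM, N_SPAM)
    print(f"{phrase!r}: P(spam | phrase) ~ {p:.3f}")
```

Under these assumed numbers, "special offer" appears in neither exclusive list, yet the count-based estimate still treats it as strong evidence of spam. The same estimate also separates a phrase seen 3 times in one class from one seen 16 times, which the plain lists cannot do.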