On Thursday 21 January 2016 at 13:11:15, RW wrote:

> On Wed, 20 Jan 2016 22:21:49 -0800 Marc Perkel wrote:
> > OK - Just to show you this isn't Bayesian - see if you can do this.
> > 
> > Here is a list of 5505874 words and phrases used in the subject line
> > of HAM and never seen in the subject line of SPAM
> > 
> > http://www.junkemailfilter.com/data/subject-ham.txt
> > 
> > Here is a list of 3494938 words and phrases used in the subject line
> > of SPAM and never seen in the subject line of HAM
> > 
> > http://www.junkemailfilter.com/data/subject-spam.txt
> > 
> > Hope you understand it now. Not Bayesian!!!!
> 
> the only difference between
> 
> 
>   "ambulatory care" -> only in ham
>   "aall cards"      -> only in spam
> 
> and
> 
>    "ambulatory care"  occurs 16 times in ham and 0 times in spam
> 
>    "aall cards"       occurs  0 times in ham and 3 times in spam
> 
> is that you have discarded the count information.

Plus, the "never in ham" and "never in spam" lists omit every word and phrase
that occurs in both, in differing proportions. Bayes takes those into account,
and I would expect a spam identifier that uses as many known characteristics
of spam and ham as possible to do the best job.
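
To make the point concrete, here is a rough sketch in Python. It is not how
any particular filter implements it. The first two count pairs are the ones
RW quoted; the "free quote" entry, the corpus sizes (1000 subjects each), the
add-one smoothing and both function names are assumptions made up for
illustration.

```python
# Sketch only: compares an "exclusive list" lookup with a count-based,
# Bayes-style per-token estimate. Corpus sizes and the "free quote" entry
# are hypothetical; the other two count pairs come from the quoted message.

N_HAM = 1000   # assumed number of ham subjects seen
N_SPAM = 1000  # assumed number of spam subjects seen

# token -> (times seen in ham, times seen in spam)
COUNTS = {
    "ambulatory care": (16, 0),
    "aall cards": (0, 3),
    "free quote": (5, 40),  # hypothetical token seen in both classes
}


def exclusive_list_label(ham_count, spam_count):
    """Return the list a token would land on, or None if it is on neither."""
    if ham_count > 0 and spam_count == 0:
        return "ham-only"
    if spam_count > 0 and ham_count == 0:
        return "spam-only"
    return None  # seen in both (or in neither): the lists say nothing


def spam_probability(ham_count, spam_count, n_ham=N_HAM, n_spam=N_SPAM,
                     smoothing=1.0):
    """Smoothed estimate of P(spam | token), assuming equal class priors."""
    p_token_given_spam = (spam_count + smoothing) / (n_spam + 2 * smoothing)
    p_token_given_ham = (ham_count + smoothing) / (n_ham + 2 * smoothing)
    return p_token_given_spam / (p_token_given_spam + p_token_given_ham)


if __name__ == "__main__":
    for token, (ham, spam) in COUNTS.items():
        label = exclusive_list_label(ham, spam)
        prob = spam_probability(ham, spam)
        print(f"{token!r}: list={label}, P(spam|token)={prob:.3f}")
```

With these numbers the list lookup returns None for "free quote", so that
evidence is thrown away, whereas the count-based estimate still leans towards
spam. The counts also separate a phrase seen 16 times in ham from one seen
only once, which the lists cannot do.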


Antony.

-- 
Software development can be quick, high quality, or low cost.

The customer gets to pick any two out of three.

                                                   Please reply to the list;
                                                         please *don't* CC me.
