Re: Train SA with e-mails 100% proven spams and next time it should be marked as spam

RW Thu, 15 Feb 2018 11:11:20 -0800

On Thu, 15 Feb 2018 11:56:55 -0600 (CST)
sha...@shanew.net wrote:

> On Thu, 15 Feb 2018, RW wrote:
>
> > As I said, Bayes is based on frequencies.
> >
> > If a token occurs in 10% of ham and 0.5% of spam based on 10,000
> > hams and 10,000 spams, what do you think is likely to happen to
> > those percentages with 10,000 hams and 1,000,000 spams?  
> 
> ...
> So, the sample size doesn't matter when calculating the probability of
> a message being spam based on individual tokens, but it can matter
> when we bring them all together to make a final calculation.

It's not a matter of how they combine; smaller counts just lead to
less accurate token probabilities.

I'm not saying that it doesn't matter how much you train. I'm saying
that if you have enough spam and enough ham, Bayes is insensitive to
the ratio.
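
To make the frequency argument concrete, here is a minimal sketch in
Python. It assumes a simplified per-token estimate built only from
per-class frequencies; the function name token_spam_prob and the
formula are illustrative assumptions, not SpamAssassin's actual Bayes
code. It uses the numbers from the example quoted above.

```python
def token_spam_prob(spam_hits, n_spam, ham_hits, n_ham):
    """Simplified per-token spam probability (illustrative assumption,
    not SpamAssassin's real formula)."""
    spam_freq = spam_hits / n_spam  # fraction of spam containing the token
    ham_freq = ham_hits / n_ham     # fraction of ham containing the token
    return spam_freq / (spam_freq + ham_freq)

# Token present in 10% of ham and 0.5% of spam.
# 10,000 hams and 10,000 spams:
balanced = token_spam_prob(50, 10_000, 1_000, 10_000)
# 10,000 hams and 1,000,000 spams: same percentages, just more spam hits.
skewed = token_spam_prob(5_000, 1_000_000, 1_000, 10_000)

# With only 100 spams, a 0.5% token is seen either 0 times or 1 time,
# so the measured spam frequency is 0% or 1%, never close to 0.5%.
tiny_low = token_spam_prob(0, 100, 10, 100)
tiny_high = token_spam_prob(1, 100, 10, 100)

print(balanced, skewed)     # the two estimates agree
print(tiny_low, tiny_high)  # the small-corpus estimates disagree widely
```

The first pair shows that a 1:1 and a 1:100 ham-to-spam ratio give the
same token probability when the per-class percentages are unchanged.
The second pair shows the effect of small counts: the estimate swings
depending on whether the token happened to appear once or not at all.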
