Reindl Harald [mailto:[email protected]] wrote:
> > However, that doesn't happen.
> > 0.000 0 338770 0 non-token data: nspam
> > 0.000 0 1460807 0 non-token data: nham
> what do you expect when you train 4 times more ham than spam?
> frankly, you "flooded" your Bayes with 1.4 million ham samples, and I thought
> our 140k total corpus was large - don't forget that ham messages are
> typically larger than junk, which often just tries to point you to a URL
> with a few words
>
> 108897 SPAM
> 31492 HAM
This is a production mail gateway that has been in service since 2015. I saw that
some messages (both ham and spam) are automatically learned by
amavisd/spamassassin. Today's statistics:
3616 autolearn=ham
10076 autolearn=no
2817 autolearn=spam
134 autolearn=unavailable
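(For reference, counts like these can be pulled from the amavisd log by grepping
for the autolearn tag. The log lines below are hypothetical stand-ins in the
amavisd style, just to demonstrate the pipeline; on the real gateway you would
point it at your actual mail log, whose path and format vary per setup.)

```shell
# Demo input: fabricated amavisd-style lines; replace the printf with
# e.g. `cat /var/log/maillog` (path differs per distro) in real use.
printf '%s\n' \
  'amavis[111]: (111-01) Passed CLEAN, ... Tests: [...] autolearn=ham' \
  'amavis[222]: (222-01) Blocked SPAM, ... Tests: [...] autolearn=spam' \
  'amavis[333]: (333-01) Passed CLEAN, ... Tests: [...] autolearn=no' \
| grep -o 'autolearn=[a-z]*' | sort | uniq -c | sort -rn
```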
I think I have no control over what is learnt automatically.
Let's just assume for a moment that 1.4M ham-samples are valid.
Is there a ham:spam ratio I should stick to? I presume that with a 1:1 ratio,
future spam messages would no longer slip through unrecognized.
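(For concreteness, the imbalance implied by the bayes counts quoted at the top,
computed from the nspam/nham figures copied out of that dump:)

```shell
# nspam=338770 and nham=1460807 are the counts from the bayes dump above;
# awk prints the resulting ham:spam ratio to one decimal place.
awk 'BEGIN { nspam = 338770; nham = 1460807;
             printf "ham:spam ratio is about %.1f:1\n", nham / nspam }'
# prints "ham:spam ratio is about 4.3:1"
```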
Regards
Szabolcs