On Wed, 2 Dec 2015 17:14:22 +0000
Sebastian Arcus wrote:

> On 02/12/15 12:55, Reindl Harald wrote:
> >
> >
> > Am 02.12.2015 um 12:51 schrieb Sebastian Arcus:  
> >> I hope I'm not exceeding the patience of the list by posting a
> >> third question in two days :-)
> >>
> >> I realise the above question is a "soft" question, probably
> >> without a definite "yes" or "no" answer.

Very true.

> > additionally we share our bayes with another company which pulls
> > the dumps if the hash file is different every 30 minutes
> >
> > we as well as the other company does mail hosting on ISP level and
> > the results on both sides are perfect - we share even scorings, 
> > whitelists, custom body/subject-rules and the summary is: at least
> > in the same country sharing spamfilter configurations works like a
> > charme  
> 
> Perfect - that's exactly the sort of real-life based advice I was 
> looking for. Many thanks!

It's not really surprising that the diverse mail of two similar ISPs
looks similar to Bayes, especially with the headers removed. Whether your
ham looks like your client's ham is an entirely different matter. If
the ham isn't similar, then using your ham-heavy database is likely to
be sub-optimal.


There's also the ham:spam ratio - at one point you quoted a figure of
12000:300. An imbalance is not intrinsically wrong, but it could cause
problems if you transplant the database into a system where new training
occurs at a very different ratio. Any new tokens that first appear in the
second system are then heavily skewed towards being treated as spammy.
It's particularly bad if you strip headers from your corpus and the
client then goes on to train without stripping them: neutral header
tokens that never made it into your database enter theirs as heavily
spammy.
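
To put rough numbers on that, here is a small sketch. It assumes a
simplified per-class-normalised token probability (spam rate divided by
spam rate plus ham rate), which is only an approximation of what a
Bayes implementation really computes, and the function name and the
10+10 training counts are made up for illustration. Only the 12000:300
figure comes from this thread.

```python
def token_spam_probability(spam_hits, ham_hits, nspam, nham):
    """Simplified spam probability for one token.

    Each count is normalised by the number of messages trained in its
    class. Real implementations add smoothing on top of this.
    """
    spam_rate = spam_hits / nspam
    ham_rate = ham_hits / nham
    return spam_rate / (spam_rate + ham_rate)


# Imported database: the 12000:300 ham:spam figure quoted earlier.
IMPORTED_HAM, IMPORTED_SPAM = 12000, 300

# Hypothetical new training on the client side: 10 ham and 10 spam,
# all containing the same neutral token (say, a header that was
# stripped from the original corpus).
NEW_HAM, NEW_SPAM = 10, 10

# The token is new, but the class totals include the imported messages.
skewed = token_spam_probability(
    NEW_SPAM, NEW_HAM, IMPORTED_SPAM + NEW_SPAM, IMPORTED_HAM + NEW_HAM
)

# Same training on an empty database, for comparison.
fresh = token_spam_probability(NEW_SPAM, NEW_HAM, NEW_SPAM, NEW_HAM)

print(f"on top of the imported database: {skewed:.3f}")
print(f"on a fresh database:             {fresh:.3f}")
```

Under those assumptions the token comes out at roughly 0.97 on top of
the imported database, against 0.5 on a fresh one, even though it
turned up in ham exactly as often as in spam.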
