On Wed, 2 Dec 2015 17:14:22 +0000
Sebastian Arcus wrote:

> On 02/12/15 12:55, Reindl Harald wrote:
> >
> >
> > Am 02.12.2015 um 12:51 schrieb Sebastian Arcus:
> >> I hope I'm not exceeding the patience of the list by posting a
> >> third question in two days :-)
> >>
> >> I realise the above question is a "soft" question, probably
> >> without a definite "yes" or "no" answer.
Very true.

> > additionally we share our bayes with another company which pulls
> > the dumps if the hash file is different every 30 minutes
> >
> > we as well as the other company does mail hosting on ISP level and
> > the results on both sides are perfect - we share even scorings,
> > whitelists, custom body/subject-rules and the summary is: at least
> > in the same country sharing spamfilter configurations works like a
> > charme
>
> Perfect - that's exactly the sort of real-life based advice I was
> looking for. Many thanks!

It's not really surprising that the diverse mail of two similar ISPs
is similar for Bayes, especially with the headers removed. Whether
your ham looks like your client's ham is an entirely different matter.
If the ham isn't similar, then using your ham-heavy database is likely
to be sub-optimal.

There's also the ham:spam ratio - at one point you quoted a figure of
12000:300. An imbalance is not intrinsically wrong, but it could cause
problems if you transplant the database into a system where new
training occurs at a very different ratio. Any new tokens that appear
in the second system are then heavily skewed towards being treated as
spammy, since per-token counts are weighed against the overall ham and
spam totals.

What's particularly bad is if you strip headers in your corpus and the
client then goes on to train without stripping them: neutral tokens
that were stripped from your corpus enter the database as heavily
spammy.
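
To put rough numbers on the skew, here is a minimal sketch of a
Robinson-style per-token probability. It is a simplification for
illustration, not SpamAssassin's exact code: the function name, the
prior strength s and the neutral prior x are assumptions, and the
client's 50 ham / 50 spam training figures are made up. Only the
12000:300 starting point comes from the thread.

```python
def token_spam_probability(spam_hits, ham_hits, n_spam, n_ham, s=1.0, x=0.5):
    """Simplified Robinson-style estimate of how spammy one token is.

    spam_hits / ham_hits: messages of each class containing the token.
    n_spam / n_ham: total messages of each class in the database.
    s, x: assumed prior strength and neutral prior (illustrative values).
    """
    n = spam_hits + ham_hits
    if n == 0:
        return x  # token never seen: fall back to the neutral prior
    spam_freq = spam_hits / n_spam  # fraction of all spam with the token
    ham_freq = ham_hits / n_ham     # fraction of all ham with the token
    raw = spam_freq / (spam_freq + ham_freq)
    # Pull the raw estimate towards the prior when evidence is thin.
    return (s * x + n * raw) / (s + n)


# Transplanted database: 12000 ham, 300 spam (the figures quoted earlier).
# Hypothetical client training afterwards: 50 ham and 50 spam.
# A brand-new neutral token turns up in 5 of the new ham and 5 of the new spam.
skewed = token_spam_probability(5, 5, n_spam=300 + 50, n_ham=12000 + 50)

# The same token trained into an empty database at the same 50:50 ratio.
balanced = token_spam_probability(5, 5, n_spam=50, n_ham=50)

print(f"transplanted db: {skewed:.3f}  fresh db: {balanced:.3f}")
```

The token is seen equally often in new ham and new spam, yet in the
transplanted database 5 hits out of 350 spam counts for far more than
5 hits out of 12050 ham, so it scores as strongly spammy. Header
tokens that were stripped from the original corpus behave exactly
like this "new" token once the client trains with headers intact.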