On 2/28/02 7:06 AM, "Shane Williams" <[EMAIL PROTECTED]> wrote:
> On Thu, 28 Feb 2002, Michael Moncur wrote:
>
>> While some of the negative scores (like DEAR_SOMEBODY) might have
>> really turned into legitimate indicators of non-spam, I don't think
>> any message deserves having its spam score reduced by 8 points by
>> virtue of its mentioning "www.monsterhut.com", a well-known spam
>> source.
>
> This got me thinking.  Does the corpus contain emails discussing spam?
> If so, that would clearly throw off the evolution of scores.

That's not the problem here -- the corpus contains 40 "spam" messages
with monsterhut and 0 nonspams with monsterhut.  The score is coming out
negative because the monsterhut test is non-discriminating: every
monsterhut message hits enough other rules that its total score is over
the threshold regardless of whether the monsterhut rule is scored high
or low.  Since the rule never changes how a message is classified, the
GA has no reason to prefer one score for it over another.

> Similarly, I think part of the problem is that everybody's spam and
> non-spam may be vastly different.  Obviously, the more sources the
> corpus is drawn from the less this will be an issue, but until then
> the GA will be creating scores tuned more accurately for the types of
> users who submit to the corpus.

This is true, but much effort has been put into making the corpus
representative of a broad range of mail.  Spam is much more similar from
one user to the next than nonspam is.  For example, business users tend
to like emails with dollar signs in them more than techie users do.

I think there are basically 3 categories of email users for spam-id
purposes (if we go down the road of letting the user choose which
rules/scores to apply to themselves): TECHIE, BUSINESS, and MOM (aka AOL
user).  There might possibly be another category, COLLEGEKID or
something, but I don't know whether that would just be subsumed into one
of the other 3.  It's also possible that BUSINESS would work fine for
MOM.

C

_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk
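
P.S. To illustrate the point about the monsterhut test being
non-discriminating, here is a rough toy sketch in Python.  It is not the
real GA or the real corpus.  The threshold of 5.0, the other rule names
(RULE_A, RULE_B, RULE_C, DOLLARS), their scores, and the nonspam counts
are all made up for illustration; only the "40 spam, 0 nonspam" split
for monsterhut comes from the numbers above.

```python
THRESHOLD = 5.0  # assumed spam threshold, for illustration only

# Hypothetical rules and scores; these are not real SpamAssassin values.
BASE_SCORES = {"RULE_A": 6.0, "RULE_B": 5.0, "RULE_C": 4.0, "DOLLARS": 1.0}

# Toy corpus of (is_spam, rules_hit) pairs: 40 spam messages that mention
# monsterhut and also hit three other rules, plus nonspam that never
# hits the monsterhut rule.
CORPUS = (
    [(True, {"MONSTERHUT", "RULE_A", "RULE_B", "RULE_C"})] * 40
    + [(False, {"DOLLARS"})] * 40
    + [(False, set())] * 20
)


def total_score(rules_hit, scores):
    """Sum the scores of every rule a message hits."""
    return sum(scores[rule] for rule in rules_hit)


def misclassified(corpus, scores, threshold=THRESHOLD):
    """Count messages whose spam/nonspam verdict disagrees with the label."""
    errors = 0
    for is_spam, rules_hit in corpus:
        flagged = total_score(rules_hit, scores) >= threshold
        if flagged != is_spam:
            errors += 1
    return errors


def errors_for_monsterhut_score(value):
    """Error count on the toy corpus with MONSTERHUT scored at `value`."""
    scores = dict(BASE_SCORES, MONSTERHUT=value)
    return misclassified(CORPUS, scores)


if __name__ == "__main__":
    # Try a strongly negative, a neutral, and a strongly positive score.
    for value in (-8.0, 0.0, 8.0):
        print(value, errors_for_monsterhut_score(value))
```

The error count is the same for all three values, because the other
rules already push every monsterhut message over the threshold.  A
fitness function based only on misclassifications therefore cannot tell
-8 from +8 for this rule, so whatever score the GA ends up with is
essentially arbitrary.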