> This is a misunderstanding. I am largely against
> whitelisting or negative score rules. I merely intend to
> increase the variety of legitimate mail in the nightly ham
> corpus so our spam-hostile rules can be better tested for
> safety. This will be interesting especially with non-English ham.
>
> Warren

Warren,

So, are you going to keep two or more corpus datasets: one as it is now, and one with the new mail added, for comparison?

Initially this came across as a really suspect idea, i.e., one man's junk is another man's treasure. For a moment, it appeared we were going to need to review the good and the bad of spam-l to avoid serious SA list issues.

Statistically speaking, this shouldn't sway the scoring substantially anyway, should it?

What should be known so that bad data is not allowed into the ham corpus?

- rh