Corpus maintenance is a manual process, done by me ahead of each SA release. It's more art than science. I delete old stuff, I add new stuff, and I delete stuff (old or new) that looks like it'd confuse the GA, or that does in fact confuse the GA on trial runs. When I think the corpus looks "good", I run the GA and check the output. If it looks funky, I re-check the corpus, often finding something odd-looking that I'd overlooked on previous checks. Then lather, rinse, repeat. Eventually I'm happy with the GA output and package a release x.y0. Then people on the mailing list point out problems, and I release x.y1 a few days later :)
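
For the curious, the GA step boils down to something like the loop below: evolve a set of per-rule scores so that the summed scores of the rules that hit each message separate the spam and ham halves of the corpus. This is only a toy Python sketch; the rule names, the tiny corpus, the fitness function and the 5.0 threshold are made up for illustration and are not the real SA GA code, but it shows the shape of what gets re-run every time the corpus changes.

import random

RULES = ["RULE_A", "RULE_B", "RULE_C"]   # hypothetical rule names
THRESHOLD = 5.0                          # call it spam if total score >= this

# Toy corpus: (set of rules that hit the message, is-it-spam label).
CORPUS = [
    ({"RULE_A", "RULE_B"}, True),
    ({"RULE_A"}, False),
    ({"RULE_C"}, True),
    (set(), False),
]

def fitness(scores):
    """Number of corpus messages this candidate score set classifies correctly."""
    correct = 0
    for hits, is_spam in CORPUS:
        total = sum(scores[r] for r in hits)
        if (total >= THRESHOLD) == is_spam:
            correct += 1
    return correct

def mutate(scores):
    """Copy a candidate and randomly nudge one rule's score."""
    child = dict(scores)
    child[random.choice(RULES)] += random.uniform(-1.0, 1.0)
    return child

# Random initial population; each generation keeps the fittest half
# and refills the rest with mutated copies of the survivors.
population = [{r: random.uniform(0.0, 5.0) for r in RULES} for _ in range(20)]
for generation in range(100):
    population.sort(key=fitness, reverse=True)
    survivors = population[:10]
    population = survivors + [mutate(random.choice(survivors)) for _ in range(10)]

best = max(population, key=fitness)
print("best scores:", best, "fitness:", fitness(best))

A real GA would also do crossover, use the full rule set and many thousands of messages, and so on; the point is just that the scores it settles on are only as good as the corpus it's trained against, which is why the hand-checking above matters.
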
Vivek Khera wrote:

VK> >>>>> "ON" == Olivier Nicole <[EMAIL PROTECTED]> writes:
VK>
VK> ON> Now, a score has to be assigned for each rule, so the GA and, as
VK> ON> importantly, a corpus of spam/non-spam is used. But going into too many
VK> ON> details may also be confusing, as most of the users will not need to
VK> ON> deal with the GA.
VK>
VK> Given the way this arms race evolves (we improve, they "improve"),
VK> does the corpus change over time with these tricks? That is, is the
VK> corpus just growing, or are older messages deleted so that the
VK> scoring matches more accurately what is happening now, rather than
VK> trying to average out all the tricks ever used by spammers (including
VK> those they don't use anymore)?