Corpus maintenance is a manual thing, done by me ahead of each SA release.
It's more art than science.  I delete old stuff, I add new stuff, I delete stuff
old or new that looks like it'd confuse the GA (or that does in fact confuse the
GA on trial runs).  When I think the corpus looks "good", I run the GA.  I check
the output.  If it looks funky, I re-check the corpus, often finding something
odd-looking that I'd overlooked on previous checks.  I then lather, rinse,
repeat.  Eventually, I'm happy with the GA output, and package a release x.y0.
Then people on the mailing list point out problems, and I release x.y1 a few
days later :)

Vivek Khera wrote:

VK> >>>>> "ON" == Olivier Nicole <[EMAIL PROTECTED]> writes:
VK>
VK> ON> Now, a score has to be assigned for each rule, so the GA and as
VK> ON> importantly a corpus of spam/non-spam is used. But going into too many
VK> ON> details may also be confusing as most of the users will not need to
VK> ON> deal with the GA.
VK>
VK> Given the way this arms race evolves (we improve, they "improve"),
VK> does the corpus change over time with these tricks?  That is, is the
VK> corpus just growing, or are older messages deleted so that the
VK> scoring matches more accurately what is happening now, rather than
VK> trying to average out all the tricks ever used by spammers (including
VK> those they don't use anymore)?


_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk
