Hello PieterB,

Justin has already answered, better than I can, but I'll add my two cents:
Wednesday, January 14, 2004, 4:23:07 AM, you wrote:

P> I would like to start contributing to spamassassin and help to fight
P> spam.

Fantastic. Welcome aboard.

P> http://au.spamassassin.org/hacking.html lists how to submit
P> mass-check results. I have a couple of questions:

P> * The CORPUS_POLICY lists that you should use hand-verified spam/ham
P> files, but the CORPUS_SUBMIT lists that you should only check the
P> top 20 spam/ham messages. I'm pretty sure my corpus is quite good,
P> but I don't want to check every message by hand. Can anybody
P> elaborate on this policy?

I check every message by hand. By that I mean that once or twice a day I visit my spamtrap and my hamtrap. My email client (The Bat!) makes this very easy.

Spam is sorted by subject, and I review the list. Obfuscated subjects and repeated subjects (a dozen emails with the same subject) are skipped right over. I glance at the subject of every email with a normal-looking, unique subject header. If there's any question that it might be ham, I then actually look at the email. The main reason for doing this is actually to find FPs. Found one resume flagged FP last week, and a very long email with very specialized words in it this week. First time in a long time I've had two FPs in two weeks.

All verified spam is then sent first to sa-learn, and then to a weekly spamtrap. At the end of the week I eliminate all duplicates from the weekly spamtrap, and add the remaining emails to my corpus.

I also scan the Subject, From, and To headers of all ham, looking for spam. I glance at any that may be spam, and if they are, I drop them into my FN folder. FNs go into sa-learn as spam. I then also see if there are simple rules I can add to catch this spam in the future, and they eventually end up in the same weekly spam file.

The ham is separated into two piles -- that which discusses spam technicalities and therefore can have spamsign in it, and normal ham.
The normal ham goes into sa-learn as ham, and into a weekly ham file, which at the end of the week is weeded for duplicates, and added to my corpus. The spamsign emails are purged after a short while.

P> * Should the corpora be approx. 50% ham and 50% spam?

Some say yes, some say no. The trick is to include as much ham as possible, since these days spam so easily outnumbers ham. I currently have 74874 spam and 17338 ham in my corpus. That's 4 months' spam, and 3 years' ham.

P> * How many people submit their mass-check results? How many messages
P> are in their corpora?

My numbers are above. I've not yet submitted my mass-check results to the central activity, but hope to begin doing so shortly.

Bob Menschel

_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk
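
For anyone who wants to script the weekly duplicate-weeding step mentioned in the message, here is a minimal sketch in Python. The message does not say how duplicates are identified, so this assumes that two emails count as duplicates when their bodies match after whitespace is collapsed. The function names `body_fingerprint` and `dedupe` are hypothetical, chosen only for illustration.

```python
import hashlib
from email import message_from_string
from email.message import Message


def body_fingerprint(msg: Message) -> str:
    """Return a SHA-256 hash of the message body with whitespace collapsed.

    Assumption: "duplicate" means "same body text"; headers are ignored.
    """
    payload = msg.get_payload(decode=True)
    if payload is None:
        # Multipart message: join the raw text of every leaf part instead.
        text = "".join(
            str(part.get_payload())
            for part in msg.walk()
            if not part.is_multipart()
        )
        payload = text.encode("utf-8", "replace")
    normalized = b" ".join(payload.split())
    return hashlib.sha256(normalized).hexdigest()


def dedupe(messages):
    """Keep the first message seen for each distinct body fingerprint."""
    seen = set()
    unique = []
    for msg in messages:
        key = body_fingerprint(msg)
        if key not in seen:
            seen.add(key)
            unique.append(msg)
    return unique
```

Because `mailbox.mbox` from the standard library yields `Message` objects when iterated, something like `dedupe(mailbox.mbox("weekly-spam.mbox"))` should work on a weekly file (the path is a placeholder). Hashing only the body is a deliberate simplification: it catches the same spam sent under different subjects, but not mailings with randomized body text.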