Let the mass-checking begin! The mass-check results are used as input for the genetic algorithm (GA) that generates SpamAssassin rule scores. Basically, the people who have both an email corpus and the capability to run mass-check submit their mass-check data to the SA developers, and one of them (Theo is doing 2.50) runs the GA to generate new optimized scores.
This is the process for the second set of mass-checks for 2.50. Note
that the CVS tag name has changed from the first set. A bug was also
found that caused the CVS tag name to change again; please note the new
name below.

For this second set of mass-checks, there are two complete and separate
runs on your corpus:

  run 1: mass-check with Bayes and with network
  run 2: mass-check with Bayes and without network

The process is a wee bit complicated and it takes a while to run, so
we're giving everyone plenty of time to finish. The ground rules are
below. If you have questions or problems, please post to
spamassassin-talk.

------------------------------------------------------------------------

Here are the ground rules and procedure.

The CVS tag to be used for this mass-check is named:

  CORPORA_SUBMIT_VERSION_2_5_0_CHECK2_1

Use that tag name instead of the usual CURRENT_CORPORA_SUBMIT_VERSION or
the previous CORPORA_SUBMIT_VERSION_2_5_0_CHECK2.

Everyone has until Thu Feb 13 23:59:00 GMT 2003 to upload their
mass-check results. Earlier is better, of course. :-)

1. your corpus must follow the basic content policy described in
   masses/CORPUS_POLICY

2. use the process described in masses/CORPUS_SUBMIT for making sure
   your ham and spam is clean

3. we will use the procedure in masses/CORPUS_SUBMIT_NIGHTLY for this
   mass-check with the following modifications:

   a. Running the test:

      - check out using the CORPORA_SUBMIT_VERSION_2_5_0_CHECK2_1
        revision
      - rm masses/spamassassin/bayes* before every mass-check run
      - rm masses/spamassassin/user_prefs too
      - use a single mass-check command for each of the two runs so
        everything will be sorted by date (very important for Bayes!)
      - do *not* hand-train Bayes
      - please try to have a decent amount of ham relative to the
        amount of spam (perhaps twice as much for the larger of the
        spam/ham)

      AGAIN, YOU MUST REMOVE masses/spamassassin/bayes* before each
      check!!!

   b. Here are the commands and options for the runs.
      SAFE COMMANDS TO USE:

        run 1: "mass-check --net --all <targets>"
        run 2: "mass-check --all <targets>"

      Options for mass-check:

      - required options for *first run only*: --net
      - recommended options: --all
      - optional options: -j, --mid
      - do *not* use these options: -o, -n, --head, --tail, --mbox,
        --file, or --dir

      Note that the target mail folder specification used on the
      mass-check command line has changed since 2.43. See the top of
      the mass-check file for the format.

      Regarding -j, it is not recommended to go above -j 4 when network
      checks are on (or above the number of processors in your system
      when network checks are off). Also note that -j will not work if
      your system does not have Unix domain sockets. If it doesn't
      work, don't use it.

      A full listing of the commands to check out and run is given
      below if you need it.

   c. User preferences:

      For *this* run, masses/spamassassin/user_prefs should be
      completely empty. Please make sure that it does *not* contain any
      auto_learn setting. Better yet, unless you need to change any
      settings (dcc_path, etc), just delete the file.

   d. Which network tests have to be working:

      You need to have working DNS tests. Razor2/DCC/Pyzor support is
      also highly desired, so try to have that working too.

   e. Submitting your results:

      Upload them via rsync with these names:

        for run 1: "ham-bayes-net-username.log" and
                   "spam-bayes-net-username.log"
        for run 2: "ham-bayes-nonet-username.log" and
                   "spam-bayes-nonet-username.log"

      Also make sure the tag name of
      CORPORA_SUBMIT_VERSION_2_5_0_CHECK2_1 appears at the top of each.

      Everyone will need to have a login/password from Craig Hughes
      <[EMAIL PROTECTED]> to rsync since anonymous uploads won't work.

If I missed anything important in this procedure, please reply and let
me know, but assume the procedure has not changed until I post a new
"NOTICE" with "(REV3)" in the subject. That way, there will be less
confusion about what to do.
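
For anyone who wants to double-check the two pre-run requirements above (no leftover bayes* files, no auto_learn setting in user_prefs) before kicking off a long run, here is a minimal sketch of a pre-flight check. The function name preflight_check is made up for this example and is not part of the SpamAssassin tree; it only reports problems and never deletes anything.

```shell
#!/bin/sh
# Hypothetical helper, not shipped with SpamAssassin.
# Usage: preflight_check <path to masses/spamassassin>
# Returns 0 if the directory looks ready for a mass-check run, 1 otherwise.
preflight_check() {
    dir=$1
    status=0

    # Any leftover Bayes file would carry training over from an earlier run.
    # If the glob matches nothing it stays literal and the -e test fails.
    for f in "$dir"/bayes*; do
        if [ -e "$f" ]; then
            echo "leftover Bayes file: $f"
            status=1
        fi
    done

    # user_prefs may exist (e.g. for dcc_path) but must not mention
    # auto_learn. This is deliberately strict: a commented-out
    # auto_learn line is flagged as well.
    if [ -f "$dir/user_prefs" ] && grep -q 'auto_learn' "$dir/user_prefs"; then
        echo "user_prefs contains an auto_learn setting"
        status=1
    fi

    return $status
}
```

From the top of the checkout, something like "preflight_check masses/spamassassin && echo ready" would run it before each of the two runs.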
------------------------------------------------------------------------

CHECKING OUT AND BUILDING THE MASS-CHECK VERSION OF SPAMASSASSIN:

  cd /home/your/tmp-directory-for-mass-checking
  cvs -d:pserver:[EMAIL PROTECTED]:/cvsroot/spamassassin \
    login
  cvs -d:pserver:[EMAIL PROTECTED]:/cvsroot/spamassassin \
    co -r CORPORA_SUBMIT_VERSION_2_5_0_CHECK2_1 spamassassin
  cd spamassassin
  perl Makefile.PL < /dev/null
  make

RUNNING THE BAYES+NET RUN:

  cd masses
  rm spamassassin/bayes*
  rm spamassassin/user_prefs
  rm ham.log spam.log
  ./mass-check --net -j 4 --all <targets>
  head ham.log | grep CORPORA_SUBMIT_VERSION_2_5_0_CHECK2_1
  head spam.log | grep CORPORA_SUBMIT_VERSION_2_5_0_CHECK2_1
  [make sure the Tag line appears correctly]
  RSYNC_PASSWORD="[whatever_craig_told_you_it_was]"
  export RSYNC_PASSWORD
  rsync -CPcvuzb ham.log \
    [EMAIL PROTECTED]::corpus/ham-bayes-net-username.log
  rsync -CPcvuzb spam.log \
    [EMAIL PROTECTED]::corpus/spam-bayes-net-username.log

RUNNING THE BAYES+NO-NET RUN:

  cd masses
  rm spamassassin/bayes*
  rm spamassassin/user_prefs
  rm ham.log spam.log
  ./mass-check --all <targets>
  head ham.log | grep CORPORA_SUBMIT_VERSION_2_5_0_CHECK2_1
  head spam.log | grep CORPORA_SUBMIT_VERSION_2_5_0_CHECK2_1
  [make sure the Tag line appears correctly]
  RSYNC_PASSWORD="[whatever_craig_told_you_it_was]"
  export RSYNC_PASSWORD
  rsync -CPcvuzb ham.log \
    [EMAIL PROTECTED]::corpus/ham-bayes-nonet-username.log
  rsync -CPcvuzb spam.log \
    [EMAIL PROTECTED]::corpus/spam-bayes-nonet-username.log
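
If you would rather not eyeball the Tag line and type the upload names by hand, the check and the naming scheme above can be wrapped in two small shell functions. This is only a sketch: has_tag and upload_name are names invented for this example, and has_tag does the same "head | grep" test shown in the listing, just quietly.

```shell
#!/bin/sh
# Hypothetical helpers, not shipped with SpamAssassin.
TAG=CORPORA_SUBMIT_VERSION_2_5_0_CHECK2_1

# Succeeds only if the tag shows up in the first ten lines of the log.
# Usage: has_tag ham.log
has_tag() {
    head "$1" | grep -q "$TAG"
}

# Prints the upload name for a log file.
# Usage: upload_name <ham|spam> <net|nonet> <username>
upload_name() {
    echo "$1-bayes-$2-$3.log"
}
```

For example, "has_tag ham.log && echo $(upload_name ham net username)" prints ham-bayes-net-username.log only when the log carries the right tag. The rsync step itself stays exactly as in the listing.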