Let the mass-checking begin!

The mass-check results are used as input for the genetic algorithm (GA)
that generates SpamAssassin rule scores.  Basically, the people who have
both an email corpus and the capability to run mass-check submit their
mass-check data to the SA developers and one of them (Theo is doing
2.50) runs the GA to generate new optimized scores.

This is the process for the second set of mass-checks for 2.50.  Note that
the CVS tag name has changed from the first set.  A bug was also found
that forced the tag name to change again, so please note the new name below.

For this second set of mass-checks, there are two complete and separate
runs on your corpus:

  run 1: mass-check with Bayes and with network
  run 2: mass-check with Bayes and without network

The process is a wee bit complicated and it takes a while to run, so
we're giving everyone plenty of time to finish.  The ground rules are
below.

If you have questions or problems, please post to spamassassin-talk.

------------------------------------------------------------------------

Here are the ground rules and procedure.  The CVS tag to be used for
this mass-check is named:

  CORPORA_SUBMIT_VERSION_2_5_0_CHECK2_1

Use that tag name instead of the usual CURRENT_CORPORA_SUBMIT_VERSION
or the previous CORPORA_SUBMIT_VERSION_2_5_0_CHECK2.

Everyone has until Thu Feb 13 23:59:00 GMT 2003 to upload their mass-check
results.  Earlier is better, of course.  :-)

1. your corpus must follow the basic content policy described in
   masses/CORPUS_POLICY

2. use the process described in masses/CORPUS_SUBMIT for making sure
   your ham and spam are clean

3. we will use the procedure in masses/CORPUS_SUBMIT_NIGHTLY for this
   mass-check with the following modifications:

   a. Running the test:
      - check out using the CORPORA_SUBMIT_VERSION_2_5_0_CHECK2_1 revision
      - rm masses/spamassassin/bayes* before every mass-check run
      - rm masses/spamassassin/user_prefs too
      - use a single mass-check command for each of the two runs so
        everything will be sorted by date (very important for Bayes!)
      - do *not* hand-train Bayes
      - please try to keep your ham and spam reasonably balanced (the
        larger of the two perhaps no more than twice the size of the
        smaller)

      AGAIN, YOU MUST REMOVE masses/spamassassin/bayes* before each check!!!

   b. Here are the commands and options for the runs.

      SAFE COMMANDS TO USE:

      run 1: "mass-check --net --all <targets>"
      run 2: "mass-check --all <targets>"

      Options for mass-check:

      - required options for *first run only*: --net
      - recommended options: --all
      - optional options: -j, --mid
      - do *not* use these options: -o, -n, --head, --tail, --mbox,
                                    --file, or --dir

      Note that the target mail folder specification used on the
      mass-check command line has changed since 2.43.  See the top of the
      mass-check file for the format.

      Regarding -j, it is not recommended to go above -j 4 when network
      checks are on (or above the number of processors in your system when
      network checks are off).  Also note that -j will not work if your
      system does not have Unix domain sockets.  If it doesn't work, don't
      use it.
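
      The -j guidance above can be sketched as a small shell helper.
      This is only an illustration: pick_jobs is a hypothetical name,
      not part of mass-check, and using getconf to count processors is
      an assumption that may not hold on every system (it falls back
      to 1).

```shell
# Hypothetical helper (not part of mass-check): suggest a -j value.
# usage: pick_jobs net|nonet [ncpu]
pick_jobs() {
  mode=$1
  # Assumption: getconf _NPROCESSORS_ONLN reports the processor count;
  # fall back to 1 if it is unavailable.
  ncpu=${2:-$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)}
  if [ "$mode" = "net" ]; then
    echo 4          # suggested ceiling when network checks are on
  else
    echo "$ncpu"    # no more than the number of processors otherwise
  fi
}
```

      The result could then be passed along as, for example,
      -j "$(pick_jobs net)" on the run 1 command line.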

      A full listing of the commands to check out and run is given
      below if you need it.

   c. User preferences:

      For *this* run, masses/spamassassin/user_prefs should be completely
      empty.  Please make sure that it does *not* contain any auto_learn
      setting.  Better yet, unless you need to change any settings
      (dcc_path, etc), just delete the file.
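
      A quick sanity check for the user_prefs requirement might look
      like the sketch below.  prefs_ok is a hypothetical helper, and
      matching the plain string "auto_learn" is a deliberately
      conservative assumption (it also flags commented-out lines).

```shell
# Hypothetical helper: succeed only if the prefs file is absent, empty,
# or contains no mention of auto_learn.
prefs_ok() {
  prefs=$1
  # A missing file is fine: deleting it is the preferred option.
  [ ! -e "$prefs" ] && return 0
  # Fail if any line mentions auto_learn (including commented-out lines).
  ! grep -q 'auto_learn' "$prefs"
}
```

      For example, run from the masses directory:
      prefs_ok spamassassin/user_prefs || echo "fix user_prefs first"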

   d. Which network tests have to be working

      You need to have working DNS tests.  Razor2/DCC/Pyzor support is
      also highly desired so try to have that working too.

   e. Submitting your results

      Upload them via rsync with these names:

      for run 1: "ham-bayes-net-username.log" and
                 "spam-bayes-net-username.log"

      for run 2: "ham-bayes-nonet-username.log" and
                 "spam-bayes-nonet-username.log"

      Also make sure the tag name CORPORA_SUBMIT_VERSION_2_5_0_CHECK2_1
      appears at the top of each.

      Everyone will need to have a login/password from Craig Hughes
      <[EMAIL PROTECTED]> to rsync since anonymous uploads won't
      work.
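
      The naming scheme and the tag check can be sketched in shell as
      follows.  Both has_tag and upload_name are hypothetical helpers
      written for illustration; the tag value is the one given in the
      ground rules, and "username" stands for your own rsync login.

```shell
# Tag named in the ground rules for this mass-check.
TAG=CORPORA_SUBMIT_VERSION_2_5_0_CHECK2_1

# Hypothetical helper: succeed if the tag appears in the first ten
# lines of the given log file.
has_tag() {
  head "$1" | grep -q "$TAG"
}

# Hypothetical helper: build the upload name from a class (ham|spam),
# a run (net|nonet) and a username.
upload_name() {
  echo "$1-bayes-$2-$3.log"
}
```

      For example, upload_name spam nonet username gives the name to
      use for spam.log from run 2, and has_tag ham.log should succeed
      before you upload.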

If I missed anything important in this procedure, please reply and let
me know, but assume the procedure has not changed until I post a new
"NOTICE" with "(REV3)" in the subject.  That way, there will be less
confusion about what to do.

------------------------------------------------------------------------

CHECKING OUT AND BUILDING THE MASS-CHECK VERSION OF SPAMASSASSIN:

  cd /home/your/tmp-directory-for-mass-checking
  cvs -d:pserver:[EMAIL PROTECTED]:/cvsroot/spamassassin \
                  login
  cvs -d:pserver:[EMAIL PROTECTED]:/cvsroot/spamassassin \
                  co -r CORPORA_SUBMIT_VERSION_2_5_0_CHECK2_1 spamassassin
  cd spamassassin
  perl Makefile.PL < /dev/null
  make


RUNNING THE BAYES+NET RUN:

  cd masses
  rm spamassassin/bayes*
  rm spamassassin/user_prefs
  rm ham.log spam.log

  ./mass-check --net -j 4 --all <targets>

  head ham.log | grep CORPORA_SUBMIT_VERSION_2_5_0_CHECK2_1
  head spam.log | grep CORPORA_SUBMIT_VERSION_2_5_0_CHECK2_1
  [make sure the Tag line appears correctly]

  RSYNC_PASSWORD="[whatever_craig_told_you_it_was]"
  export RSYNC_PASSWORD

  rsync -CPcvuzb ham.log \
    [EMAIL PROTECTED]::corpus/ham-bayes-net-username.log
  rsync -CPcvuzb spam.log \
    [EMAIL PROTECTED]::corpus/spam-bayes-net-username.log



RUNNING THE BAYES+NO-NET RUN:

  cd masses
  rm spamassassin/bayes*
  rm spamassassin/user_prefs
  rm ham.log spam.log

  ./mass-check --all <targets>

  head ham.log | grep CORPORA_SUBMIT_VERSION_2_5_0_CHECK2_1
  head spam.log | grep CORPORA_SUBMIT_VERSION_2_5_0_CHECK2_1
  [make sure the Tag line appears correctly]

  RSYNC_PASSWORD="[whatever_craig_told_you_it_was]"
  export RSYNC_PASSWORD

  rsync -CPcvuzb ham.log \
    [EMAIL PROTECTED]::corpus/ham-bayes-nonet-username.log
  rsync -CPcvuzb spam.log \
    [EMAIL PROTECTED]::corpus/spam-bayes-nonet-username.log
