Jeff Chan wrote to SURBL Discuss and SpamAssassin Users:

In order to reduce false positives in the SURBL data, we would
like to have access to ham corpora.  Does anyone know of any
public ham corpora, including just the URI domain names from the
hams?  Or is there anyone who would be willing to run our URI
domain lists against their ham?

Does anyone know if messages from the Enron corpus have been
categorized for ham and spam?

 http://www-2.cs.cmu.edu/~enron/

Thanks in advance for any suggestions, comments, thoughts....

FWIW, the mass-check I did on that 75K corpus took about 1.75h on a
beefy machine, with rbldnsd running on localhost and 20 concurrent jobs.
(mass-check is slower than molasses for anything that blocks if you
don't let it run concurrent jobs. :-)

Now, I know not everybody runs SpamAssassin, but it *does* have a really
easy log format and hit-frequencies program. It's possible to
concatenate ham and spam logs from different sources to effectively get
statistics on a larger corpus... and only the test hits are stored in
the log, so the results are effectively anonymous.

There's ham.log for ham, and spam.log for spam, and the entries look
like this, one line per message:

Y  7 /spamdir/11710. URIBL_OB_SURBL,URIBL_WS_SURBL time=1089946124
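
If you just want a rough per-test tally out of concatenated logs without
running hit-frequencies, something like the sketch below should do. It is
not part of SA: parse_log_line and tally_hits are names I made up, and the
code assumes whitespace-separated columns in the order shown above, paths
without spaces, and that lines starting with "#" are comments to skip.

```python
from collections import Counter


def parse_log_line(line):
    """Split one log entry into (flag, score, path, tests).

    Returns None for blank lines, comment lines and anything too short
    to be an entry. Assumes the path contains no whitespace.
    """
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    fields = line.split()
    if len(fields) < 4:
        return None
    flag, score, path, tests = fields[:4]
    # Assumption: if no test hit, the tests column is empty and the
    # fourth field is already a key=value extra such as time=...
    if "=" in tests:
        tests = ""
    names = [t for t in tests.split(",") if t]
    return flag, float(score), path, names


def tally_hits(lines):
    """Count messages and per-test hits over any iterable of log lines."""
    messages = 0
    hits = Counter()
    for line in lines:
        entry = parse_log_line(line)
        if entry is None:
            continue
        messages += 1
        # Count each test at most once per message.
        hits.update(set(entry[3]))
    return messages, hits


if __name__ == "__main__":
    sample = "Y  7 /spamdir/11710. URIBL_OB_SURBL,URIBL_WS_SURBL time=1089946124"
    total, hits = tally_hits([sample])
    for name, count in sorted(hits.items()):
        print(f"{name} {count}/{total}")
```

Since tally_hits takes any iterable of lines, logs from several sources
can be fed through it one after another (itertools.chain over the open
files works) to get counts over the combined corpus.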

Rather than re-invent the wheel, you can have your checkers output
simplified mass-check logs. The only column that matters is the tests
column. Something like this should work well enough for hit-frequencies:

N  0 <any_string> URIBL_TESTS_HIT,COMMA_DELIMITED time=<any_integer>

Then, grab hit-frequencies from the SA distribution and you can
reproduce the output that others have been posting.
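
For a checker that isn't SA, writing those simplified lines takes only a
few lines of code. Here is a minimal sketch: format_log_line is a
hypothetical helper of mine, not anything shipped with SA, and the fixed
"N" and "0" columns simply follow the template above, on the assumption
that only the tests column matters.

```python
import time


def format_log_line(hit_tests, message_id="-", timestamp=None):
    """Build one simplified mass-check-style log line.

    hit_tests:  iterable of test names that hit for this message
    message_id: any string without whitespace (it fills the path column)
    timestamp:  integer for the time= field; defaults to the current time
    """
    if any(ch.isspace() for ch in message_id):
        raise ValueError("message_id must not contain whitespace")
    if timestamp is None:
        timestamp = int(time.time())
    # Sort and de-duplicate so the same hits always give the same line.
    tests = ",".join(sorted(set(hit_tests)))
    # Flag and score are fixed placeholders, as in the template.
    return f"N  0 {message_id} {tests} time={timestamp}"


if __name__ == "__main__":
    print(format_log_line(["URIBL_WS_SURBL", "URIBL_OB_SURBL"], "msg-1"))
```

Append one such line per message to ham.log or spam.log, depending on
which corpus the message came from.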

If you *do* have SA installed (even if you don't filter your mail with
it), it's even easier. Just set up a simple .cf file with the URIBL
rules (I'll provide one on request), and invoke mass-check in the tools
directory like so:

    ./mass-check -p=../rules -c=../rules --net -j=20 --progress \
        spam:dir:${SPAMDIR} ham:dir:${HAMDIR}

Then run:

    ./hit-frequencies -s 3 -p

It's almost worth extracting Mail-SpamAssassin from CPAN just to gain
that functionality. You don't even have to *use* SA. :-)

- Ryan

--
  Ryan Thompson <[EMAIL PROTECTED]>

  SaskNow Technologies - http://www.sasknow.com
  901-1st Avenue North - Saskatoon, SK - S7K 1Y4

        Tel: 306-664-3600   Fax: 306-244-7037   Saskatoon
  Toll-Free: 877-727-5669     (877-SASKNOW)     North America
