Hello PieterB,

Justin has already answered, better than I can, but I'll add my two
cents:

Wednesday, January 14, 2004, 4:23:07 AM, you wrote:

P> I would like to start contributing to spamassassin and help to fight
P> spam.

Fantastic.  Welcome aboard.

P> http://au.spamassassin.org/hacking.html lists how to submit
P> mass-check results. I have a couple of questions:

P> * The CORPUS_POLICY lists that you should use hand-verified spam/ham
P>   files, but the CORPUS_SUBMIT lists that you should only check the
P>   top 20 spam/ham messages. I'm pretty sure my corpus is quite good,
P>   but I don't want to check every message by hand. Can anybody
P>   elaborate on this policy?

I check every message by hand.  By that I mean that once or twice a day I
visit my spamtrap and my hamtrap. My email client (The Bat!) makes this
very easy.

Spam is sorted by subject, and I review the list. Obfuscated
subjects and repeated subjects (a dozen emails with the same subject) are
skipped right over. I glance at the subject of every email with a normal
looking unique subject header. If there's any question that it might be
ham, I then actually look at the email.
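
As a rough illustration of the "skip repeated subjects" step above, here is
a minimal sketch. It is not what The Bat! does internally; the function
names and the normalization rule (lowercase, collapsed whitespace) are
assumptions made for the example, and it does not try to detect obfuscated
subjects.

```python
from collections import Counter


def normalize(subject: str) -> str:
    """Collapse whitespace and lowercase so near-identical subjects match."""
    return " ".join(subject.split()).lower()


def subjects_to_review(subjects):
    """Return only the subjects that occur once after normalization.

    Repeated subjects are treated as bulk spam and skipped; anything
    unique is kept for a manual glance.
    """
    counts = Counter(normalize(s) for s in subjects)
    return [s for s in subjects if counts[normalize(s)] == 1]


if __name__ == "__main__":
    spamtrap = [
        "Cheap meds",
        "cheap  MEDS",
        "Resume for the open position",
        "Cheap meds",
    ]
    for subject in subjects_to_review(spamtrap):
        print(subject)
```

Only the unique subjects come back, which are the ones worth a second look
for possible FPs.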

The main reason for doing this is actually to find FPs. Found one resume
flagged FP last week, and a very long email with very specialized words
in it this week. First time in a long time I've had two FPs in two weeks.

All verified spam is then sent first to sa-learn, and then to a weekly
spamtrap. At end of week I eliminate all duplicates from the weekly
spamtrap, and add the remaining emails to my corpus.
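
The weekly duplicate removal could be sketched roughly as follows. What
counts as a "duplicate" is not spelled out above, so this example assumes
two messages are duplicates when their bodies match after lowercasing and
collapsing whitespace; the helper names are hypothetical.

```python
import hashlib
from email import message_from_string


def body_fingerprint(raw_message: str) -> str:
    """Hash the message body, ignoring case and whitespace differences."""
    msg = message_from_string(raw_message)
    if msg.is_multipart():
        # Concatenate the payloads of all leaf (non-multipart) parts.
        leaves = [p for p in msg.walk() if not p.is_multipart()]
        payload = "".join(str(p.get_payload()) for p in leaves)
    else:
        payload = str(msg.get_payload())
    normalized = " ".join(payload.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8", "replace")).hexdigest()


def drop_duplicates(raw_messages):
    """Keep the first message seen for each body fingerprint."""
    seen = set()
    unique = []
    for raw in raw_messages:
        fingerprint = body_fingerprint(raw)
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(raw)
    return unique
```

Headers are deliberately ignored here, since the same spam usually arrives
with different Subject and Received lines.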

I also scan the Subject, From, and To headers of all ham, looking for
spam. I glance at any that may be spam, and if they are I drop them into
my FN folder. FNs go into sa-learn as spam. I then also see if there are
simple rules I can add to catch this spam in the future, and they
eventually end up in the same weekly spam file.

The ham is separated into two piles -- mail that discusses spam
technicalities and can therefore contain spamsign, and normal ham.
The normal ham goes into sa-learn as ham, and into a weekly ham file,
which at end of week is weeded for duplicates, and added to my corpus.
The spamsign emails are purged after a short while.

P> * Should the corpora be approx. 50% ham and 50% spam?

Some say yes, some say no. The trick is to include as much ham as
possible, since these days spam so easily outnumbers ham. I currently
have 74874 spam and 17338 ham in my corpus. That's 4 months' spam, and 3
years' ham.
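
To put those numbers in proportion, here is a small worked calculation. The
counts come from the paragraph above; treating 3 years as 36 months is an
assumption of the example.

```python
# Corpus counts quoted above.
spam_count, ham_count = 74874, 17338
# Collection periods: 4 months of spam, 3 years (assumed 36 months) of ham.
spam_months, ham_months = 4, 36

total = spam_count + ham_count
ham_share = ham_count / total           # fraction of the corpus that is ham
spam_to_ham = spam_count / ham_count    # spam messages per ham message
spam_per_month = spam_count / spam_months
ham_per_month = ham_count / ham_months

print(f"total messages:  {total}")
print(f"ham share:       {ham_share:.1%}")
print(f"spam:ham ratio:  {spam_to_ham:.2f}:1")
print(f"spam per month:  {spam_per_month:.1f}")
print(f"ham per month:   {ham_per_month:.1f}")
```

Even with nine times as many months of ham, the corpus is still under one
fifth ham, which is why it is far from a 50/50 split.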

P> * How many people submit their mass-check results? How many messages
P>   are in their corpora?

My numbers are above.  I've not yet submitted my mass-check results to
the central activity, but hope to begin doing so shortly.

Bob Menschel

_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk