At 01:42 PM 7/16/2003 -0700, I Am Jesus @sent wrote:
I see evidence that the current scores are not being optimally set. A LOT of the rule scores are apparently still at initial values (integers, 0.001, or 0.100), suggesting they weren't really optimized by the GA. It obviously isn't being run by a decent number of people on a decent amount of spam and ham and submitted to SA. Looks like the project is in need of some help in this area. If I weren't working on my own antispam idea, I'd do it... Please don't flame me if I've misread the evidence. (I refuse to bite flamebait. People who flame people for typos or grammar errors shouldn't, especially when their own writing isn't perfect either!)

You can look at STATISTICS.txt to see how many spam and nonspam messages were used as the corpus. It also gives hit results for the rules that didn't get scores assigned.


From 2.54:
# Correctly non-spam: 130678  56.21%  (99.92% of non-spam corpus)
# Correctly spam:      90057  38.74%  (88.55% of spam corpus)

So I'd consider over 220,000 messages to contradict your claim that it obviously isn't using a decent-sized corpus. However, the corpus could use a wider variety of participants. STATISTICS.txt doesn't show this, but rsyncing the mass-check data can give you an idea of how many submitters there are, although it only lists direct submitters and won't show who is incorporating an extra corpus of mail that someone else gave them.
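
For anyone who wants to check the arithmetic, here is a rough sketch that recomputes the totals from the two 2.54 lines quoted above. The counts and percentages are copied from STATISTICS.txt; the helper name corpus_estimate is made up for illustration, and the per-class totals are only estimates because the published percentages are rounded.

```python
# Figures copied from the 2.54 STATISTICS.txt lines quoted above.
correct_ham, ham_rate = 130678, 0.9992    # 99.92% of non-spam corpus
correct_spam, spam_rate = 90057, 0.8855   # 88.55% of spam corpus


def corpus_estimate(correct, rate):
    """Estimate a class's corpus size from its correct count and hit rate.

    Hypothetical helper: the rates are rounded to two decimals in
    STATISTICS.txt, so the result is approximate.
    """
    return round(correct / rate)


ham_total = corpus_estimate(correct_ham, ham_rate)
spam_total = corpus_estimate(correct_spam, spam_rate)

# Messages classified correctly (exact, straight from the quoted counts).
correctly_classified = correct_ham + correct_spam

print("correctly classified:", correctly_classified)
print("estimated nonspam corpus:", ham_total)
print("estimated spam corpus:", spam_total)
print("estimated total corpus:", ham_total + spam_total)
```

The "over 220,000" figure is the sum of the two correctly classified counts; the estimated full corpus comes out somewhat larger.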


But there's a drawback to having lots of corpus participants: anyone participating needs to be extraordinarily careful about how they classify emails. It only takes one spam message in the nonspam pile to cause some unfortunately poor score assignments. The GA leans away from FPs pretty hard (by a factor of 100), and a spam misfiled as nonspam looks like an FP to it, so that mistake is 100 times worse than a nonspam misfiled as spam.
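
To make the factor-of-100 point concrete, here is a toy sketch. It is not the actual GA fitness function, which isn't shown here; it assumes a simple cost of 100 per FP plus 1 per FN. The name weighted_cost, the threshold of 5.0, and the sample scores are all invented for illustration.

```python
FP_WEIGHT = 100  # the factor mentioned above; the cost formula itself is an assumption


def weighted_cost(messages, threshold=5.0):
    """Toy cost: FP_WEIGHT per false positive, 1 per false negative.

    messages is a list of (score, labeled_spam) pairs, where labeled_spam
    is the pile the submitter filed the message in, right or wrong.
    """
    fp = sum(1 for score, labeled_spam in messages
             if not labeled_spam and score >= threshold)
    fn = sum(1 for score, labeled_spam in messages
             if labeled_spam and score < threshold)
    return FP_WEIGHT * fp + fn


# A small, correctly sorted corpus: two nonspam, two spam (one spam missed).
clean = [(0.5, False), (1.2, False), (9.0, True), (3.0, True)]

# Same corpus, plus one real spam (score 12.0) misfiled in the nonspam pile.
polluted = clean + [(12.0, False)]

print("clean cost:", weighted_cost(clean))
print("polluted cost:", weighted_cost(polluted))
```

Under this assumed cost, the one misfiled spam costs as much as a hundred missed spams, so an optimizer would rather push down the scores of whatever rules hit it than accept the apparent FP.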


So there's this weird balance: SA needs as wide a variety of submitters as it can get, but it also needs to make sure all of them can be trusted to be diligent in their corpus maintenance. I've maintained my own corpus on the side for testing, and it's much harder than it sounds.





_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk
