Re: 2 + 2 != 4 - Spamassassin needs a new paradigm

Justin Mason Thu, 05 Mar 2009 02:07:29 -0800

On Thu, Mar 5, 2009 at 00:23, decoder <deco...@own-hero.net> wrote:
> decoder wrote:
>> Justin Mason wrote:
>>> So you're volunteering to code it up, then? ;)
>>
>> I was planning to do at least some brainstorming+experiments as to what
>> learning methods would seem suitable and how well the method performs,
>> whenever I have time again. Unless someone else did that already?
>>
>
> OK, I did some short experiments: I built an SVM classifier from a large
> mail corpus (8226 mails: 5414 ham, 2812 spam) and ran a 5-fold cross
> validation. The resulting classifier has an accuracy of over 99%, so it
> performs as well as the regular system. I then applied it to a set of 202
> false negatives that I had collected, and 69 of these are recognized as spam
> by the SVM. As a second test, I pulled 2707 mails from one of my other
> inboxes and applied the classifier; the accuracy was again over 99% (and
> this set is ham only).
>
> From my point of view, the results show that this approach has potential. It
> is about as accurate as the current system, and it additionally caught 69 of
> the 202 false negatives that the current system missed.
>
>
> There are other advantages that this system has over the common system: It
> allows everybody to train the whole spamfilter (not only Bayes) to the kind
> of spam that one receives, i.e. it is more adaptive than the common system.
>
>
> Any opinions on this are greatly welcome. Maybe we should try to come up
> with a proof of concept plugin for SA?
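
To put the quoted figures in perspective, here is a rough sketch in plain
Python. It only redoes the arithmetic on the numbers given above; the helper
name `kfold_sizes` is made up for illustration and says nothing about how the
original experiment was actually run.

```python
# Figures quoted in the message above.
TOTAL_MAILS = 8226
HAM, SPAM = 5414, 2812
FOLDS = 5
FALSE_NEGATIVES = 202
RECOVERED = 69
HAM_ONLY_TEST = 2707


def kfold_sizes(n_items, k):
    """Sizes of k near-equal folds covering n_items (hypothetical helper)."""
    base, extra = divmod(n_items, k)
    return [base + 1 if i < extra else base for i in range(k)]


fold_sizes = kfold_sizes(TOTAL_MAILS, FOLDS)
spam_share = SPAM / TOTAL_MAILS
recovered_share = RECOVERED / FALSE_NEGATIVES

# "Over 99%" accuracy on a ham-only set means fewer than 1% of those mails
# were misclassified, i.e. at most this many false positives.
max_false_positives = HAM_ONLY_TEST // 100

print("fold sizes:", fold_sizes)
print(f"spam share of corpus: {spam_share:.1%}")
print(f"false negatives caught by the SVM: {recovered_share:.1%}")
print("upper bound on false positives in the ham-only test:", max_false_positives)
```

In other words, roughly a third of the previously missed spam was caught, and
the ham-only test bounds the false positives at a few dozen mails at most.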


Thanks for doing this!  couple of q's:

1. I can offer a bigger ham/spam corpus if you'd like to test against that
as well; corpora from multiple contributors can sometimes expose training
set bias.

2. can you test it on spam that scored less than 10 points when it arrived?
low-scoring spam is, of course, more useful to hit than stuff that scored
highly on the existing rules.

3. does it give an indication of confidence in its results? or just a binary
"spam"/"ham" decision?

4. hey, if you're writing an SVM plugin, it might be worth making one that
_also_ supports body text tokens, similarly to the existing Bayes plugin. ;)

5. btw one particularly tricky part of dealing with user-trainable dbs, is
supporting expiry of old tokens.  but that can be deferred until later
anyway.
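
On question 3, for illustration: an SVM computes a signed decision value
before thresholding it, so a graded output is available in principle. Below
is a toy sketch in plain Python, assuming a linear kernel (the kernel and
library used in the experiment were not stated). The weights, bias, feature
vectors and function names are all invented for the example.

```python
import math

# Invented weights for three hypothetical rule-hit features, plus a bias.
WEIGHTS = [1.5, -2.0, 0.5]
BIAS = -0.25


def decision_value(features, weights=WEIGHTS, bias=BIAS):
    """Signed score of a linear SVM: w . x + b."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def classify(features):
    """Return the binary label together with the raw score behind it."""
    score = decision_value(features)
    label = "spam" if score > 0 else "ham"
    return label, score


def squash(score, slope=1.0):
    """Map a score into (0, 1) with a logistic curve.

    Platt scaling would fit the slope (and an offset) on held-out data;
    slope=1.0 here is an arbitrary placeholder, not a calibrated value.
    """
    return 1.0 / (1.0 + math.exp(-slope * score))


clear_spam = classify([1, 0, 1])      # far on the spam side
borderline = classify([0, 0, 1])      # barely on the spam side
clear_ham = classify([0, 1, 0])       # far on the ham side

for label, score in (clear_spam, borderline, clear_ham):
    print(label, round(score, 2), round(squash(score), 3))
```

The first two messages get the same binary label, but the size of the score
separates a clear hit from a borderline one, which is the kind of information
a score-based system could make use of.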

--j.
