Justin Mason wrote:
> Thanks for doing this! couple of q's:
>
> 1. I can offer a bigger ham/spam corpus if you'd like to test against
>    that as well; corpora from multiple contributors can sometimes
>    expose training set bias.

That would be cool :) Is this corpus already processed by SpamAssassin
(i.e. does it have SA headers)? My PoC code currently mines only the
headers to find out which rules were triggered.
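For concreteness, here is a minimal sketch (Python, with made-up function
and path names, not the actual PoC code) of that kind of header mining: it
pulls the score and the list of triggered rules out of the X-Spam-Status
header. The exact header layout can vary between SA versions and
configurations, so the regexes are only illustrative.

    import email
    import re

    def parse_spam_status(path):
        # Extract the SA score and the triggered rule names from one mail,
        # assuming an X-Spam-Status header roughly of the form
        #   'Yes, score=12.3 required=5.0 tests=RULE_A,RULE_B ...'.
        with open(path, "rb") as fh:
            msg = email.message_from_binary_file(fh)
        status = msg.get("X-Spam-Status", "")
        score_m = re.search(r"score=(-?\d+(?:\.\d+)?)", status)
        tests_m = re.search(r"tests=([A-Z0-9_,\s]+)", status)
        score = float(score_m.group(1)) if score_m else None
        rules = ([r.strip() for r in tests_m.group(1).split(",") if r.strip()]
                 if tests_m else [])
        return score, rules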
> 2. can you test it on spam that scored less than 10 points when it
>    arrived? low-scoring spam is, of course, more useful to hit than
>    stuff that scored highly on the existing rules.

Things like that should be easily possible. I need to check whether I have
enough mails to do a sufficiently reliable test here.
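Selecting that subset would just be a filter over the parsed scores; a
rough sketch building on parse_spam_status() above (the corpus glob and
threshold are illustrative, not part of the PoC):

    import glob

    def low_scoring_spam(corpus_glob="spam/*.eml", threshold=10.0):
        # Yield the paths of spam mails that scored below `threshold`
        # when they originally passed through SpamAssassin.
        for path in glob.glob(corpus_glob):
            score, _rules = parse_spam_status(path)
            if score is not None and score < threshold:
                yield path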
> 3. does it give an indication of confidence in its results? or just a
>    binary "spam"/"ham" decision?

I'm currently working only with a binary classifier. However, libsvm
supports probability estimates and regression (and to my knowledge, most
SVM implementations internally relax the classification output to a real
value and then use its sign to pick the class, so that value can also be
read as a kind of confidence).
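To illustrate the point (using scikit-learn's SVC, which wraps libsvm, on
made-up toy data rather than the actual plugin code): the signed distance
to the separating hyperplane is what the sign-based decision uses, and
probability=True additionally enables libsvm's Platt-scaled probability
estimates. On toy data this small the probabilities are not meaningful,
of course.

    import numpy as np
    from sklearn.svm import SVC

    # Toy data: one row per mail, one binary "rule fired" feature per column.
    X = np.array([[1, 0, 1], [1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 1, 1],
                  [0, 0, 0], [0, 1, 0], [0, 0, 1], [0, 0, 0], [0, 1, 0]])
    y = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])   # 1 = spam, 0 = ham

    clf = SVC(kernel="linear", probability=True).fit(X, y)

    new = np.array([[1, 0, 0]])
    print(clf.decision_function(new))   # signed distance to the hyperplane
    print(clf.predict(new))             # class follows the sign of that value
    print(clf.predict_proba(new))       # Platt-scaled probability estimate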
> 4. hey, if you're writing an SVM plugin, it might be worth making one
>    that _also_ supports body text tokens, similarly to the existing
>    Bayes plugin. ;)

This would surely be possible somehow, but we'd first have to come up with
a good representation of the problem for an SVM. I also wouldn't want to
mix it with the current experiment, since the two represent rather
different data. One of the problems with text tokens is that new ones can
always appear, which would grow the dimension of the problem and hence
require the whole SVM model to be rebuilt, so a system as capable as the
Bayes one might not work directly.
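One standard workaround for that growing-vocabulary problem (not something
I've implemented, purely for illustration) is the "hashing trick": tokens
are hashed into a fixed number of buckets, so never-before-seen tokens
still map into the same fixed-dimensional space, at the cost of occasional
collisions. A rough sketch:

    import hashlib

    def hash_tokens(tokens, n_buckets=2 ** 18):
        # Map an open-ended set of text tokens into a fixed-dimensional,
        # sparse count vector (bucket index -> count). New tokens still land
        # in one of the n_buckets slots, so the SVM's input dimension never
        # has to change; the price is occasional hash collisions.
        vec = {}
        for tok in tokens:
            # md5 only for a hash that is stable across runs, not for security.
            idx = int(hashlib.md5(tok.encode("utf-8")).hexdigest(), 16) % n_buckets
            vec[idx] = vec.get(idx, 0) + 1
        return vec

A sparse index-to-count mapping like this also fits libsvm's sparse input
format fairly naturally.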
> 5. btw one particularly tricky part of dealing with user-trainable dbs,
>    is supporting expiry of old tokens. but that can be deferred until
>    later anyway.

I guess this is a question of implementation :)

Best regards,
Chris