Justin Mason wrote:
> Thanks for doing this! couple of q's:
>
> 1. I can offer a bigger ham/spam corpus if you'd like to test against
>    that as well; corpora from multiple contributors can sometimes
>    expose training set bias.

That would be cool :) Is this corpus already processed by SpamAssassin
(i.e. does it have SA headers)? My PoC code currently mines only the
headers to find out which rules were triggered.
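For concreteness, here is a minimal sketch (Python, with made-up function
and path names, not the actual PoC code) of that kind of header mining: it
pulls the score and the list of triggered rules out of the X-Spam-Status
header. The exact header layout can vary between SA versions and
configurations, so the regexes are only illustrative.

    import email
    import re

    def parse_spam_status(path):
        # Extract the SA score and the triggered rule names from one mail,
        # assuming an X-Spam-Status header roughly of the form
        #   'Yes, score=12.3 required=5.0 tests=RULE_A,RULE_B ...'.
        with open(path, "rb") as fh:
            msg = email.message_from_binary_file(fh)
        status = msg.get("X-Spam-Status", "")
        score_m = re.search(r"score=(-?\d+(?:\.\d+)?)", status)
        tests_m = re.search(r"tests=([A-Z0-9_,\s]+)", status)
        score = float(score_m.group(1)) if score_m else None
        rules = ([r.strip() for r in tests_m.group(1).split(",") if r.strip()]
                 if tests_m else [])
        return score, rules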
> 2. can you test it on spam that scored less than 10 points when it
>    arrived? low-scoring spam is, of course, more useful to hit than
>    stuff that scored highly on the existing rules.

Things like that should be easily possible. I need to check whether I have
enough mails to do a sufficiently reliable test here.
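Selecting that subset would just be a filter over the parsed scores; a
rough sketch building on parse_spam_status() above (the corpus glob and
threshold are illustrative, not part of the PoC):

    import glob

    def low_scoring_spam(corpus_glob="spam/*.eml", threshold=10.0):
        # Yield the paths of spam mails that scored below `threshold`
        # when they originally passed through SpamAssassin.
        for path in glob.glob(corpus_glob):
            score, _rules = parse_spam_status(path)
            if score is not None and score < threshold:
                yield path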
> 3. does it give an indication of confidence in its results? or just a
>    binary "spam"/"ham" decision?

I'm currently working only with a binary classifier. However, libsvm
supports probability estimates and regression (and to my knowledge, most
SVM implementations internally relax the classification output to a real
value and then use its sign to pick the class, so that value can also be
read as a kind of confidence).
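To illustrate the point (using scikit-learn's SVC, which wraps libsvm, on
made-up toy data rather than the actual plugin code): the signed distance
to the separating hyperplane is what the sign-based decision uses, and
probability=True additionally enables libsvm's Platt-scaled probability
estimates. On toy data this small the probabilities are not meaningful,
of course.

    import numpy as np
    from sklearn.svm import SVC

    # Toy data: one row per mail, one binary "rule fired" feature per column.
    X = np.array([[1, 0, 1], [1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 1, 1],
                  [0, 0, 0], [0, 1, 0], [0, 0, 1], [0, 0, 0], [0, 1, 0]])
    y = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])   # 1 = spam, 0 = ham

    clf = SVC(kernel="linear", probability=True).fit(X, y)

    new = np.array([[1, 0, 0]])
    print(clf.decision_function(new))   # signed distance to the hyperplane
    print(clf.predict(new))             # class follows the sign of that value
    print(clf.predict_proba(new))       # Platt-scaled probability estimate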
> 4. hey, if you're writing an SVM plugin, it might be worth making one
>    that _also_ supports body text tokens, similarly to the existing
>    Bayes plugin. ;)

This would surely be possible somehow, but we'd first have to come up with
a good representation of the problem for an SVM. I also wouldn't want to
mix it with the current experiment, since the two represent rather
different data. One of the problems with text tokens is that new ones can
always appear, which would grow the dimension of the problem and hence
require the whole SVM model to be rebuilt, so a system as capable as the
Bayes one might not work directly.
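One standard workaround for that growing-vocabulary problem (not something
I've implemented, purely for illustration) is the "hashing trick": tokens
are hashed into a fixed number of buckets, so never-before-seen tokens
still map into the same fixed-dimensional space, at the cost of occasional
collisions. A rough sketch:

    import hashlib

    def hash_tokens(tokens, n_buckets=2 ** 18):
        # Map an open-ended set of text tokens into a fixed-dimensional,
        # sparse count vector (bucket index -> count). New tokens still land
        # in one of the n_buckets slots, so the SVM's input dimension never
        # has to change; the price is occasional hash collisions.
        vec = {}
        for tok in tokens:
            # md5 only for a hash that is stable across runs, not for security.
            idx = int(hashlib.md5(tok.encode("utf-8")).hexdigest(), 16) % n_buckets
            vec[idx] = vec.get(idx, 0) + 1
        return vec

A sparse index-to-count mapping like this also fits libsvm's sparse input
format fairly naturally.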
> 5. btw one particularly tricky part of dealing with user-trainable dbs,
>    is supporting expiry of old tokens. but that can be deferred until
>    later anyway.

I guess this is a question of implementation :)

Best regards,
Chris