On Thu, Mar 5, 2009 at 11:12, decoder <deco...@own-hero.net> wrote:
> Justin Mason wrote:
>>
>> Thanks for doing this! couple of q's:
>>
>> 1. I can offer a bigger ham/spam corpus if you'd like to test against
>> that as well; corpora from multiple contributors can sometimes expose
>> training set bias.
>>
>
> That would be cool :) Is this corpus already processed by spamassassin
> (i.e. has SA headers)?
>
> My poc code currently mines only the headers to find out what rules are
> triggered.

yep, it is. OK, let me take a look later on tonight and see if I can
make up a tarball for you...

>> 2. can you test it on spam that scored less than 10 points when it
>> arrived? low-scoring spam is, of course, more useful to hit than stuff
>> that scored highly on the existing rules.
>
> Things like that should be possible easily. I need to check if I have
> enough mails to do a sufficiently reliable test here.

cool.

>> 3. does it give an indication of confidence in its results? or just a
>> binary "spam"/"ham" decision?
>
> I'm currently working only with a binary classifier. However, libsvm
> supports probability estimates and regression (and to my knowledge,
> internally, most SVM algorithms relax classification output to real
> values and then use the sign to determine the classification, this can
> also be seen as some sort of confidence value)

yep, that should work.

>> 4. hey, if you're writing an SVM plugin, it might be worth making one
>> that _also_ supports body text tokens, similarly to the existing Bayes
>> plugin. ;)
>>
>
> This would surely be possible somehow, but we'd first have to come up
> with a good representation of the problem for an SVM. I wouldn't want to
> mix this either with the current experiment, as these two things somehow
> represent different data.
>
> One of the problems with text tokens is that there can always be new
> ones (which would increase the dimension of the problem and hence
> require the whole SVM to be remodeled, so, a system as performant as
> bayes might not work directly.)

interesting, I hadn't thought of that angle.

--j.
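
For anyone following the thread, here is a rough sketch of the header-mining idea and the "sign plus magnitude" confidence discussed above. It is not decoder's actual proof-of-concept code. It assumes the mail carries the usual X-Spam-Status header with a tests=... list; the rule vocabulary, weights, and bias are made-up illustration values, and the plain linear decision function only stands in for whatever model libsvm would actually train.

```python
import email
import re


def fired_rules(raw_message):
    """Return the set of rule names listed in the X-Spam-Status header."""
    msg = email.message_from_string(raw_message)
    status = msg.get("X-Spam-Status", "")
    # The tests list may be folded across lines after a comma; unfold it.
    status = re.sub(r",\s+", ",", status)
    match = re.search(r"tests=(\S*)", status)
    if not match:
        return set()
    names = match.group(1).split(",")
    return {n for n in names if n and n.lower() != "none"}


def to_vector(rules, vocabulary):
    """Binary feature vector: 1.0 if the rule fired, else 0.0."""
    return [1.0 if name in rules else 0.0 for name in vocabulary]


def decision_value(weights, bias, x):
    """Real-valued output; its sign is the label, its magnitude a crude confidence."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


# Hypothetical example message and model parameters (illustration only).
raw = (
    "Subject: test\n"
    "X-Spam-Status: Yes, score=12.3 required=5.0 tests=BAYES_99,HTML_MESSAGE,\n"
    "\tRCVD_IN_XBL autolearn=no version=3.2.5\n"
    "\n"
    "body\n"
)
vocabulary = ["BAYES_99", "HTML_MESSAGE", "RCVD_IN_XBL", "ALL_TRUSTED"]
weights = [1.5, 0.2, 2.0, -3.0]
bias = -1.0

x = to_vector(fired_rules(raw), vocabulary)
value = decision_value(weights, bias, x)
label = "spam" if value > 0 else "ham"
print(label, abs(value))
```

Note that the vocabulary is fixed up front here, which is exactly why the body-token case is harder: a token not in the vocabulary has no slot in the vector.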