Re: 2 + 2 != 4 - Spamassassin needs a new paradigm

Justin Mason Thu, 05 Mar 2009 09:27:09 -0800

On Thu, Mar 5, 2009 at 11:12, decoder <deco...@own-hero.net> wrote:
> Justin Mason wrote:
>>
>> Thanks for doing this!  couple of q's:
>>
>> 1. I can offer a bigger ham/spam corpus if you'd like to test against
>> that as well; corpora from multiple contributors can sometimes expose
>> training set bias.
>>
>
> That would be cool :) Is this corpus already processed by spamassassin (i.e.
> has SA headers)?
>
> My poc code currently mines only the headers to find out what rules are
> triggered.


yep, it is.  OK, let me take a look later on tonight and see if I can make
up a tarball for you...
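
For readers following along, here is a minimal sketch of what "mining the
headers" could look like. It is not decoder's PoC code. It assumes the
common X-Spam-Status layout ("score=" or the older "hits=", followed by a
comma-separated "tests=" list that may be folded across lines) and assumes
rule names use only uppercase letters, digits, and underscores. The sample
message and the function name parse_spam_status are made up for
illustration.

```python
import email
import re

# Hypothetical sample message; the folded X-Spam-Status header mimics the
# layout SpamAssassin commonly writes (an assumption, sites can customise it).
SAMPLE = """\
From: sender@example.com
Subject: test
X-Spam-Status: Yes, score=7.3 required=5.0 tests=BAYES_99,HTML_MESSAGE,
\tRDNS_NONE autolearn=no version=3.2.5

body
"""


def parse_spam_status(raw_message):
    """Return (score, [rule names]) taken from the X-Spam-Status header."""
    msg = email.message_from_string(raw_message)
    status = msg.get("X-Spam-Status", "")

    # Older releases wrote "hits=" where newer ones write "score=".
    score_match = re.search(r"(?:score|hits)=(-?\d+(?:\.\d+)?)", status)
    score = float(score_match.group(1)) if score_match else None

    # Capture everything after "tests=" that looks like rule names, commas,
    # or folding whitespace; the match stops at the next lowercase key.
    tests_match = re.search(r"tests=([A-Z0-9_,\s]*)", status)
    rules = []
    if tests_match:
        rules = [r.strip() for r in tests_match.group(1).split(",") if r.strip()]
    return score, rules


if __name__ == "__main__":
    print(parse_spam_status(SAMPLE))
```

The parsed score also makes it straightforward to keep only mails that
scored below some threshold when they arrived.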

>> 2. can you test it on spam that scored less than 10 points when it
>> arrived? low-scoring spam is, of course, more useful to hit than stuff
>> that scored highly on the existing rules.
>
> Things like that should be possible easily. I need to check if I have
> enough mails to do a sufficiently reliable test here.

cool.

>> 3. does it give an indication of confidence in its results? or just a
>> binary "spam"/"ham" decision?
>
> I'm currently working only with a binary classifier. However, libsvm
> supports probability estimates and regression. Also, to my knowledge,
> most SVM algorithms internally relax the classification output to real
> values and then use the sign to determine the classification; that real
> value can also be seen as some sort of confidence value.

yep, that should work.
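
To make the "sign plus real value" idea concrete, here is a small sketch
for the linear case. It is not libsvm's code or API: the weights, the bias,
and the function names are hypothetical, and a real model would learn the
weights from training data. Each feature is 1 if the corresponding rule
fired and 0 otherwise.

```python
def decision_value(weights, bias, features):
    """Real-valued output of a linear classifier: w . x + b."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def classify(weights, bias, features):
    """Return (label, confidence): the sign picks the class, and the
    magnitude of the decision value serves as a rough confidence."""
    value = decision_value(weights, bias, features)
    label = "spam" if value >= 0 else "ham"
    return label, abs(value)


if __name__ == "__main__":
    # Hypothetical weights for three rules; not taken from any trained model.
    weights = [2.0, -1.0, 0.5]
    bias = -0.5
    print(classify(weights, bias, [1, 0, 1]))
    print(classify(weights, bias, [0, 1, 0]))
```

The magnitude is not a calibrated probability; it only says how far a
message lies from the decision boundary.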

>> 4. hey, if you're writing an SVM plugin, it might be worth making one
>> that _also_ supports body text tokens, similarly to the existing Bayes
>> plugin. ;)
>>
>
> This would surely be possible somehow, but we'd first have to come up
> with a good representation of the problem for an SVM. I wouldn't want to
> mix this with the current experiment either, as these two things
> represent different data.
>
> One of the problems with text tokens is that there can always be new
> ones, which would increase the dimension of the problem and hence
> require the whole SVM to be remodeled. So a system as performant as
> Bayes might not work directly.

interesting, I hadn't thought of that angle.
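
A toy illustration of the dimension problem, assuming one vector component
per distinct token (one possible representation, not necessarily the one an
SVM plugin would use). The helper names and sample tokens are hypothetical.
A fixed rule set gives vectors of fixed length, whereas a token vocabulary
grows whenever unseen tokens turn up.

```python
def build_vocabulary(messages):
    """Assign each distinct token the next free vector index."""
    vocab = {}
    for tokens in messages:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab


def vectorize(tokens, vocab):
    """Binary vector with one component per vocabulary entry."""
    vec = [0] * len(vocab)
    for token in tokens:
        if token in vocab:
            vec[vocab[token]] = 1
    return vec


if __name__ == "__main__":
    corpus = [["cheap", "pills"], ["meeting", "notes"]]
    old_vocab = build_vocabulary(corpus)

    # A new message introduces a token the old vocabulary has never seen.
    new_vocab = build_vocabulary(corpus + [["cheap", "v1agra"]])

    print(len(vectorize(["cheap"], old_vocab)))  # length under the old vocabulary
    print(len(vectorize(["cheap"], new_vocab)))  # one component longer
```

A model trained on vectors of the old length has no weight for the extra
component, which is why the new token forces retraining in this
representation.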

--j.
