Justin Mason wrote:
> "Michael 'Moose' Dinn" said:
> 
> 
>>Has anyone taken a huge spam database and sent it through some sort of
>>genetic learning program to see if spam can be identified that way?
>>More of a curiosity thing than anything else.
> 
> 
> Yes, there's a group in Greece who are doing this (slowly -- they don't
> have many people working on it ;).
> 
> It works, but when I contacted them it seemed that SpamAssassin had a
> better hit rate (probably since we have a lot of human ingenuity on the
> case too, instead of just relying on machine learning alone).
> 
> There's also "ifile" for MH users, which (iirc) uses Naive Bayesian
> classification to classify incoming mail to folders automatically.
> 
> One issue those have, is that they need a lot of pre-processing code to
> strip off common formatting, headers, footers etc. so your AI code doesn't
> just start thinking "it came via sf.net lists, therefore it's spam" when
> you're subscribed to spamassassin-sightings, for example. ;)
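
To make the pre-processing point concrete, here is a minimal sketch of stripping list headers and a list footer before the text reaches a classifier. It is not what SpamAssassin or ifile actually does. The header list, the footer rule (a line of 20 or more underscores or dashes), and the name `strip_list_noise` are all assumptions made for illustration, and it only handles single-part plain-text messages.

```python
import re
from email import message_from_string

# Hypothetical set of list-related headers to ignore; tune per mailbox.
LIST_HEADERS = ("List-Id", "List-Post", "List-Unsubscribe", "Sender", "Received")

# Assumed footer convention: a line of 20+ underscores or dashes starts the footer.
FOOTER_RE = re.compile(r"^[_-]{20,}\s*$", re.MULTILINE)


def strip_list_noise(raw):
    """Return the subject line plus body text, minus list headers and footer."""
    msg = message_from_string(raw)
    for name in LIST_HEADERS:
        del msg[name]  # no error if the header is absent
    body = msg.get_payload()  # a str for single-part messages (assumed here)
    body = FOOTER_RE.split(body, maxsplit=1)[0]
    subject = msg.get("Subject", "")
    return f"{subject}\n{body.strip()}"
```

The classifier then only sees text the sender wrote, so it cannot learn "came via the list" as a spam signal.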

The biggest part of integrating Naive Bayes into our code was the 
extremely long (and boring) process of fine-tuning it to find the best 
hit rate. This of course relies on a good data set, which a lot of people 
at universities don't have. You'll also see university papers claiming much 
higher hit rates than machine learning really gives you, because their 
datasets are crappy (e.g. if your real mail all talks about maths 
and your spam doesn't, it's really easy to classify with Naive Bayes).
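
For anyone who hasn't seen it, a minimal sketch of Naive Bayes on word counts follows. This is not our actual code: the class name, the whitespace tokenizer and the add-one smoothing are assumptions chosen to keep it short, and it expects at least one training message per class.

```python
import math
from collections import Counter


class NaiveBayes:
    """Toy two-class (spam/ham) Naive Bayes over whitespace-split words."""

    LABELS = ("spam", "ham")

    def __init__(self):
        self.word_counts = {label: Counter() for label in self.LABELS}
        self.doc_counts = Counter()

    def train(self, text, label):
        self.doc_counts[label] += 1
        self.word_counts[label].update(text.lower().split())

    def log_score(self, text, label):
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total = sum(self.word_counts[label].values())
        # Log prior: fraction of training messages carrying this label.
        score = math.log(self.doc_counts[label] / sum(self.doc_counts.values()))
        for word in text.lower().split():
            # Add-one smoothing so an unseen word never gives log(0).
            count = self.word_counts[label][word]
            score += math.log((count + 1) / (total + len(vocab)))
        return score

    def classify(self, text):
        return max(self.LABELS, key=lambda label: self.log_score(text, label))


nb = NaiveBayes()
nb.train("cheap pills buy now", "spam")
nb.train("buy cheap watches now", "spam")
nb.train("lecture notes on linear algebra", "ham")
nb.train("maths seminar notes attached", "ham")
print(nb.classify("buy cheap pills"))
print(nb.classify("seminar notes on algebra"))
```

The toy training set has exactly the problem described above: the ham is all maths and the spam isn't, so the two vocabularies never overlap and classification is trivially easy.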

Having said that, we do get good results now. I'm going to add bagging 
to it soon, and that should improve things further (I was going to add 
boosting, but it's too complex for my tiny brane).
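
For reference, bagging just means training several classifiers on bootstrap resamples of the training set and taking a majority vote. The sketch below shows the idea only and is not the planned implementation: `train_nb`, `bagged_classifier`, the default of 11 models and the smoothing are all assumptions made for illustration.

```python
import math
import random
from collections import Counter


def train_nb(samples):
    """Train a toy Naive Bayes model; returns a classify(text) function."""
    words = {"spam": Counter(), "ham": Counter()}
    docs = Counter()
    for text, label in samples:
        docs[label] += 1
        words[label].update(text.lower().split())
    vocab_size = len(set(words["spam"]) | set(words["ham"]))

    def classify(text):
        def score(label):
            # Add-one smoothing on the prior and the word counts, so a
            # resample that contains only one class never hits log(0).
            s = math.log((docs[label] + 1) / (len(samples) + 2))
            total = sum(words[label].values())
            for word in text.lower().split():
                s += math.log((words[label][word] + 1) / (total + vocab_size))
            return s

        return max(("spam", "ham"), key=score)

    return classify


def bagged_classifier(samples, n_models=11, seed=0):
    """Train n_models on bootstrap resamples and classify by majority vote."""
    rng = random.Random(seed)
    models = [
        # Bootstrap: draw len(samples) items with replacement.
        train_nb([rng.choice(samples) for _ in samples])
        for _ in range(n_models)
    ]

    def classify(text):
        votes = Counter(model(text) for model in models)
        return votes.most_common(1)[0][0]

    return classify
```

An odd number of models avoids tied votes with two classes.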

Matt.




_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk
