On Sun, 18 Jan 2004 17:41:00 +0100, PieterB <[EMAIL PROTECTED]> writes:

> Hi,
> 
> I have an idea, similar to Scott A Crosby's datamining application.
> I didn't use a datamining/analysis program, but used the Bayes
> database. For example if you use:
> 
>       sa-learn --dump all | grep "^0\.999 *[0-9]*  *0 [0-9]*"
> 
> sa-learn will show all Bayes entries which are clearly a sign of spam
> (score=0.999, zero occurrences in ham).

I'd suggest not limiting yourself to domains that never occur in
ham. False classifications are a fact of life. A domain that occurred
in 50 spams and 1 ham should at least be examined further to see if
that one ham hit is a false classification.

Also, sorting the output by:  (spamhits)/(3*hamhits+1) 
may make it easier to analyze and read.
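
A rough sketch of that sort in Python, assuming each dump line has the
five whitespace-separated fields shown in the quoted output
(probability, spam count, ham count, atime, token). The names
parse_dump_line and rank_tokens are made up for illustration, and the
example.test line is invented sample data.

```python
def parse_dump_line(line):
    """Split one dump line into (prob, nspam, nham, atime, token), or None."""
    parts = line.split(None, 4)
    if len(parts) != 5:
        return None
    prob, nspam, nham, atime, token = parts
    token = token.strip()
    # Skip the database bookkeeping rows that the dump prints alongside tokens.
    if token.startswith("non-token data"):
        return None
    try:
        return float(prob), int(nspam), int(nham), int(atime), token
    except ValueError:
        return None


def rank_tokens(lines):
    """Sort parsed rows by spamhits / (3 * hamhits + 1), highest first."""
    rows = [row for row in map(parse_dump_line, lines) if row is not None]
    return sorted(rows, key=lambda r: r[1] / (3 * r[2] + 1), reverse=True)


if __name__ == "__main__":
    sample = [
        "0.999         36          0 1073851236  www.10cial.biz",
        "0.990         50          1 1074054013  example.test",
        "0.999         58          0 1074283556  U*www.treasurecity.biz.in",
    ]
    for prob, nspam, nham, atime, token in rank_tokens(sample):
        print(f"{nspam / (3 * nham + 1):8.2f} {nspam:6d} {nham:6d}  {token}")
```

With this weighting a token seen in 50 spams and 1 ham scores 12.5, so
it stays on the list for review but sorts below tokens with a clean
ham count.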


> After manually cleaning up the list to remove non-URLs, I have lines like:
> 
> 0.999         36          0 1073851236  www.10cial.biz
> 0.999         49          0 1074054013  www.tupit.info
> 0.999         58          0 1074283556  U*www.treasurecity.biz.in
> 0.999         38          0 1073851236  D*naturalgrowth.us

Excellent suggestion. As a rule, I'd say that datamining to find new
rules is going to find some types of rules a lot faster, and scale
better, than noticing them manually. It's also going to find things
that people would miss.

> I'm thinking of writing a script that can use this information and
> can filter the spam mbox to find the full URL patterns. These URL
> patterns can then be used to write custom rules, or to extend the
> bigevil ruleset.

The full URL patterns might be useful. Spammers can switch domains
faster than they can switch their tracking and hosting software.
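
One possible shape for such a script, as a sketch only: it assumes the
spam corpus is a standard mbox file and that the domain list has
already had any U*/D* token prefixes stripped. The names
urls_for_domains and scan_mbox are hypothetical, and the URL regex is
deliberately loose.

```python
import mailbox
import re
from urllib.parse import urlsplit

# Loose URL matcher: stops at whitespace, quotes, and angle brackets.
URL_RE = re.compile(r"https?://[^\s\"'<>]+", re.IGNORECASE)


def urls_for_domains(text, domains):
    """Return URLs in text whose host is a listed domain or a subdomain of one."""
    wanted = {d.lower() for d in domains}
    hits = []
    for url in URL_RE.findall(text):
        try:
            host = (urlsplit(url).hostname or "").lower()
        except ValueError:
            continue  # malformed URL, e.g. an unbalanced '[' in the host
        if any(host == d or host.endswith("." + d) for d in wanted):
            hits.append(url)
    return hits


def scan_mbox(path, domains):
    """Collect the distinct matching URLs from every text part in an mbox."""
    found = set()
    for msg in mailbox.mbox(path):
        for part in msg.walk():
            if part.get_content_maintype() != "text":
                continue
            payload = part.get_payload(decode=True)
            if payload is None:
                continue
            text = payload.decode("latin-1", "replace")
            found.update(urls_for_domains(text, domains))
    return sorted(found)
```

The path components and query strings that keep recurring across
different domains are the part worth turning into rules.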

Scott


_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk