On Sun, 18 Jan 2004 17:41:00 +0100, PieterB <[EMAIL PROTECTED]> writes:
> Hi,
>
> I have an idea, similar to Scott A Crosby's datamining application.
> I didn't use a datamining/analysis program, but used the Bayes
> database. For example, if you use:
>
>   sa-learn --dump all | grep "^0\.999 *[0-9]* *0 [0-9]*"
>
> sa-learn will show all Bayes entries which are clearly a sign of spam
> (score=0.999, zero occurrences in ham).

I'd suggest not limiting yourself to domains that never occur in ham.
False classifications are a fact of life. A domain that occurred in 50
spams and 1 ham should at least be examined further to see whether it
is a false classification. Also, sorting the output by:

  (spamhits) / (3*hamhits + 1)

may make it easier to analyze and read.

> After manually cleaning up the list for non-URLs, I have lines like:
>
>   0.999 36 0 1073851236 www.10cial.biz
>   0.999 49 0 1074054013 www.tupit.info
>   0.999 58 0 1074283556 U*www.treasurecity.biz.in
>   0.999 38 0 1073851236 D*naturalgrowth.us

Excellent suggestion. As a rule, I'd say that datamining to find new
rules is going to find some types of rules much faster and more
scalably than noticing them by hand. It's also going to find things
that people would miss.

> I'm thinking of writing a script that can use this information and
> can filter the spam mbox to find the full URL patterns. These URL
> patterns can then be used to write custom rules, or to extend the
> bigevil ruleset.

The full URL patterns might be useful. Spammers can switch domains
faster than they can switch their tracking and hosting software.

Scott

_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk
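P.S. A minimal sketch of the suggested ranking, as a Python script rather
than a grep one-liner. It assumes each `sa-learn --dump all` line has the
five-field layout shown in the examples above (probability, spam count,
ham count, atime, token); the sample tokens in `__main__` other than the
two quoted above are hypothetical.

```python
# Rank sa-learn --dump tokens by the suggested spam-affinity
# metric: spamhits / (3*hamhits + 1).  Assumes five whitespace-
# separated fields per line: prob nspam nham atime token.

def rank_tokens(dump_lines):
    """Return (score, token, nspam, nham) tuples, highest score first."""
    ranked = []
    for line in dump_lines:
        fields = line.split()
        if len(fields) != 5:
            continue  # skip headers or malformed lines
        prob, nspam, nham, atime, token = fields
        try:
            nspam, nham = int(nspam), int(nham)
        except ValueError:
            continue  # non-numeric counts: not a token line
        score = nspam / (3 * nham + 1)
        ranked.append((score, token, nspam, nham))
    ranked.sort(reverse=True)
    return ranked

if __name__ == "__main__":
    sample = [
        "0.999 36 0 1073851236 www.10cial.biz",
        "0.999 49 0 1074054013 www.tupit.info",
        "0.987 50 1 1074283556 example-spamdomain.test",  # hypothetical
    ]
    for score, token, nspam, nham in rank_tokens(sample):
        print(f"{score:8.2f} {nspam:4d} {nham:4d} {token}")
```

Note how the 50-spam/1-ham token (score 12.5) sorts below the pure-spam
tokens but still surfaces near the top for manual review, which is the point
of not filtering on zero ham hits.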