On Fri, 5 Dec 2003 10:21:31 -0500, Chris Santerre <[EMAIL PROTECTED]> writes:

> Basically you see the rules now in Alpha order. This is because I cat >> all
> my lists together for the last few months, sorted, and ran uniq. My scripts
> for writing the rules work with 2 formats:
> 
> 1 domain per line
> many domains per line separated by a pipe '|'

Convert everything from the second format into the first format (one
domain per line). You then only need 5 scripts.
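
A minimal sketch of normalizing both input formats to one domain per
line. It is written in Python purely for illustration; the function name
to_one_per_line is made up here, and lower-casing and dropping empty
fields are assumptions rather than anything specified above.

```python
def to_one_per_line(lines):
    """Split pipe-separated lines so each output item is a single domain."""
    domains = []
    for line in lines:
        for field in line.split("|"):
            field = field.strip().lower()
            if field:  # skip blank lines and empty fields such as "a||b"
                domains.append(field)
    return domains


if __name__ == "__main__":
    # Placeholder input mixing both formats.
    sample = ["example.com", "Foo.example|bar.example", ""]
    for domain in to_one_per_line(sample):
        print(domain)
```

Lines that are already in the one-per-line format pass through
unchanged, so the same function can be run over every list.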

1.   Take a bunch of URLs (one per line) and construct SA rules.
2.   Extract domain names from spam and canonicalize them to lower case.
3.   Perform set union.
4.   Perform set difference.
5.   Perform set intersection.
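
Step 2 could look roughly like the following sketch. It assumes that
"domain" means the host part of an http or https URL, which is a
simplification: it returns full host names (www.example.com), not
registered domains. The names URL_HOST and extract_domains are
hypothetical.

```python
import re

# Assumption: only http:// and https:// URLs are of interest.
URL_HOST = re.compile(r"https?://([A-Za-z0-9.-]+)", re.IGNORECASE)


def extract_domains(text):
    """Return the sorted, lower-cased host names found in URLs in text."""
    hosts = set()
    for match in URL_HOST.finditer(text):
        # Lower-case and drop a trailing dot picked up from sentence endings.
        host = match.group(1).lower().strip(".")
        if host:
            hosts.add(host)
    return sorted(hosts)


if __name__ == "__main__":
    message = "Buy at http://WWW.Example.COM/x and https://spam.example.net."
    print("\n".join(extract_domains(message)))
```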

#1 should generate output that applies all of the regexp
optimizations I suggested. #3, #4, and #5 are each about 5 lines of Perl.
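
As a rough illustration of how small the three set scripts are, here is
a sketch in Python rather than Perl. It assumes the files hold one
domain per line; the script name setop.py and the function names
read_set and combine are invented for this example.

```python
import sys


def read_set(path):
    """Load one domain per line into a set, ignoring blank lines."""
    with open(path) as handle:
        return {line.strip().lower() for line in handle if line.strip()}


OPS = {
    "union": lambda a, b: a | b,
    "difference": lambda a, b: a - b,
    "intersection": lambda a, b: a & b,
}


def combine(op, path_a, path_b):
    """Apply the named set operation to two files; return a sorted list."""
    return sorted(OPS[op](read_set(path_a), read_set(path_b)))


if __name__ == "__main__":
    if len(sys.argv) == 4 and sys.argv[1] in OPS:
        print("\n".join(combine(*sys.argv[1:4])))
    else:
        print("usage: setop.py union|difference|intersection A B",
              file=sys.stderr)
```

For example, "python setop.py difference NEW SPAM" would print the
domains in NEW that are not in SPAM.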

> I'm looking to make this better in any way, while still keeping my sanity :)

You have 4 files:

  NEW = New domains to classify.

  SPAM = Known spam-only domains.
  GOOD = Known good domains that are accidentally in spam a lot (yahoo, etc.)
  SOSO = Domains with a few FPs.

We have the property that no domain is in any two of SPAM, GOOD, SOSO.

Now compute the following sets: [1]

X= (NEW-SPAM)-GOOD-SOSO 

X is all new domains that have never been seen. They must be
classified into SPAM, GOOD, or SOSO.
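
The computation of X can be sketched with in-memory sets as below. The
function name unseen and the sample domains are placeholders, not real
classifications.

```python
def unseen(new, spam, good, soso):
    """X = (NEW - SPAM) - GOOD - SOSO: domains not yet in any list."""
    return set(new) - set(spam) - set(good) - set(soso)


if __name__ == "__main__":
    # Placeholder data for illustration only.
    new = {"a.example", "b.example", "c.example", "d.example"}
    spam = {"a.example"}
    good = {"b.example"}
    soso = {"c.example"}
    print(sorted(unseen(new, spam, good, soso)))
```

Only the domains returned here need a human decision; everything else
in NEW already has a classification.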

Let SPAM' GOOD' and SOSO' be the new sets after adding in the new domains.

Now you can report the changes as:

A= GOOD'-GOOD = new domains that were added to good.
B= SPAM'-SPAM = new domains that were added to spam.
C= SOSO'-SOSO = new domains that were added to soso.

D=
 ( SOSO' AND (GOOD + SPAM) ) + 
 ( SPAM' AND (GOOD + SOSO) ) +
 ( GOOD' AND (SOSO + SPAM) )
            = domains that were reclassified.
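
One way to compute A, B, C, and D in a single pass is sketched below.
It assumes each version is held as a dict mapping 'SPAM', 'GOOD', and
'SOSO' to a set of domains; that layout and the name change_report are
choices made for this example.

```python
def change_report(old, new):
    """Return (added, reclassified) for two versions of the three lists.

    old and new each map 'SPAM', 'GOOD', 'SOSO' to a set of domains.
    """
    names = ("GOOD", "SPAM", "SOSO")
    # A, B, C: what each primed set gained over the old one.
    added = {name: new[name] - old[name] for name in names}
    # D: members of a primed set that used to live in a different list.
    reclassified = set()
    for name in names:
        others = set().union(*(old[o] for o in names if o != name))
        reclassified |= new[name] & others
    return added, reclassified


if __name__ == "__main__":
    # Placeholder data: m1.example moves from SOSO to SPAM, s2 is new.
    old = {"SPAM": {"s1.example"}, "GOOD": {"g1.example"},
           "SOSO": {"m1.example"}}
    new = {"SPAM": {"s1.example", "m1.example", "s2.example"},
           "GOOD": {"g1.example"}, "SOSO": set()}
    added, reclassified = change_report(old, new)
    print(added, reclassified)
```

Note that with these definitions a reclassified domain appears both in
D and in the A, B, or C entry of the list it moved into.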

This example only uses two versions, but we could easily scale this to
use versioned filenames, for instance 'SPAM-1.23' for SPAM and
'SPAM-1.24' for SPAM'.

Now, given the classification of the 4 files NEW, GOOD, SPAM, SOSO,
compute X, then classify just those new domains to form the new files
GOOD', SPAM', SOSO'. Publish GOOD', SPAM', SOSO' for use, and also
A,B,C,D as a changes logfile. You can also apply program #1 on SPAM'
and SOSO' to generate a SA rulefile and publish that.
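
For the rulefile step, a deliberately naive sketch follows. It emits one
SpamAssassin 'uri' rule with a plain alternation of escaped domains and
does not attempt the regexp optimizations mentioned earlier. The rule
name SPAM_DOMAINS, the describe text, and the default score are
placeholders chosen for this example.

```python
import re


def make_uri_rule(name, domains, score=1.0):
    """Return the text of one 'uri' rule that matches any listed domain."""
    unique = sorted({d.lower() for d in domains})
    if not unique:
        # An empty alternation would match every URI, so refuse it.
        raise ValueError("no domains given")
    alternation = "|".join(re.escape(d) for d in unique)
    return "\n".join([
        "uri      %s /\\b(?:%s)\\b/i" % (name, alternation),
        "describe %s URI contains a listed domain" % name,
        "score    %s %.1f" % (name, score),
    ])


if __name__ == "__main__":
    # Placeholder domains and rule name.
    print(make_uri_rule("SPAM_DOMAINS", ["b.example", "a.example"]))
```

A long list would probably be split across several rules, and SPAM' and
SOSO' would likely get different scores; both choices are left open
here.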

Most of this is automatable with a shell script.

Scott

[1]

'+' denotes set union.
'-' denotes set difference.
'AND' denotes intersection.


_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk
