Hey there, sorry I didn't respond yet. I was actually figuring out how I
would use these cool scripts. There are some GREAT ideas here, but I might
be working counter to them. Let me explain how I do this, and how I think
updates would be done. This way we can talk about changing it to get better
regex:


Basically you see the rules now in alphabetical order. That's because I
cat >>'d all my lists from the last few months together, sorted them, and ran
uniq. My scripts for writing the rules work with 2 formats (rough sketch of a
reader that handles both follows the list):

1 domain per line
many domains per line, separated by a pipe '|'
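
Just to show what I mean by handling either format, here's a rough Python
sketch (file name and all details are made up, not one of the real scripts):

#!/usr/bin/env python
# Sketch only: read a domain list in either format (1 domain per line,
# or many per line separated by '|') and print it back out as
# 1 domain per line, lowercased, deduped and sorted.
import sys

domains = set()
for line in open(sys.argv[1]):
    line = line.strip()
    if not line:
        continue
    # splitting on '|' covers both formats: a line with no pipe
    # just comes back as a single-element list
    for d in line.split('|'):
        d = d.strip()
        if d:
            domains.add(d.lower())

for d in sorted(domains):
    print(d)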

So whatever scripts we end up with need to be able to deal with that kind of
input. I'm going to hack Alex's code to take a list of 1 domain per line and
convert it to X domains per line with pipes for me. Last time I did that part
by hand!!!! OUCH!
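
I haven't written that hacked version yet, but the conversion itself is
basically this (rough sketch, not Alex's actual code; the chunk size of 15 is
the only part I actually care about):

#!/usr/bin/env python
# Sketch only: take a list of 1 domain per line and glue the domains
# back together N per line, joined with '|', for the rule scripts.
import sys

N = 15  # domains per output line
domains = [line.strip() for line in open(sys.argv[1]) if line.strip()]

for i in range(0, len(domains), N):
    print('|'.join(domains[i:i + N]))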

Ok, so I have a huge list of domains, 1 per line. It INCLUDES the FPs that
have been removed from the rules. I want that, for reasons I'm about to
explain. So here is how I foresee updates being done.

Start with the same process as before: a script pulls out all http: domains
from my spam corpus of the last 5-20 days and strips them down to 1 domain
per line. Now I run a hit-frequency script to see how many times each domain
from the new list appears in the old list. I'm only interested in the ones
with zero hits. Because the old list contains the FPs, it also eliminates
them from the update. It keeps track that way, so I never have to worry
about akaimetech or whatever again.
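
Once both lists are down to 1 domain per line, the zero-hit part is really
just a set difference. Something like this sketch (file names are made up,
and it assumes the old master list is already 1 domain per line):

#!/usr/bin/env python
# Sketch only: print the domains from the new list that have zero hits
# in the old list.  Because the old list still contains the removed FPs,
# those never make it into an update either.
import sys

old = set(line.strip().lower() for line in open(sys.argv[1]) if line.strip())

for line in open(sys.argv[2]):
    d = line.strip().lower()
    if d and d not in old:
        print(d)

(Run as something like: python zero_hits.py old_master.txt new_domains.txt >
update.txt -- those names are just examples.)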

Now, taking the domains with 0 hits, I form a clean new list. I run the
hacked-Alex-code script I've yet to write, to convert the list to 15 domains
per line. Then I run the script that makes the rules and tell it what rule
number to start with (179). Poof, it makes the new rules. I cat >> them onto
the old rules and we are done. However, now they are not in alphabetical
order anymore: the first part of the list is in alpha order, then the update
is in alpha order.
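
For reference, the rule-writing step I'm describing boils down to something
like the sketch below. The rule name, the score, the describe text, and the
fact that it's a uri rule are all just placeholders I made up, not the real
ruleset's conventions:

#!/usr/bin/env python
# Sketch only: turn a file of pipe-separated domains (15 per line) into
# numbered SpamAssassin rules, starting at a given rule number.
import re
import sys

infile = sys.argv[1]
rulenum = int(sys.argv[2])   # e.g. 179 for this update

for line in open(infile):
    line = line.strip()
    if not line:
        continue
    # escape the dots so the domains match literally in the regex
    alts = '|'.join(re.escape(d) for d in line.split('|'))
    name = 'LOCAL_URI_BL_%03d' % rulenum
    print('uri      %s /(?:%s)/i' % (name, alts))
    print('describe %s Spamvertised domain (batch %d)' % (name, rulenum))
    print('score    %s 1.0' % name)
    print()
    rulenum += 1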

Obviously the only way to keep everything alphabetical is to recreate the
entire ruleset from one big list again. I wanted to shy away from that. I
would also have to keep a separate FP list to match against, rather than
leaving the FPs in the one list I check.

I'm looking to make this better in any way, while still keeping my sanity :)


Right now, each rule line of ~15 domains averages 5.75K of memory. That's
about .38K per domain :)

--Chris


