On Wed, 24 Dec 2003 10:59:50 -0500, Chris Santerre <[EMAIL PROTECTED]> writes:
> Updated from last few days. Rules 20-23 have been played with a little. > Attempting to make the ruleset faster. I have some issues with doing the > rules this way, so I'm testing them out. Are you having trouble doing the conversion automatically? I can describe the algorithm to transform the regexps and to find maximum-size prefixes if you (or someone else) wants to implement. I've tried, but my perl knowledge for the datastructure voodoo is a bit lacking, but the correct algorithm will give a new ruleset that will have *identical* results to doing the matches sequentially. The program for the conversion should be about 30-50 lines. The rough algorithm is to put all of the domain names into a trie. A trie is a tree datastructure except that all nodes have a maximum of 257 children --- one child for each possible character, and a magic LEAF indicating the end of each string: f --- o ---- o --- LEAF \ \ | ---- r ---- g ---- e ---- t ---- LEAF | \ \ ---o --- d --- LEAF \ \ \ LEAF \ \u --- n ---- LEAF has foo, forget, food, fo, and fun. For any node, the path from the root to the node is a prefix of every string in the subtree. A recursive traversal over the trie to decompose it into subtrees that contain every leaf is enough. For each subtree, you have a prefix string (the path from the root to the root of the subtree) AND a list of all of the other strings in the subtree (by traversing it to look for LEAF's). From this, you can make one rule. <prefix>(<leaf1>|<leaf2>|<leaf3>|...) However, if the list of strings is too large, it may be better to split the list into several lists and output several rules. A simple recursive decomposition algorithm is: #1 if the current node has 1 child, recurse into it, return what it returns. #2 if the current subtree has <40 LEAF's in it, generate a rule #3 ELSE for each child, recurse into it A more sophisticated algorithm that returns unhandled strings to the parent to aggregate and deal with will avoid making a large number of short rules is: #1 if the current node has 1 child, recurse into it, return what it returns. #2 if the current subtree has 5<x<40 LEAF's in it, generate a rule and return. #3 if the current subtree has x<5 LEAF's in it, return a list of those unhandled strings. #4 ELSE [40<x] for each child, recurse into it. concatenate all of the lists of unhandled strings. if the list of unhandled strings is >40, generate a rule. ELSE return the unhandled strings. If someone implements it, I can help debug. Scott ------------------------------------------------------- This SF.net email is sponsored by: IBM Linux Tutorials. Become an expert in LINUX or just sharpen your skills. Sign up for IBM's Free Linux Tutorials. Learn everything from the bash shell to sys admin. Click now! http://ads.osdn.com/?ad_id=1278&alloc_id=3371&op=click _______________________________________________ Spamassassin-talk mailing list [EMAIL PROTECTED] https://lists.sourceforge.net/lists/listinfo/spamassassin-talk