On Wed, 31 Dec 2003 12:09:20 -0500 (EST), Charles Gregory wrote: > Coded Rule: > BODY RULENAME /a{1,3} s{1,3}t{1,3}r{1,3}i{1,3}n{1,3}g{1,3}/i
Another idea could be to use some less precise text matching. Check the following modules that could all be used for matching: Fuzzy string matching: String::Approx http://search.cpan.org/~jhi/String-Approx-3.23/Approx.pm The Metaphone phonetic word encoding algorithm: Text::TransMetaphone Text::DoubleMetaphone Text::Metaphone http://search.cpan.org/~mschwern/Text-Metaphone-1.96/Metaphone.pm http://search.cpan.org/~maurice/Text-DoubleMetaphone-0.05/DoubleMetaphone.pm http://search.cpan.org/~dyacob/Text-TransMetaphone-0.06/lib/Text/TransMetaphone.pm The older (outdated but still popular) Soundex phonetic encoding algorithm: http://search.cpan.org/~markm/Text-Soundex-3.02/Soundex.pm All of these could help catch misspelled words in soam, and there's probably other implemented algorithyms on CPAN that could be used as well. One thought would be something like this: A rule defines certain words, optionally in a certain order. An exact, a phonetic, a fuzzy and a combined fuzzy-phonetic (or maybe phonetic-fuzzy) match is done for each word in the rule. Also, if the rule is a phrase (that is, it also defines the order of the words), the same matches are done for the phrase. Of course, if we do find an exact match, we won't do the fuzzy matching, etc. Once we get a matche, a score is created. The rule specifies a basic score, wich is modified depending on what kind of match we have. Something like this: Exact match score = score Fuzzy match score = score*fuzzymodifier Phonetic match score = score*phoneticmodifier Fuzzy-phonetic match score = score*fuzzymodifier*phoneticmodifier Phonetic-fuzzy match score = score*fuzzymodifier*phoneticmodifier Some experimentation is of course needed for this. Especially for what we consider a match. A couple of years ago I implemented (in Pascal) a combination of metaphone and fuzzy string matching (as an experiment) in i command line interpreter. It worlked rather well. When matching the users command against the table of commands it gave each match a score. An exact match got a very good, and the less exact it was the worse the score it got. So, a match never returned "true" or "false", it retuned a match score. If something like that is used in SA, the calculations would depend on the match score, and the score ought to be a floating point number ranging from 0 to 1. So an exact match returns a 1, an inexact match returns something less than 1 and greater than 0, a 0 is something that didn't match at all in any of the algorithms. Since the result of this match is between 0 and 1, the above calculation becomes a lot simpler. Like: effective_score = defined_score * match_result; If you download and check my implementation (look below for info), please note that my implementation does the scores differently from my above suggestion. The intepreter implementation starts with a base score of 10000, and then adds and subtracts from this when matching. It then has somethresholds that decides wether it can consider a command found, ambigous or not found. The use of those thresholds involve comparisons of scores bweteen matched commands. This also means that tye maximum amount of points a match can get depends on the length of the command. Obciously one can't just this straight of for SA, but it can still work as a demonstration of the combination of a fuzzy string match and metaphone. If this makes anyone curious, the uncommented source for my parser can be found in: http://vvv.truls.org/pascal/UNITS/PARSER.PAS Look for TParser.C_Parse wich is the method that does the actual fuzzy-phonetic scoring. Some routines it uses are in: http://vvv.truls.org/pascal/UNITS/USTRINGS.PAS http://vvv.truls.org/pascal/UNITS/DMETAPHN.PAS http://vvv.truls.org/pascal/UNITS/DMetaPhn.Dat Is it uses other units and you want to compile it, search for those first in: http://vvv.truls.org/pascal/UNITS/ and then in http://vvv.truls.org/pascal/ An actual, and *swedish*, demo implementation of a command line interpreter using this can be found in http://vvv.truls.org/pascal/PARSER/ and a compiled demonstration in http://vvv.truls.org/pascal/PARSER/t/ where Kommando.Exe is a DOS-binary and Kommand(lnx) is a Linux-binary. As I stated above, this demo is swedish. That means that all the kommands and readme stuff are in swedish. It ought to still be possible to use as a hint as to how good/bad the matching is for people that can't understand what the commands actually means. As the commands doesn't really do anything, it shouldn't matter too much. The command "Debug" tells it to print all matches it found and their scores, as well as show the score bases. The command "Quit" exits. The combined fuzzy+phonetic matching actually works quite well for this purpuse, and I think it should defnitiley be possible to get something similar to work well for spam identification. It does require some experimentation though, and a lot of testing. Regards /Jonas -- Jonas Eckerman, [EMAIL PROTECTED] http://www.fsdb.org/ ------------------------------------------------------- This SF.net email is sponsored by: IBM Linux Tutorials. Become an expert in LINUX or just sharpen your skills. Sign up for IBM's Free Linux Tutorials. Learn everything from the bash shell to sys admin. Click now! http://ads.osdn.com/?ad_id78&alloc_id371&op=click _______________________________________________ Spamassassin-talk mailing list [EMAIL PROTECTED] https://lists.sourceforge.net/lists/listinfo/spamassassin-talk