[SAtalk] Phonetic/fuzzy matching (was: REQUEST TO SPAMASSASSIN AUTHORS)

Jonas Eckerman Wed, 31 Dec 2003 12:51:31 -0800

On Wed, 31 Dec 2003 12:09:20 -0500 (EST), Charles Gregory wrote:

>  Coded Rule:
>  BODY RULENAME /a{1,3} s{1,3}t{1,3}r{1,3}i{1,3}n{1,3}g{1,3}/i


Another idea could be to use some less precise text matching. Check the following 
modules that could all be used for matching:

Fuzzy string matching:
String::Approx
http://search.cpan.org/~jhi/String-Approx-3.23/Approx.pm

The Metaphone phonetic word encoding algorithm:
Text::TransMetaphone
Text::DoubleMetaphone
Text::Metaphone
http://search.cpan.org/~mschwern/Text-Metaphone-1.96/Metaphone.pm
http://search.cpan.org/~maurice/Text-DoubleMetaphone-0.05/DoubleMetaphone.pm
http://search.cpan.org/~dyacob/Text-TransMetaphone-0.06/lib/Text/TransMetaphone.pm

The older (outdated but still popular) Soundex phonetic encoding algorithm:
http://search.cpan.org/~markm/Text-Soundex-3.02/Soundex.pm

All of these could help catch misspelled words in soam, and there's probably other 
implemented algorithyms on CPAN that could be used as well.

One thought would be something like this:

A rule defines certain words, optionally in a certain order.
An exact, a phonetic, a fuzzy and a combined fuzzy-phonetic (or maybe phonetic-fuzzy) 
match is done for each word in the rule.
Also, if the rule is a phrase (that is, it also defines the order of the words), the 
same matches are done for the phrase.

Of course, if we do find an exact match, we won't do the fuzzy matching, etc.

Once we get a matche, a score is created. The rule specifies a basic score, wich is 
modified depending on what kind of match we have. Something like this:

Exact match score =  score
Fuzzy match score = score*fuzzymodifier
Phonetic match score = score*phoneticmodifier
Fuzzy-phonetic match score = score*fuzzymodifier*phoneticmodifier
Phonetic-fuzzy match score = score*fuzzymodifier*phoneticmodifier

Some experimentation is of course needed for this. Especially for what we consider a 
match.

A couple of years ago I implemented (in Pascal) a combination of metaphone and fuzzy 
string matching (as an experiment) in i command line interpreter. It worlked rather 
well. When matching the users command against the table of commands it gave each match 
a score. An exact match got a very good, and the less exact it was the worse the score 
it got. So, a match never returned "true" or "false", it retuned a match score.

If something like that is used in SA, the calculations would depend on the match 
score, and the score ought to be a floating point number ranging from 0 to 1. So an 
exact match returns a 1, an inexact match returns something less than 1 and greater 
than 0, a 0 is something that didn't match at all in any of the algorithms.

Since the result of this match is between 0 and 1, the above calculation becomes a lot 
simpler. Like:

effective_score = defined_score * match_result;

If you download and check my implementation (look below for info), please note that my 
implementation does the scores differently from my above suggestion.

The intepreter implementation starts with a base score of 10000, and then adds and 
subtracts from this when matching. It then has somethresholds that decides wether it 
can consider a command found, ambigous or not found. The use of those thresholds 
involve comparisons of scores bweteen matched commands. This also means that tye 
maximum amount of points a match can get depends on the length of the command.

Obciously one can't just this straight of for SA, but it can still work as a 
demonstration of the combination of a fuzzy string match and metaphone.

If this makes anyone curious, the uncommented source for my parser can be found in:
http://vvv.truls.org/pascal/UNITS/PARSER.PAS
Look for TParser.C_Parse wich is the method that does the actual fuzzy-phonetic 
scoring.
Some routines it uses are in:
http://vvv.truls.org/pascal/UNITS/USTRINGS.PAS
http://vvv.truls.org/pascal/UNITS/DMETAPHN.PAS
http://vvv.truls.org/pascal/UNITS/DMetaPhn.Dat
Is it uses other units and you want to compile it, search for those first in:
http://vvv.truls.org/pascal/UNITS/
and then in
http://vvv.truls.org/pascal/

An actual, and *swedish*, demo implementation of a command line interpreter using this 
can be found in
http://vvv.truls.org/pascal/PARSER/
and a compiled demonstration in
http://vvv.truls.org/pascal/PARSER/t/
where Kommando.Exe is a DOS-binary and Kommand(lnx) is a Linux-binary.

As I stated above, this demo is swedish. That means that all the kommands and readme 
stuff are in swedish. It ought to still be possible to use as a hint as to how 
good/bad the matching is for people that can't understand what the commands actually 
means. As the commands doesn't really do anything, it shouldn't matter too much.

The command "Debug" tells it to print all matches it found and their scores, as well 
as show the score bases. The command "Quit" exits.

The combined fuzzy+phonetic matching actually works quite well for this purpuse, and I 
think it should defnitiley be possible to get something similar to work well for spam 
identification. It does require some experimentation though, and a lot of testing.

Regards
/Jonas
--
Jonas Eckerman, [EMAIL PROTECTED]
http://www.fsdb.org/



-------------------------------------------------------
This SF.net email is sponsored by: IBM Linux Tutorials.
Become an expert in LINUX or just sharpen your skills.  Sign up for IBM's
Free Linux Tutorials.  Learn everything from the bash shell to sys admin.
Click now! http://ads.osdn.com/?ad_id78&alloc_id371&op=click
_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk

[SAtalk] Phonetic/fuzzy matching (was: REQUEST TO SPAMASSASSIN AUTHORS)

Reply via email to