On Thu, 5 Oct 2006, Bowie Bailey wrote:
>
>I would think the problem with computing distances is preventing false
>matches on normal words.  Consider these:
>
>    hunt
>    shot
>    dice
>    fits
>
>These are all a distance of 1 from words you might want to look for.

Heh, good examples.

While that's true, you probably wouldn't score something based on just
one or two hits.  It'd be sort of like weighted Bayes -- each hit
would get scored based on the length of the target word and the
closeness of the hit, and that would create a composite score for the
message.  You would then do SA scoring based on this composite.

So, to use a completely spurious example, let's assume that each word
scores points equal to (length of target / distance).  Then this
message:

"We went on a hunt, and then shot some dice."

Might have three hits: hunt, shot, and dice.  (Although I'm not sure
you'd want to match either shot or dice.)  That would give it a score
of 12 -- 4 + 4 + 4, each a four-letter target at distance 1.

The example message earlier had at least six misspelled words:  orrgy,
pornn, bitchees, annal, orgiees, and ssex.  Taking the targets as
orgy, porn, bitch, anal, orgies, and sex, all at distance 1, that
would give a composite score of 26 (4 + 4 + 5 + 4 + 6 + 3).  Your
breaking point would be somewhere between 12 and 26.
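
Here's a rough Python sketch of that composite idea.  The target list
and the tokenizing are stand-ins I made up for illustration, not
anything SA actually does:

    def levenshtein(a, b):
        """Classic dynamic-programming edit distance."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                  # deletion
                               cur[j - 1] + 1,               # insertion
                               prev[j - 1] + (ca != cb)))    # substitution
            prev = cur
        return prev[-1]

    # Hypothetical target list, drawn from the examples in this thread.
    TARGETS = ["cunt", "shit", "dick", "tits",
               "orgy", "porn", "bitch", "anal", "orgies", "sex"]

    def composite_score(message, max_distance=1):
        score = 0.0
        for word in message.lower().split():
            word = word.strip(".,!?\"'")
            for target in TARGETS:
                d = levenshtein(word, target)
                # Near-misses only; exact matches (d == 0) would need
                # their own rule anyway, and dividing by zero is rude.
                if 0 < d <= max_distance:
                    score += len(target) / d    # the (length / distance) rule
                    break
        return score

Running that on the innocent sample gives 12.0 (hunt/cunt, shot/shit,
dice/dick).  The spam sample won't come out to exactly 26, since the
real edit distances differ from the back-of-the-envelope numbers above
-- "bitchees" is three edits from "bitch", for instance.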

In reality, I think you'd want the length to count superlinearly, so
that a six-letter word scores substantially more than a five-letter
word.  If we did that -- word score = (length of target ^ 2 /
distance) -- then the innocent sample message would score a 48 and the
guilty sample would score a 118, yielding a larger gap
(magnitude-wise).
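
In the sketch above, that's a one-line change to the scoring rule:

    score += len(target) ** 2 / d    # squared length favors longer targets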

Another option would be to use a combination of Levenshtein distance
and an algorithm like metaphone for representing the pronunciation of
a word.  So levenshtein(metaphone("orgy"), metaphone("orrgy")) == 0,
but levenshtein(metaphone("dice"), metaphone("dick")) == 1.
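
A quick sketch of that combination, assuming the third-party
jellyfish library (which provides both metaphone() and
levenshtein_distance()):

    import jellyfish

    def phonetic_distance(a, b):
        # Edit distance between pronunciations rather than spellings.
        return jellyfish.levenshtein_distance(jellyfish.metaphone(a),
                                              jellyfish.metaphone(b))

    print(phonetic_distance("orgy", "orrgy"))    # 0 -- same metaphone code
    print(phonetic_distance("dice", "dick"))     # 1 -- one sound apart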

Chris St. Pierre
Unix Systems Administrator
Nebraska Wesleyan University

