On Thu, 24 Mar 2005, Randy W. Sims wrote:
> The only problem with that is that a dictionary is required for
> it to work because each "symbol" can have multiple translations.

Not only that -- a 'leet word could have multiple possible meanings.
For example, "pwn" ("own") could just be a typo for "pawn".
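
To make the "multiple translations" point concrete, here is a rough
sketch (in Python, purely for brevity). The substitution table and the
word list are made up for illustration and are not taken from any real
'leet dictionary; the only point is that one input can expand to several
perfectly good words.

```python
from itertools import product

# Hypothetical substitution table: a symbol may stand for several letters.
LEET_MAP = {"0": "o", "1": "il", "3": "e", "4": "a", "5": "s", "7": "t"}

# Tiny stand-in for a real dictionary.
WORDS = {"mail", "mall", "own", "pawn"}

def candidates(leet):
    """Return every plain-text spelling the symbols could stand for."""
    options = [LEET_MAP.get(ch, ch) for ch in leet.lower()]
    return {"".join(combo) for combo in product(*options)}

def decode(leet):
    """Keep only the candidates that appear in the dictionary."""
    return sorted(candidates(leet) & WORDS)

print(decode("ma1l"))
```

With this table, "ma1l" decodes to both "mail" and "mall", and nothing in
the word itself says which one was meant.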

Any attempt to get back from a 'leet term to a real word is going to be
extremely prone to false positives and false negatives. You could cheat
by assuming a list of banned words and a list of suspect words, and then
looking for probable correlations between the two sets, but that's
logically backwards: you're starting from the conclusion that every word
is probably banned, then digging through what you find until you get
what you wanted. The false positive rate will be huge with such an
approach, but it's about the only approach that has a chance of working
at all.
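
A minimal sketch of that "cheat", assuming a placeholder banned list and
fuzzy matching via Python's standard difflib module. The banned words,
the digit table, and the 0.7 cutoff are all arbitrary choices of mine for
illustration, not something a real bot is known to use.

```python
import difflib

# Placeholder banned list; a real one would be site-specific.
BANNED = ["darn", "heck"]

# Assumed digit-for-letter substitutions: 0->o, 1->i, 3->e, 4->a, 5->s, 7->t.
DIGITS = str.maketrans("013457", "oieast")

def flag(word):
    """Return the banned word this input most resembles, or None."""
    normalized = word.lower().translate(DIGITS)
    matches = difflib.get_close_matches(normalized, BANNED, n=1, cutoff=0.7)
    return matches[0] if matches else None

for word in ["d4rn", "barn", "hello"]:
    print(word, flag(word))
```

Here flag("d4rn") returns "darn" as intended, but so does flag("barn"):
an innocent word gets caught simply because it sits close to something
on the list. Raising the cutoff lets more disguised words slip through
instead.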

The problem of differentiating between 'leet and conventional English is
very similar to the problem of telling spam apart from "ham" email. In
that case, you can use various approaches that make a decent guess --
Bayesian statistical filters, various hard-wired heuristics, a cocktail
of both approaches, etc. -- but there's *always* going to be some level
of both false negatives (spam or 'leet that gets through) and false
positives (good messages that get blocked). This is unavoidable -- all
you can do is make reasonable attempts to minimize it.
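
The trade-off can be sketched with a handful of invented numbers. Assume
some filter (Bayesian or otherwise) has already given each message a
score between 0 and 1, and that we know the right answer for each one.
The scores and labels below are made up; the function just counts both
kinds of mistake at a given threshold.

```python
# (score, really_leet) pairs; all values are invented for illustration.
SAMPLES = [
    (0.05, False), (0.20, False), (0.45, False), (0.70, False),
    (0.40, True), (0.60, True), (0.90, True), (0.95, True),
]

def error_counts(samples, threshold):
    """Count false positives and false negatives at this threshold."""
    false_pos = sum(1 for score, leet in samples
                    if score >= threshold and not leet)
    false_neg = sum(1 for score, leet in samples
                    if score < threshold and leet)
    return false_pos, false_neg

for threshold in (0.3, 0.5, 0.8):
    print(threshold, error_counts(SAMPLES, threshold))
```

With these samples a threshold of 0.3 gives two false positives and no
false negatives, 0.8 gives the reverse, and 0.5 gives one of each. Moving
the threshold only trades one kind of error for the other.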

Maybe the IRC bot should be hooked up to SpamAssassin :-)

--
Chris Devers