Re: regex for l33t speak

Chris Devers Thu, 24 Mar 2005 20:38:25 -0800

On Thu, 24 Mar 2005, Randy W. Sims wrote:

> The only problem with that is that a dictionary is required for
> it to work because each "symbol" can have multiple translations.


Not only that -- a 'leet word could have multiple possible meanings. 

For example, "pwn" ("own") could just be a typo for "pawn".
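
To make the ambiguity concrete, here is a rough Python sketch of the dictionary lookup Randy describes. The substitution table (LEET) and the tiny word set (WORDS) are made up for illustration; a real attempt would need a much larger table and an actual dictionary file.

```python
from itertools import product

# Hypothetical substitution table: each symbol may stand for several letters.
LEET = {
    "1": ["i", "l"],
    "3": ["e"],
    "4": ["a"],
    "0": ["o"],
    "7": ["t"],
    "p": ["p", "o"],  # covers "pwn" -> "own"
}

# Stand-in for a real dictionary.
WORDS = {"own", "pawn", "ail", "all", "leet"}

def candidates(token):
    """Return every spelling the token could expand to under LEET."""
    options = [LEET.get(ch, [ch]) for ch in token.lower()]
    return {"".join(combo) for combo in product(*options)}

def decode(token):
    """Keep only the candidate spellings found in the dictionary."""
    return sorted(candidates(token) & WORDS)
```

With this table, decode("a1l") returns both "ail" and "all", so the lookup alone cannot pick a winner. decode("pwn") returns only "own"; the "pawn" typo reading is never considered, because typos are not modelled at all.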

Any attempt to get back from a 'leet term to a real word is going to be 
extremely prone to false positives & false negatives. You could cheat 
by assuming a list of banned words and a list of suspect words, then 
looking for probable matches between the two sets, but that's logically 
backwards: you're starting from the conclusion that every word is 
probably banned, then digging through what you find until you get what 
you wanted. The false positive rate will be huge with such an approach, 
but it's about the only approach that has a chance of working at all.
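
A minimal sketch of that banned-list approach, in Python, assuming a hypothetical VARIANTS table of look-alike characters and a single mild banned word. It is only meant to show how quickly the false positives appear, not to be a usable filter.

```python
import re

# Hypothetical map from a plain letter to characters that may replace it.
VARIANTS = {
    "a": "a4@",
    "e": "e3",
    "i": "i1!",
    "l": "l1",
    "o": "o0",
    "s": "s5$",
    "t": "t7",
}

def banned_pattern(word):
    """Build a regex that matches the word with any listed substitutions."""
    parts = []
    for ch in word.lower():
        chars = VARIANTS.get(ch, ch)
        parts.append("[" + re.escape(chars) + "]")
    return re.compile("".join(parts), re.IGNORECASE)

HELL = banned_pattern("hell")

def is_suspect(text):
    """True if the banned word, or a 'leet spelling of it, occurs anywhere."""
    return HELL.search(text) is not None
```

Here is_suspect("h3ll yeah") is true, as intended, but so is is_suspect("open a shell"), because the pattern happily matches inside innocent words. Adding word boundaries removes that particular case and lets through anything glued to other characters, so one kind of error is traded for the other.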

The problem of telling 'leet apart from conventional English is very 
similar to the problem of telling spam apart from "ham" email. In that 
case, you can use various approaches that make a decent guess -- 
Bayesian statistical filters, various hard-wired heuristics, a cocktail 
of both approaches, etc. -- but there's *always* going to be some level 
of both false negatives (spam or 'leet that gets through) and false 
positives (good messages that get blocked). This is unavoidable -- all 
you can do is make reasonable attempts to minimize it.
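
For a feel of the Bayesian route, here is a toy naive Bayes classifier in Python. The function names, the whitespace tokenizing, and the four training lines are all invented for this sketch; real spam filters use far more data and smarter tokenizing.

```python
import math
from collections import Counter

def train(samples):
    """samples: (text, label) pairs. Returns token counts and message counts per label."""
    token_counts = {}
    doc_counts = Counter()
    for text, label in samples:
        doc_counts[label] += 1
        token_counts.setdefault(label, Counter()).update(text.lower().split())
    return token_counts, doc_counts

def classify(text, token_counts, doc_counts):
    """Pick the label with the highest log-probability, using add-one smoothing."""
    vocab = set()
    for counts in token_counts.values():
        vocab.update(counts)
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label, counts in token_counts.items():
        score = math.log(doc_counts[label] / total_docs)
        denom = sum(counts.values()) + len(vocab)
        for tok in text.lower().split():
            score += math.log((counts[tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Invented training data, far too small for real use.
SAMPLES = [
    ("u r pwned n00b", "leet"),
    ("l33t h4x0r pwns u", "leet"),
    ("please read the manual", "plain"),
    ("the regex works fine", "plain"),
]
MODEL = train(SAMPLES)
```

classify("pwned by a n00b", *MODEL) comes out as "leet" and classify("read the regex manual", *MODEL) as "plain". The classifier only ever reports which label is more probable given what it has seen, so some messages will always land on the wrong side.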

Maybe the IRC bot should be hooked up to SpamAssassin :-)



-- 
Chris Devers
