Re: regex for l33t speak

Chris Devers Wed, 23 Mar 2005 18:35:08 -0800

On Wed, 23 Mar 2005, Andrew Gaffney wrote:

> I'm trying to come up with a regex for my IRC bot that detects 1337 
> (in order to kick them from the channel).


For those unfamiliar with 'leet, see here:

    <http://www.microsoft.com/athome/security/children/kidtalk.mspx>
    <http://en.wikipedia.org/wiki/Leetspeak>
    <http://www.straightdope.com/columns/030110.html>

> I can't seem to come up with one that will have few false positives 
> but also work most of the time. Has anyone done something like this 
> before? Does anyone have any suggestions?

I strongly suspect that there is no general solution for this.

The problem is that the set you're trying to match against is completely 
unbounded, and the whole point of 'leet is to be unconventional with 
rules for spelling, grammar, diction, courtesy, etc.

You could go halfway with code to catch the most common terms -- 1337, 
w00t, pr0n, warez, 0\/\/n3d, etc -- but note how dissimilar those are. 

 * One is all numbers, while another is all letters, so they both look 
   like normal text.

 * You could consider a rule to catch ones with mixed numbers & letters, 
   but that would catch legit terms like "perl6", "md5", or "mp3".

 * One mixes in punctuation, so now you have to deal with anywhere that
   alphanumeric characters are adjacent to symbols. Like, for example,
   everywhere you have a comma, a hyphenated-word, or: a period. Nuts!
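
To make the first two points concrete, here is a minimal sketch in Python (the thread itself contains no code, so the rule and the name `mixed_tokens` are my own assumptions for illustration). It flags any token that mixes letters and digits, which is roughly the naive rule described above:

```python
import re

# Hypothetical naive rule: flag any token containing at least one
# letter AND at least one digit. Not a recommended filter, just a demo.
MIXED = re.compile(r"\b(?=\w*[a-z])(?=\w*\d)\w+\b", re.IGNORECASE)

def mixed_tokens(line):
    """Return every token in `line` that mixes letters and digits."""
    return MIXED.findall(line)

print(mixed_tokens("w00t, I got pr0n"))         # leet terms are caught
print(mixed_tokens("use perl6 to md5 my mp3"))  # but so are legit terms
print(mixed_tokens("1337 warez"))               # and these slip through
```

The rule has no way to tell "pr0n" from "mp3", and it never sees "1337" or "warez" at all, since each of those is made of only one character class.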

Ultimately, you can't win. If the users can guess what the matching 
patterns might be -- and remember, this is IRC, so assume that they'll 
talk to each other as they figure things out -- then they can *always* 
come up with text that will get around your filters.

The most reasonable approach is probably to set up some hard-coded rules 
for the most common terms -- see the URLs above for examples -- and some 
very broad rules to warn (but *not* kick) possible offenders, and with 
that have actual human moderators to catch whatever slips through.
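
A rough sketch of that two-tier idea, again in Python and again under assumptions of my own: the term list is just the handful of examples from this message, the broad rule (a digit wedged between letters) is only one possible choice, and the names `KNOWN_TERMS`, `SUSPECT`, and `classify` are hypothetical. What the bot does with each label is left to the channel's policy.

```python
import re

# Tier 1: hard-coded terms, taken from the examples in this message.
# Extend by hand; re.escape keeps the punctuation in 0\/\/n3d literal.
KNOWN_TERMS = ["1337", "w00t", "pr0n", "warez", r"0\/\/n3d"]
KNOWN = re.compile(
    r"(?<!\w)(?:" + "|".join(re.escape(t) for t in KNOWN_TERMS) + r")(?!\w)",
    re.IGNORECASE,
)

# Tier 2: one possible broad rule (an assumption, not a tested heuristic):
# digits sandwiched between letters, as in "th4t" or "n0t".
SUSPECT = re.compile(r"[a-z]\d+[a-z]", re.IGNORECASE)

def classify(line):
    """Label a line 'known', 'warn', or 'ok'."""
    if KNOWN.search(line):
        return "known"   # matched a hard-coded term
    if SUSPECT.search(line):
        return "warn"    # only worth a warning or a moderator's attention
    return "ok"

for line in ["w00t w00t", "th4t is n0t allowed", "i18n support landed",
             "anyone tried perl6 yet?"]:
    print(classify(line), "<-", line)
```

Even this small example misfires: "i18n" earns a warning, and "port 1337" would match the hard-coded list. That is the reason the broad tier should only warn, with a human making the final call.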

Anything more aggressive than that and you're going to be buried in a 
pile of false positives & false negatives... :-/



-- 
Chris Devers
