Paul Johnson wrote:
On Thu, Mar 24, 2005 at 02:25:19AM -0600, Andrew Gaffney wrote:

Randy W. Sims wrote:

Andrew Gaffney wrote:


I'm trying to come up with a regex for my IRC bot that detects 1337 (so the bot can kick the offending users from the channel). I can't seem to come up with one that has few false positives but still works most of the time. Has anyone done something like this before? Does anyone have any suggestions?


Write a converter to translate common "symbols" to the correct letter. If the translated "word" is a valid dictionary word, flag it.


3X@mP1e
3 => E
X => X
@ => A
m => M
P => P
1 => L
e => E

3X@mP1e => EXAMPLE

EXAMPLE is a dictionary word, so 3X@mP1e must be leet since the conversion rules produced meaningful results.

It's not perfect, but should work with very few if any false positives.
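
A minimal sketch of the converter idea, written in Python purely for illustration (the same structure should port to Perl with a hash and `tr`/`s///`). The names `LEET_TO_LETTER`, `DICTIONARY`, `translate` and `looks_like_leet` are hypothetical, the table holds only the symbol mappings listed above, and the tiny word set stands in for a real dictionary. The test token 3X@mP1e is the one spelled out by the character table above. One assumption beyond the description: a token is flagged only if it actually contains a symbol, otherwise every ordinary dictionary word would be flagged too.

```python
# Hypothetical symbol table: only the non-letter mappings from the example.
LEET_TO_LETTER = {"3": "E", "@": "A", "1": "L"}

# Stand-in for a real dictionary (for instance a word list read from a file).
DICTIONARY = {"EXAMPLE", "HELLO"}


def translate(token):
    """Map each known symbol to its letter and uppercase everything else."""
    return "".join(LEET_TO_LETTER.get(ch, ch.upper()) for ch in token)


def looks_like_leet(token):
    """Flag a token only if it used a symbol AND its translation is a word."""
    used_symbol = any(ch in LEET_TO_LETTER for ch in token)
    return used_symbol and translate(token) in DICTIONARY


print(translate("3X@mP1e"))        # EXAMPLE
print(looks_like_leet("3X@mP1e"))  # True
print(looks_like_leet("example"))  # False: no symbol was substituted
```

This one-to-one table is exactly what breaks down once a symbol can stand for more than one letter.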

Thanks for yet another very interesting approach.


Check out Lingua::31337 on CPAN.  That C really does stand for
comprehensive.

It works the other way around, i.e. it converts normal text to 31337, but
you could probably reverse the conversions it uses.  Best of all, it's
written by the founder of this list (hi Casey!) but I don't think it has
ever been plugged here.  It's about time that was remedied.

I'm sure Casey would be happy to accept a patch to add a 313372text
function.

The only problem with that is that a dictionary is required for it to work, because each "symbol" can have multiple translations. Taking info from Wikipedia[1]: a final "s" can be changed to "z" to produce the l33t, but to reverse it you first have to check the word with the "z" left in place, because it might be a genuine "z". Only if that is not a dictionary word do you perform the translation and check for a word ending in "s".


For example, given the l33t word "h4x0rz", an algorithm would have to perform something like the following translations, checking each one until it finds a dictionary entry, if any:

(done by hand and I don't know much about l33t, so...)

h4x0rz
h4x0rs
h4xorz
h4xors
h4xerz
h4xers
h4ck0rz
h4ck0rs
h4ckorz
h4ckors
h4ckerz
h4ckers
h4cks0rz
h4cks0rs
h4cksorz
h4cksors
h4ckserz
h4cksers
hack0rz
hack0rs
hackorz
hackors
hackerz
hackers => BINGO

(More permutations here, but we already found a dictionary word, so we stop.)

The basic algorithm, for anyone who wants to try it, is pretty commonly seen in parsing, so it's relatively straightforward:

scan string till you reach the end of a "word"
check dictionary for the "word"
LOOP:
  back up
  apply conversion(s)
  check dictionary
  repeat until success or no more permutations
END LOOP:
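
One possible way to sketch the loop above, in Python for illustration only. Instead of explicitly backing up and retrying, this version lets `itertools.product` enumerate the permutations lazily, which visits the same candidates in a fixed order and stops at the first dictionary hit. Everything named here (`READINGS`, `DICTIONARY`, `candidates`, `decode`) is hypothetical, and the table covers only the symbols from the "h4x0rz" example. It also assumes digits are never kept literally, since no dictionary word contains them, so the list of candidates is shorter than the hand-made one.

```python
from itertools import product

# Hypothetical table: each symbol maps to the readings to try, in order.
# Where the character may be genuine (e.g. "z"), the literal comes first.
READINGS = {
    "4": ["a"],
    "0": ["o", "e"],
    "x": ["x", "ck", "cks"],
    "z": ["z", "s"],
}

# Stand-in for a real dictionary lookup.
DICTIONARY = {"hackers", "hacks", "zoo"}


def candidates(token):
    """Lazily yield every combination of readings for the token."""
    options = [READINGS.get(ch, [ch]) for ch in token.lower()]
    for combo in product(*options):
        yield "".join(combo)


def decode(token):
    """Return the first candidate found in the dictionary, or None."""
    for word in candidates(token):
        if word in DICTIONARY:
            return word
    return None


print(decode("h4x0rz"))  # hackers
print(decode("zoo"))     # zoo: the literal "z" is tried before "s"
print(decode("qqq"))     # None: no reading is a dictionary word
```

The number of candidates is the product of the reading counts per character, so long tokens with many ambiguous symbols get expensive; a real implementation would probably want to cap the token length or prune early.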


This would probably make a good QotW, or rather the original question would make a good quiz while the above would be one possible solution. So would implementing an efficient dictionary lookup without loading the entire dictionary in memory.
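
For the dictionary part, one approach (among several, and only a sketch) is a binary search by byte offset over a sorted word file, so only a handful of lines are ever read into memory. The function names `word_in_sorted_file` and `_line_at_or_after` are hypothetical. The sketch assumes the file has one word per line and is sorted in plain byte order (for example with `LC_ALL=C sort`). Python is used for illustration only.

```python
import os


def _line_at_or_after(f, pos):
    """Return (start_offset, line) for the first line starting at or after pos."""
    if pos > 0:
        f.seek(pos - 1)
        f.readline()  # finish whichever line contains byte pos - 1
    else:
        f.seek(0)
    start = f.tell()
    return start, f.readline()


def word_in_sorted_file(path, word):
    """Binary-search a byte-sorted, one-word-per-line file for word."""
    target = word.encode("utf-8")
    with open(path, "rb") as f:
        lo, hi = 0, os.path.getsize(path)
        while lo < hi:
            mid = (lo + hi) // 2
            start, line = _line_at_or_after(f, mid)
            if not line or start >= hi:
                hi = mid  # no line starts in [mid, hi)
                continue
            entry = line.rstrip(b"\r\n")
            if entry == target:
                return True
            if entry < target:
                lo = start + len(line)  # target can only start after this line
            else:
                hi = mid  # target can only start before mid
    return False
```

Calling `word_in_sorted_file("/usr/share/dict/words", "hackers")` would be a typical use, assuming such a file exists on the system and is byte-sorted. Each lookup costs a logarithmic number of seeks rather than a scan of the whole file.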


Randy.

1. <http://en.wikipedia.org/wiki/Leetspeak>

