On Thu, Mar 24, 2005 at 02:25:19AM -0600, Andrew Gaffney wrote:
Randy W. Sims wrote:
Andrew Gaffney wrote:
I'm trying to come up with a regex for my IRC bot that detects 1337 (in order to kick them from the channel). I can't seem to come up with one that will have few false positives but also work most of the time. Has anyone done something like this before? Does anyone have any suggestions?
Write a converter to translate common "symbols" to the correct letter. If the translated "word" is a valid dictionary word, flag it.
[EMAIL PROTECTED]
3 => E
X => X
@ => A
m => M
P => P
1 => L
e => E
[EMAIL PROTECTED] => EXAMPLE
EXAMPLE is a dictionary word, so [EMAIL PROTECTED] must be leet since the conversion rules produced meaningful results.
It's not perfect, but should work with very few if any false positives.
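A minimal sketch of the translate-then-look-up idea above, written in Python for illustration. The names `LEET_TO_LETTER`, `DICTIONARY`, `translate` and `is_leet` are hypothetical, the tiny word set stands in for a real dictionary, and the requirement that a word contain at least one symbol is an added assumption so that ordinary words are not flagged.

```python
# Hypothetical one-to-one symbol table, built from the mappings listed above.
LEET_TO_LETTER = {"3": "E", "@": "A", "1": "L"}

# Stand-in for a real dictionary; a bot would load a full word list instead.
DICTIONARY = {"EXAMPLE", "HELLO", "WORLD"}


def translate(word):
    """Replace each known symbol with its letter and upper-case everything else."""
    return "".join(LEET_TO_LETTER.get(ch, ch.upper()) for ch in word)


def is_leet(word):
    """Flag a word only if it contains a symbol AND translates to a dictionary word.

    The "contains a symbol" test is an assumption added here; without it,
    every plain dictionary word would be flagged as well.
    """
    has_symbol = any(ch in LEET_TO_LETTER for ch in word)
    return has_symbol and translate(word) in DICTIONARY
```

With this table, `is_leet("3X@mP1e")` is true because the word translates to `EXAMPLE`, while `is_leet("example")` is false because it contains no symbols.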
Thanks for yet another very interesting approach.
Check out Lingua::31337 on CPAN. That C really does stand for comprehensive.
It works the other way around, i.e. it converts normal text to 31337, but you could probably reverse the conversions it uses. Best of all, it's written by the founder of this list (hi Casey!) but I don't think it has ever been plugged here. It's about time that was remedied.
I'm sure Casey would be happy to accept a patch to add a 313372text function.
The only problem with that is that a dictionary is required for it to work, because each "symbol" can have multiple translations. Taking info from Wikipedia [1]: a final "s" can be changed to "z" to get the l33t form, but to reverse it you have to check the word with the "z" first, because it might be an actual "z". Then, if that is not a dictionary word, perform the translation and check for a word ending in "s".
For example, given the l33t word "h4x0rz", an algorithm would have to perform something like the following translations, checking each one till it finds a dictionary entry if any:
(done by hand and I don't know much about l33t, so...)
h4x0rz   h4x0rs   h4xorz   h4xors   h4xerz   h4xers
h4ck0rz  h4ck0rs  h4ckorz  h4ckors  h4ckerz  h4ckers
h4cks0rz h4cks0rs h4cksorz h4cksors h4ckserz h4cksers
hack0rz  hack0rs  hackorz  hackors  hackerz  hackers => BINGO
(More permutations here, but we already found a dictionary word, so we stop.)
The basic algorithm, for anyone who wants to try it. It's pretty commonly seen in parsing, so it's relatively straightforward:
scan string till you reach the end of a "word"
check dictionary for the "word"
LOOP:
    back up
    apply conversion(s)
    check dictionary
    repeat until success or no more permutations
END LOOP:
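One possible way to sketch that search in Python. Instead of backing up through the string by hand, this version enumerates every combination with `itertools.product`, so the candidates come out in a different order than the hand-made list above and the literal digits are not kept as options. The `CANDIDATES` table, the `FINAL_Z` rule and the tiny `DICTIONARY` are assumptions chosen to cover the "h4x0rz" example, not a complete l33t table.

```python
from itertools import product

# Hypothetical table: each symbol maps to every spelling it might stand for.
# A symbol that can also be a literal character lists itself first.
CANDIDATES = {
    "4": ["a"],
    "0": ["o", "e"],
    "x": ["x", "ck", "cks"],
    "3": ["e"],
    "1": ["l", "i"],
}

# A final "z" may be a real "z" or a disguised "s"; try the "z" first.
FINAL_Z = ["z", "s"]

# Stand-in for a real dictionary lookup.
DICTIONARY = {"hackers", "example"}


def expansions(word):
    """Yield every spelling the word could stand for under CANDIDATES."""
    word = word.lower()
    options = []
    for i, ch in enumerate(word):
        if ch == "z" and i == len(word) - 1:
            options.append(FINAL_Z)
        else:
            options.append(CANDIDATES.get(ch, [ch]))
    for combo in product(*options):
        yield "".join(combo)


def decode(word):
    """Return the first expansion found in the dictionary, or None."""
    for candidate in expansions(word):
        if candidate in DICTIONARY:
            return candidate  # stop at the first hit, as in the hand trace
    return None
```

Here `decode("h4x0rz")` walks through candidates such as `haxorz` and `hackerz` before stopping at `hackers`. The number of candidates grows multiplicatively with each ambiguous symbol, so a real bot would probably want to cap word length or prune early.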
This would probably make a good QotW, or rather the original question would make a good quiz while the above would be one possible solution. So would implementing an efficient dictionary lookup without loading the entire dictionary in memory.
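As a rough illustration of the last point, one way to look a word up without loading the dictionary is a binary search over byte offsets in a sorted word file. This Python sketch assumes the file holds one word per line, is sorted in plain byte order (for example with `LC_ALL=C sort`), and uses the same letter case as the query. The function names are hypothetical.

```python
import os


def _first_line_from(f, pos):
    """Return the first complete line that starts at byte offset >= pos."""
    if pos > 0:
        f.seek(pos - 1)
        f.readline()  # finish the line that contains byte pos - 1
    else:
        f.seek(0)
    return f.readline()  # b"" if no line starts at or after pos


def in_sorted_word_file(path, word):
    """Binary-search a sorted word file on disk; only a few lines are ever read."""
    target = word.encode("utf-8")
    if not target:
        return False
    with open(path, "rb") as f:
        lo, hi = 0, os.path.getsize(path)
        # Find the smallest offset whose following line is >= target.
        while lo < hi:
            mid = (lo + hi) // 2
            line = _first_line_from(f, mid)
            if not line or line.rstrip(b"\r\n") >= target:
                hi = mid
            else:
                lo = mid + 1
        return _first_line_from(f, lo).rstrip(b"\r\n") == target
```

Each lookup costs a number of seeks proportional to the logarithm of the file size, so memory use stays constant however large the word list is.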
Randy.
1. <http://en.wikipedia.org/wiki/Leetspeak>