At 8/24/03 10:40 PM , Brian Ipsen wrote:

>V*l*A*G*R*A F*R*E*E!

I assume you need to check for the V-word ;-) - assuming, that the spammers
might put a space, a star or a '-' between the letters... What would such a
rule look like ??

Here's my take on this issue:


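Characters I've seen used as separators include: space, -, _, =, + and *. So /[ _+=\*\-]/ will catch all of those. The \ before the - is a safety habit: inside [brackets], a - between two characters denotes a range, so something like [=-*] would be read as "everything from = to *". (That particular range runs backwards in ASCII order, so Perl rejects it as an invalid range rather than guessing what you meant.) A - that is the first or last character in the class is taken literally, so the escape isn't strictly required here, but it keeps the class safe if someone later appends another character. Likewise, the \ isn't needed before * inside [brackets], where * is an ordinary character, but it doesn't hurt to include it.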
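A quick way to sanity-check a separator class like this is to run it over a few obfuscated samples. The sketch below uses Python's re module as a stand-in for Perl (an assumption made for easy testing; the class syntax used here behaves the same way in both), and the sample strings are made up for illustration.

```python
import re

# Separator class: space, _, +, =, * and -.
# The hyphen comes last and is escaped, so it can never start a range.
SEPARATOR = re.compile(r"[ _+=*\-]")

# Hypothetical sample subjects, one per separator style.
samples = ["V*I*A*G*R*A", "V-I-A-G-R-A", "V I A G R A", "V_I_A_G_R_A", "VIAGRA"]
for s in samples:
    print(s, "->", len(SEPARATOR.findall(s)), "separators")

# An unescaped hyphen between two characters is a range. A backwards
# range such as =-* is rejected when the pattern is compiled.
try:
    re.compile(r"[=-*]")
    backwards_range_rejected = False
except re.error:
    backwards_range_rejected = True
print("backwards range rejected:", backwards_range_rejected)
```

The last part shows why the position of the hyphen matters: the same three characters are fine as [=*-] but fail to compile as [=-*].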

Alternatively, you could just use \W, which matches "any non-word character", that is, anything except [a-zA-Z0-9_]. Because _ counts as a word character, \W alone won't catch v_i_a_g_r_a, so use [\W_] instead. That way, if they suddenly start using ' or . as a separator, you're already a step ahead of them.
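The difference between \W and [\W_] is easy to see side by side. This is a small sketch in Python rather than Perl (an assumption for testability); re.ASCII is passed so that \W means "anything except [a-zA-Z0-9_]" as described above, since Python's default is Unicode-aware.

```python
import re

# \W alone: anything that is not a letter, digit or underscore.
NON_WORD = re.compile(r"\W", re.ASCII)
# [\W_]: the same set, plus the underscore itself.
NON_WORD_OR_UNDERSCORE = re.compile(r"[\W_]", re.ASCII)

# Hypothetical samples using _, . and ' as separators.
for text in ["v_i_a_g_r_a", "v.i.a.g.r.a", "v'i'a'g'r'a"]:
    plain = len(NON_WORD.findall(text))
    widened = len(NON_WORD_OR_UNDERSCORE.findall(text))
    print(f"{text}: \\W finds {plain}, [\\W_] finds {widened}")
```

Only the underscore sample behaves differently between the two patterns, which is the gap [\W_] closes.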

Spellings of the v-word can involve the "a" being replaced by @ (or maybe even by 4 if they're feeling l337), and the "i" being replaced by !, an l (lowercase ell) or 1 (the numeral one). [a@4] catches the first; [i!1l] catches the second.
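To illustrate the look-alike substitutions on their own, here is a sketch that ignores separators entirely. The class for "a", [a@4], is built directly from the substitutions listed above (a, @ and 4); the case-insensitive flag and the sample spellings are assumptions added for the demo, and Python stands in for Perl.

```python
import re

A_CLASS = r"[a@4]"    # "a" or its look-alikes @ and 4
I_CLASS = r"[i!1l]"   # "i" or its look-alikes !, 1 and lowercase L

# Substitutions only, no separators between letters (demo assumption).
pattern = re.compile(rf"v{I_CLASS}{A_CLASS}gr{A_CLASS}", re.IGNORECASE)

for text in ["viagra", "V!@GR4", "v1agra", "vlagra", "vegra"]:
    print(text, "->", "match" if pattern.search(text) else "no match")
```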

Spammers with really bad spelling might also think it's spelled "viagara". Since that "a" might also get replaced by @ or 4, you need to take that [a@4] part, add the following separator character [\W_], enclose them all in (parens) to group them as one thing, and tack a ? after it to say "this is optional". Might as well also put the ?: marker right after the opening paren to let Perl know it doesn't need to save the contents of that match, so: /(?:[a@4][\W_])?/
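The optional, non-capturing group can be tried in isolation. In this sketch (Python standing in for Perl) the surrounding "g", separator and "r" are hypothetical context added only so the group has something to sit between; the group itself is the [a@4]-plus-separator piece described above.

```python
import re

# Optional extra "a" (or look-alike) followed by one separator, not captured.
OPTIONAL_A = r"(?:[a@4][\W_])?"

# Hypothetical context: "g", one separator, the optional group, then "r".
SEGMENT = re.compile(rf"g[\W_]{OPTIONAL_A}r", re.IGNORECASE)

for text in ["g-r", "g-a-r", "G*@*R", "g-e-r"]:
    print(text, "->", "match" if SEGMENT.search(text) else "no match")

# With (?: ... ) nothing is saved; with plain parens the group is captured.
print(SEGMENT.search("g-a-r").groups())
print(re.search(r"g[\W_]([a@4][\W_])?r", "g-a-r").groups())
```

The two final lines show the practical effect of ?: the match succeeds either way, but only the plain-paren version stores the matched text.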

So a full regex to catch all of these would be:

/[EMAIL PROTECTED](?:[EMAIL PROTECTED])[EMAIL PROTECTED]/
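Since the regex above is partly masked in this archived copy, here is one way the pieces described so far could be assembled. This is a reconstruction under stated assumptions, not the exact regex from the original message: it assumes separators are optional and may repeat ([\W_]*), assumes case-insensitive matching, and uses Python in place of Perl. The helper name is_v_word is hypothetical.

```python
import re

SEP = r"[\W_]*"       # assumption: zero or more separator characters
A_CLASS = r"[a@4]"    # "a", @ or 4
I_CLASS = r"[i!1l]"   # "i", !, 1 or lowercase L

# v, i, a, g, an optional extra "a" (for "viagara"), r, a,
# with optional separators between each letter.
V_WORD = re.compile(
    rf"v{SEP}{I_CLASS}{SEP}{A_CLASS}{SEP}g{SEP}(?:{A_CLASS}{SEP})?r{SEP}{A_CLASS}",
    re.IGNORECASE,
)


def is_v_word(text: str) -> bool:
    """Return True if text contains an obfuscated spelling of the v-word."""
    return V_WORD.search(text) is not None


for subject in ["V*l*A*G*R*A F*R*E*E!", "viagara", "v_i_@_g_r_4", "vinegar"]:
    print(subject, "->", is_v_word(subject))
```

One thing worth checking before deploying something this loose: because a space counts as a separator, an ordinary phrase such as "via grand" also matches, so it is worth running any such rule over a corpus of legitimate mail first.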

I see no reason to anchor this on word boundaries (by putting \b before and after it); I can't imagine any situation in which you'd find this string buried in something longer and it would actually be something valid that you didn't want to tag. (The most famous example of a case where you *do* want word-boundary anchoring is "cunt", found in the name of the town of Scunthorpe in England. If AOL's filters had been using /\bcunt\b/ instead of just /cunt/, users from Scunthorpe wouldn't have been annoyed, and the rest of the world wouldn't have laughed at AOL so much.)
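The effect of word-boundary anchoring in the Scunthorpe case can be shown in a few lines. This is a sketch in Python (standing in for Perl; \b behaves the same way for this purpose), with made-up sample strings.

```python
import re

UNANCHORED = re.compile(r"cunt", re.IGNORECASE)
ANCHORED = re.compile(r"\bcunt\b", re.IGNORECASE)

for text in ["Greetings from Scunthorpe", "you cunt!"]:
    print(
        text,
        "-> unanchored:", bool(UNANCHORED.search(text)),
        "anchored:", bool(ANCHORED.search(text)),
    )
```

The unanchored pattern fires on both strings; the anchored one fires only where the word stands alone.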

Do any of the other regex gurus around here see any problem with that reasoning?

                                                --Kai MacTane
----------------------------------------------------------------------
"There is no faith in which to hide; even truth is filled with lies.
 Doubting angels fall to walk among the living.
 I'm in this mood because of scorn, I'm in a mood for total war.
 To the darkened skies once more, and ever onward!"
                                                --VNV Nation,
                                                 "DarkAngel"


