On Thu, 3 Aug 2006, Coffey, Neal wrote:
I'm trying to create a rule to catch some of the prescription drug
references that come into our system. We're not in pharmaceuticals, so
I'm not too concerned about false positives :)
Some examples of what I'm looking for (using an innocent drug so I don't
trip someone else's filters):
ADVwIL
ADxDVIL
ADxV1L
Advjjl
For what it's worth, I thought all spams of that form were prescription
drug spams, but recently I got one like this as well:
Subject: Re: nunocREjPLICA
OMxEGA
ROxLEX
BRxEITLING
CAxRTIER
BVxLGARI
PAxTEK
TIxFFANY & CO
Or summed up in English: insertion of a random character, the same thing
but with a letter repeated, inserted character and "1" (or "l") instead
of "I", and the recent (and odd) occurrence of "I" replaced with "jj".
I've come up with a rule that'll match every one of those instances, but
also has the unfortunate consequence of matching plain old "ADVIL":
/A[a-z]?A?D[a-z]?D?V[a-z]?V?[Il1j][a-z]?[Il1j]?L[a-z]?L?/
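The false positive is easy to reproduce outside the mail filter. The following is only a rough Python sketch, not part of any real rule set: it assumes the rule is applied case-sensitively with no flags, and the names OPTIONAL_RULE and matches_rule are made up for illustration.

```python
import re

# The rule exactly as written above. Assumption: no /i flag, so the
# match is case-sensitive.
OPTIONAL_RULE = re.compile(
    r"A[a-z]?A?D[a-z]?D?V[a-z]?V?[Il1j][a-z]?[Il1j]?L[a-z]?L?"
)


def matches_rule(text):
    """Return True if the all-optional rule matches anywhere in text."""
    return OPTIONAL_RULE.search(text) is not None


if __name__ == "__main__":
    # Three obfuscated samples from the message, plus the plain word.
    for sample in ["ADVwIL", "ADxDVIL", "ADxV1L", "ADVIL"]:
        print(sample, matches_rule(sample))
```

Every "?" piece is allowed to match nothing, so the regexp collapses to plain A, D, V, [Il1j], L and the unobfuscated word matches along with the obfuscated ones.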
I'm fairly sure there is no sane way to do this with "?"
operators in a regexp: every optional piece can also match
nothing, so the plain word always gets through.
However, there is one obvious way to do it, like this:
/A.DVIL|AD.VIL|ADV.IL|ADVI.L/
Basically, if there is exactly one extra character, then it will
have to occur in one of 4 positions (in a 5-character word),
assuming it doesn't occur at the very beginning or very end.
So, you have 4 possible paths to take through the regexp,
one for each position that the extra character occurs in.
Since the first and last characters of all four branches are
always the same, you can optimize it a tiny bit by factoring
out the common parts of the branches:
/A(?:.DVI|D.VI|DV.I|DVI.)L/
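If you need this for more than one word, the factored form can be generated rather than typed by hand. Here is a minimal sketch in Python, assuming exactly one inserted character at an interior position as described above; the function name one_insertion_pattern is hypothetical, and the result would still need to be pasted into an actual rule.

```python
import re


def one_insertion_pattern(word):
    """Build a regexp source string matching `word` with exactly one
    extra character inserted somewhere between its first and last letter."""
    if len(word) < 2:
        raise ValueError("need at least two characters")
    first, middle, last = word[0], word[1:-1], word[-1]
    # One branch per gap: before, between, and after the middle letters.
    branches = [
        re.escape(middle[:i]) + "." + re.escape(middle[i:])
        for i in range(len(middle) + 1)
    ]
    return re.escape(first) + "(?:" + "|".join(branches) + ")" + re.escape(last)


if __name__ == "__main__":
    pattern = one_insertion_pattern("ADVIL")
    print(pattern)
    for sample in ["ADVwIL", "ADVIL", "ADxDVIL"]:
        print(sample, re.fullmatch(pattern, sample) is not None)
```

For "ADVIL" this produces the same A(?:.DVI|D.VI|DV.I|DVI.)L as above. It matches ADVwIL but not plain ADVIL, and it also does not match ADxDVIL, since that sample has two extra characters rather than one.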
Hope that helps.
- Logan