Re: Looking for advice on rule creation & regular expressions

Logan Shaw Thu, 03 Aug 2006 08:44:08 -0700

On Thu, 3 Aug 2006, Coffey, Neal wrote:

I'm trying to create a rule to catch some of the prescription drug
references that come into our system.  We're not in pharmaceuticals, so
I'm not too concerned about false positives :)


Some examples of what I'm looking for (using an innocent drug so I don't
trip someone else's filters):

        ADVwIL
        ADxDVIL
        ADxV1L
        Advjjl


For what it's worth, I thought all spams of that form were prescription
drug spams, but recently I got one like this as well:

    Subject: Re: nunocREjPLICA

    OMxEGA
    ROxLEX
    BRxEITLING
    CAxRTIER
    BVxLGARI
    PAxTEK
    TIxFFANY & CO

Or summed up in English: insertion of a random character, the same thing
but with a letter repeated, inserted character and "1" (or "l") instead
of "I", and the recent (and odd) occurrence of "I" replaced with "jj".

I've come up with a rule that'll match every one of those instances, but
also has the unfortunate consequence of matching plain old "ADVIL":

        /A[a-z]?A?D[a-z]?D?V[a-z]?V?[Il1j][a-z]?[Il1j]?L[a-z]?L?/


I'm fairly sure there is no sane way to do this with "?" operators
in a regexp: if every inserted character is optional, the plain word
always matches too.

However, there is one obvious way to do it.  Like this:

        /A.DVIL|AD.VIL|ADV.IL|ADVI.L/

Basically, if there is exactly one extra character, then it will
have to occur in one of 4 positions (in a 5-character word),
assuming it doesn't occur at the very beginning or very end.
So, you have 4 possible paths to take through the regexp,
one for each position that the extra character occurs in.
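
If you have more than a handful of words to cover, the branches can be
generated rather than typed by hand.  The following is only a rough
sketch in Python, purely for illustration; the function name
one_extra_char_pattern is made up here, and it assumes the word has at
least two characters and that the extra character never lands at the
very beginning or very end:

```python
import re

def one_extra_char_pattern(word):
    """Build one branch per interior position, with '.' standing in
    for the single inserted character (hypothetical helper)."""
    branches = [
        re.escape(word[:i]) + "." + re.escape(word[i:])
        for i in range(1, len(word))
    ]
    return "|".join(branches)

pattern = one_extra_char_pattern("ADVIL")
print(pattern)                       # A.DVIL|AD.VIL|ADV.IL|ADVI.L

rx = re.compile(pattern)
print(bool(rx.search("ADVwIL")))     # True: one extra character
print(bool(rx.search("ADVIL")))      # False: the plain word is not matched
print(bool(rx.search("ADxV1L")))     # False: "1" for "I" is a different trick
```

Note that this only covers the "exactly one extra character" case; the
repeated-letter and character-substitution variants would still need
their own handling.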

Since the first and last characters of all four branches are
always the same, you can optimize it a tiny bit by factoring
out the common parts of the branches:

        /A(?:.DVI|D.VI|DV.I|DVI.)L/
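
To convince yourself that the factored form accepts the same strings as
the long form, a quick check along these lines may help.  Again this is
just a sketch in Python with a made-up helper name (factored_pattern),
assuming a word of at least two characters:

```python
import re

def factored_pattern(word):
    """Keep the first and last characters outside the group and put the
    '.' at each position of the middle part (hypothetical helper)."""
    first, middle, last = word[0], word[1:-1], word[-1]
    branches = [
        re.escape(middle[:i]) + "." + re.escape(middle[i:])
        for i in range(len(middle) + 1)
    ]
    return re.escape(first) + "(?:" + "|".join(branches) + ")" + re.escape(last)

flat = re.compile(r"A.DVIL|AD.VIL|ADV.IL|ADVI.L")
factored = re.compile(factored_pattern("ADVIL"))
print(factored.pattern)              # A(?:.DVI|D.VI|DV.I|DVI.)L

# Compare both forms on a few hand-picked samples.
samples = ["ADVwIL", "AxDVIL", "ADVIxL", "ADVIL", "ADxV1L"]
agree = all(bool(flat.search(s)) == bool(factored.search(s)) for s in samples)
print(agree)                         # True
```

A handful of samples is not a proof, of course, but it is a cheap way
to catch a typo in one of the branches.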

Hope that helps.

  - Logan
