: I thought I was improving at expressions but this one has me stumped:
: 
: I have text interspersed with numbers.  The text can be anything, including
: all types of punctuation marks.

Well (sound of knuckles cracking), let's see...

I've only been on the list a couple of days, and I've already seen a
couple of questions about regexes matching numbers. I've never found a
slick way to do this (it's too easy to write one that matches an empty
string), so I usually use a more bull-headed approach. We'll use the
extended regular expression syntax, so we can break the regex up over
several lines and include comments:

my $float_re = qr{

          \d+\.\d+  # Matches "2.3"
        | \d+\.     # Matches "2."
        |    \.\d+  # Matches ".2"
        | \d+       # Matches "2"

}x; # "x" means "extended regex syntax"

while (<>) {    # or however you get each chunk of text into $_
        my @numbersInText = /($float_re)/g;
          # Quiz: What does the above line do? ;)
}
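
For anyone who wants to try it, here is a small self-contained sketch of
that match run against a made-up sample string (the string and the @found
name are only for illustration, not from the original question). In list
context, a match with /g hands back every capture it finds:

```perl
use strict;
use warnings;

# Same four alternatives as above, tried in the order listed.
my $float_re = qr{
      \d+\.\d+  # "2.3"
    | \d+\.     # "2."
    |    \.\d+  # ".2"
    | \d+       # "2"
}x;

# Hypothetical sample text, not from the original question.
my $text = "pi is 3.14, half is .5, and 2. is two; 42 too";

# List context + /g: collect every match into an array.
my @found = $text =~ /($float_re)/g;

print join("|", @found), "\n";   # 3.14|.5|2.|42
```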

This will match numbers like those in the comments, checking them in
the order that they're listed. Fine, but oops, we're not picking up the
minus signs. So we'll modify it (and throw in plus signs while we're at
it):

          [-+]? \d+\.\d+
        | [-+]? \d+\.
        | [-+]?    \.\d+
        | [-+]? \d+

(The spaces don't count as part of the regex in extended mode.) Better,
but "2-4" still comes out as "2" and "-4" rather than "2" and "4". So, we
need to make sure that the character just before the minus sign is not a
digit, but we don't want to count it as part of the matched substring.
So, we'll use a zero-length look-behind assertion:

          (?<=\D) [-+]? \d+\.\d+
        | (?<=\D) [-+]? \d+\.
        | (?<=\D) [-+]?    \.\d+
        | (?<=\D) [-+]? \d+

What does "(?<=\D) [-+]? \d+\.\d+" do? It says: Find a non-digit (but
don't include it in the match string), possibly followed by either "-"
or "+", and then a digit-dot-digit pattern. Assertions take a little
getting used to, but they're powerful stuff (I think even C# has them).
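
To see the difference the look-behind makes, here is a rough sketch that
runs the signed regex and the look-behind regex over the same made-up
string. The names $signed_re, $guarded_re and the sample text are
assumptions for the demo only:

```perl
use strict;
use warnings;

# Signed version, no look-behind.
my $signed_re = qr{
      [-+]? \d+\.\d+
    | [-+]? \d+\.
    | [-+]?    \.\d+
    | [-+]? \d+
}x;

# Same alternatives, each guarded by the look-behind.
my $guarded_re = qr{
      (?<=\D) [-+]? \d+\.\d+
    | (?<=\D) [-+]? \d+\.
    | (?<=\D) [-+]?    \.\d+
    | (?<=\D) [-+]? \d+
}x;

# Hypothetical sample text.
my $text = "range 2-4, delta -7";

my @signed  = $text =~ /($signed_re)/g;
my @guarded = $text =~ /($guarded_re)/g;

print "signed:  @signed\n";    # signed:  2 -4 -7
print "guarded: @guarded\n";   # guarded: 2 4 -7
```

The "-" in "2-4" sits right after a digit, so the look-behind rejects it
as a sign, while the "-" in " -7" follows a space and is kept.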

Now that's a lot to type, so we'll do one more trick to clean it up by
factoring out the common stuff and enclosing the rest in (?: ... ),
which groups without leaving anything in a backreference ($1, $2, etc):

my $float_re = qr{
        (?<=\D)       # Look for a non-digit, but don't include it
        [-+]?         # Could have - or +
        (?:
            \d+\.\d+  # Matches "2.3"
          | \d+\.     # Matches "2."
          |    \.\d+  # Matches ".3"
          | \d+       # Matches "2"
        )
}x;

This regex checks against your example. Hope this helps.
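
One caveat that seems worth checking: (?<=\D) needs an actual non-digit
character in front of the number, so a number at the very start of the
string gets missed or cut short. The sketch below compares it with a
negative look-behind, (?<!\d) ("not preceded by a digit"), which is a
substitution made here and not part of the regex above; the sample string
and variable names are made up as well:

```perl
use strict;
use warnings;

# The factored regex from above.
my $float_re = qr{
    (?<=\D) [-+]?
    (?: \d+\.\d+ | \d+\. | \.\d+ | \d+ )
}x;

# Assumed variant: "not preceded by a digit" instead of
# "preceded by a non-digit", so the start of the string qualifies.
my $variant_re = qr{
    (?<!\d) [-+]?
    (?: \d+\.\d+ | \d+\. | \.\d+ | \d+ )
}x;

# Hypothetical sample text that begins with a number.
my $text = "3.14 and 2-4";

my @orig    = $text =~ /($float_re)/g;
my @variant = $text =~ /($variant_re)/g;

print "original: @orig\n";     # original: 14 2 4
print "variant:  @variant\n";  # variant:  3.14 2 4
```

If the text never starts with a number, the two behave the same on these
inputs; otherwise the negative form may be the safer choice.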

--

Tim Kimball · ACDSD / MAST        ¦ 
Space Telescope Science Institute ¦ We are here on Earth to do good to others.
3700 San Martin Drive             ¦ What the others are here for, I don't know.
Baltimore MD 21218 USA            ¦                           -- W.H. Auden
