Re: regex help

Piers Cawley Thu, 15 Nov 2001 01:26:26 -0800

"Brett W. McCoy" <[EMAIL PROTECTED]> writes:

> On Tue, 13 Nov 2001, A. Rivera wrote:
> 
> 
>> I need help finding the most efficient way to do this..
>>
>> I have a variable...
>> $data="this is a test";
>>
>> What is the quickest way to get $data to equal just the first two words of
>> the original variable....
> 
> This splits the string up, extracts the first two words, and joins them
> again and re-assigns to $data:
> 
> $data = join(" ", (split(/\s/, $data))[0..1]);


You may find that
  $data = join('', (split /(\s+)/, $data)[0..2]);

is a little more tolerant of multiple white space than Brett's answer.
Note the trick we use with split to capture the 'actual' whitespace
used to separate the words by putting the split pattern in parentheses.
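
To see what the capturing parentheses change, here is a small sketch
comparing the two forms of split. The sample string and the variable
names are made up for illustration.

```perl
use strict;
use warnings;

my $data = "this   is\ta test";

# Without capturing parentheses the separators are discarded.
my @words = split /\s+/, $data;
# @words is ("this", "is", "a", "test")

# With capturing parentheses each separator comes back as its own field.
my @fields = split /(\s+)/, $data;
# @fields is ("this", "   ", "is", "\t", "a", " ", "test")

# Fields 0..2 are word, separator, word, so joining them on the empty
# string keeps the original spacing between the first two words.
my $first_two = join '', @fields[0..2];
print "[$first_two]\n";    # prints [this   is]
```

That is why the slice is [0..2] rather than [0..1]: the separator
takes up a slot of its own.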

You could also make the change by doing:

  $data =~ s/((?:(?:^|\s+)\S+){2}).*/$1/;

This has the advantage that you can easily change the number of words
you match simply by changing the value in the braces. With {2}, a
string holding fewer than two words does not match and is left
untouched; if you want to catch at most 2 words, you'd have {0,2} in
there.
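
A rough sketch of how the two quantifiers differ in practice. The
helper first_words is a made-up name for this illustration; it applies
the substitution above with whichever quantifier it is given and
reports whether the pattern matched at all.

```perl
use strict;
use warnings;

# Hypothetical helper: run the substitution with the given quantifier
# and return the resulting string plus a flag saying whether it matched.
sub first_words {
    my ($string, $quantifier) = @_;
    my $matched = $string =~ s/((?:(?:^|\s+)\S+)${quantifier}).*/$1/ ? 1 : 0;
    return ($string, $matched);
}

for my $case (["this is a test", "{2}"],
              ["single  ",       "{2}"],
              ["single  ",       "{0,2}"]) {
    my ($result, $matched) = first_words(@$case);
    printf "%-6s on [%s] gives [%s], matched=%d\n",
        $case->[1], $case->[0], $result, $matched;
}
# {2}   on "this is a test" gives "this is"   (matched)
# {2}   on "single  "       gives "single  "  (no match, untouched)
# {0,2} on "single  "       gives "single"    (matched, tail removed)
```

So with {0,2} a short string still matches, and whatever follows the
last word caught (here the trailing spaces) is thrown away.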

I'm not sure which is the fastest; I've not benchmarked it, but it's
generally more important to worry about which is the *clearest*.
Programmer time is far more valuable than processor time.
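
If the speed does matter to you, the core Benchmark module can compare
the candidates. This is only a sketch: the labels and the test string
are my own choices, and no timings are claimed here since they will
vary with the machine and the data.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my $source = "this is a test";

# Each candidate works on a fresh copy of the string and returns the result.
my %candidates = (
    split_join => sub {
        my $data = $source;
        $data = join(" ", (split(/\s/, $data))[0..1]);
        return $data;
    },
    capturing_split => sub {
        my $data = $source;
        $data = join('', (split /(\s+)/, $data)[0..2]);
        return $data;
    },
    substitution => sub {
        my $data = $source;
        $data =~ s/((?:(?:^|\s+)\S+){2}).*/$1/;
        return $data;
    },
);

# A negative count means "run each candidate for at least that many
# CPU seconds"; cmpthese then prints a comparison table.
cmpthese(-1, \%candidates);
```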

So if you are sure that your data will never contain more than one
space between words, go with Brett's solution. If it might have more
than one space between words and you don't mind replacing them with a
single space, go with Brett's solution but replace \s with \s+ in the
split pattern.
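
A quick sketch of why the \s+ matters once the data has more than one
space between words (the sample string is made up):

```perl
use strict;
use warnings;

my $data = "this   is a test";

# Splitting on a single whitespace character yields empty fields
# between consecutive spaces, so field 1 is the empty string.
my $with_single = join(" ", (split(/\s/, $data))[0..1]);
print "[$with_single]\n";    # prints [this ]

# Splitting on a run of whitespace treats the three spaces as one
# separator, and the join replaces them with a single space.
my $with_run = join(" ", (split(/\s+/, $data))[0..1]);
print "[$with_run]\n";       # prints [this is]
```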

If you want to be flexible about data, then go with my solution, but
wrap it in a function like so:

    use Carp;                     # provides croak

    sub truncate_to_n_words {
        my($string, @bounds) = @_;
        croak "Too many bounds" unless scalar @bounds <= 2;
        croak "Not enough bounds" unless scalar @bounds;
        local $" = ','; # makes "@bounds" separate terms with a comma
        $string =~ s{((?:         # Start the capture and the repeated group.
                      (?:^|\s+)   # Start of string or a run of whitespace
                      \S+         # followed by some non-whitespace chars.
                      )           # Match this group
                      {@bounds}   # between the lower and upper bound times
                     )            # and remember it.
                     .*           # Catch the remaining chars
                    }{$1}x;       # and throw them away.

        $_[0] = $string;          # Modify the original string in place.
    }

    sub truncate_to_2_words {
        truncate_to_n_words($_[0], 2);
    }

The idea is that, yes, the regular expression is ugly, but that
ugliness is hidden away behind a well-named function. The code where
you need the behaviour will then look like:

    truncate_to_2_words($string);

That is substantially clearer than any of the one-line solutions.
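
The call needs no assignment because the elements of @_ are aliases
for the caller's arguments, so assigning to $_[0] changes the caller's
variable. A minimal sketch of that mechanism, using a made-up helper
name rather than the function above:

```perl
use strict;
use warnings;

# Hypothetical helper: strips trailing whitespace from the caller's
# variable by working on $_[0] directly.
sub chop_trailing_space {
    $_[0] =~ s/\s+\z//;    # $_[0] is an alias, not a copy
}

my $string = "this is   ";
chop_trailing_space($string);
print "[$string]\n";       # prints [this is]
```

One consequence is that functions written this way should be handed a
variable rather than a string literal, since a literal cannot be
modified.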

Of course, it's slower to run and took longer to write, but every time
you revisit code that makes use of it you'll not have to work out
what's going on.
                   
-- 
Piers

   "It is a truth universally acknowledged that a language in
    possession of a rich syntax must be in need of a rewrite."
         -- Jane Austen?
