Re: extracting substrings from string using regexp

John Doe Mon, 12 Dec 2005 14:11:11 -0800

Owen wrote on Monday, 12 December 2005, 22:10:
> Xavier Noria wrote:
> > On Dec 12, 2005, at 11:10, Alexandre Checinski wrote:
> >> I have a string that looks like this :
> >> <counter id="183268" since="SDOPERFV16" aggr="Sum"
> >> name="pcmTcuFaultOutOfService"/>
> >
> > m// in list contex may help:
> >
> >     my ($id, $name) = $xml =~ m{id="([^"]*)".*name="([^"]*)"/>};
>
> I despair of ever understanding REs


You will, just play around with it.

For example, make a small "quick and dirty" script along these lines:

=start=

use strict;
use warnings;

my $teststring='something to test';

my $ok=$teststring=~m~sts~; # <<< play around here

print $ok ? 'yes, I got it!!!' : 'I despair of ever understanding REs';

=end=

If you think a regex does something, test that assumption with the script above.
Keep some manuals open:
perldoc perlre
perldoc perlretut
perldoc perlrequick

>
> How does the above work
>
> m     Match
> {     inside these braces (as the delimiter?)

No; the {} are used in place of the usual //. That's why the 'm' after '=~' is 
mandatory. The same holds for substitution. Sometimes a regex is more 
readable if something other than '//' is used, for example when matching 
(unix) paths.
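
A small sketch of the delimiter point, to paste into a scratch file. The path
is made up for illustration only:

```perl
use strict;
use warnings;

# Hypothetical path, only for illustration.
my $path = '/usr/local/bin/perl';

# With the default // delimiters every slash in the pattern must be escaped:
my $with_slashes = $path =~ /^\/usr\/local\//;

# With m{...} the same pattern needs no escaping:
my $with_braces = $path =~ m{^/usr/local/};

# Substitution accepts other delimiters as well:
(my $short = $path) =~ s{^/usr/local}{};

print "slashes: $with_slashes, braces: $with_braces, rest: $short\n";
```

Both matches succeed; only the readability of the pattern differs.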

> id="  the characters   id="
> (     Start the capture for $id
> [^"]  The list of characters beginning with "

Not exactly: it is the set of characters *not* matching '"'; that is what the 
caret just after '[' means.

>       But wasn't that done on line 3 where we
>       looked for a  "
> *     any number of characters

(including none)

> )     end of capture for $id
> "     the end  " for the data element captured
> .*    anything until

more precisely: nothing or anything, in "greedy" mode, until

> name="        etc do it all again till
>
> }     ending delimiter
>
> So I have trouble with [^"]*

This means: zero or more characters, none of which is a '"'.
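
To see [^"]* in isolation, something like the following might help. The test
strings are cut down from the sample above; the variable names are mine:

```perl
use strict;
use warnings;

# [^"]* stops at the first double quote, because '"' is the one
# character the class refuses to match.
my ($value) = 'id="183268" since="SDOPERFV16"' =~ m{id="([^"]*)"};
print "value: $value\n";

# '*' also allows zero characters, so an empty attribute still matches
# and the capture is the empty string.
my ($empty) = 'id=""' =~ m{id="([^"]*)"};
print "empty matched, length ", length($empty), "\n";
```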

>
> What words describe that expression please

my ($id, $name) = $xml =~ m{id="([^"]*)".*name="([^"]*)"/>};

Extract two values $id and $name from the string $xml.

Do that by searching for the literal string 'id=';
then look for something between two double quotes, where the text in between 
must not contain a double quote, and capture it into $1;
then skip everything up to the literal string 'name=';
then look for something between two double quotes, where the text in between 
must not contain a double quote, and capture it into $2;
then match a directly following literal string '/>'.
Finally, assign ($1, $2) to the list ($id, $name).
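
The steps above can be tried out like this. I assume here that the sample tag
sits on one line (the line break in the original mail may just be wrapping);
without the /s modifier '.' does not match a newline, so a real line break
between the attributes would make the match fail:

```perl
use strict;
use warnings;

# The sample from the original question, joined onto one line (assumption).
my $xml = '<counter id="183268" since="SDOPERFV16" aggr="Sum" name="pcmTcuFaultOutOfService"/>';

# List context: the match returns the captured groups ($1, $2).
my ($id, $name) = $xml =~ m{id="([^"]*)".*name="([^"]*)"/>};
print "id=$id name=$name\n";

# Scalar context: the same match only reports success or failure.
my $matched = $xml =~ m{id="([^"]*)".*name="([^"]*)"/>};
print "matched=$matched\n";
```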

The regex could be improved a bit, I think:

1. it would be less restrictive to allow spaces around '=' and before '/>'
2. there is a problem with the '.*' in the middle: if there are several tags 
    containing a name attribute, it will match the 'name=' of the last tag 
    containing a name attribute. This is because '.*' is greedy.
3. I'm not sure, but I think there must be a space between an attribute value   
    and the next attribute name
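
Point 2 is easy to check with a made-up string holding two tags on one line
(the tags and variable names are only for illustration):

```perl
use strict;
use warnings;

# Two hypothetical tags on one line; only the first has an id attribute.
my $two = '<counter id="1" name="first"/><counter name="second"/>';

# Greedy '.*' runs to the end of the string and backtracks,
# so it ends up at the LAST name= that lets the rest match.
my ($id_g, $name_g) = $two =~ m{id="([^"]*)".*name="([^"]*)"/>};

# Non-greedy '.*?' expands only as far as needed,
# so it stops at the FIRST name= after the id.
my ($id_l, $name_l) = $two =~ m{id="([^"]*)".*?name="([^"]*)"/>};

print "greedy: $name_g, non-greedy: $name_l\n";
```

The id is the same in both cases; only the name picked up differs.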

This leads to

m{id\s*=\s*"([^"]*)".+?name\s*=\s*"([^"]*)"\s*/>};
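
A quick comparison of the two versions, using a made-up variant of the sample
with spaces around '=' and before '/>' (the input and the variable names are
assumptions for the test):

```perl
use strict;
use warnings;

# Hypothetical variant of the sample with extra whitespace.
my $loose = '<counter id = "183268" aggr="Sum" name = "pcmTcuFaultOutOfService" />';

my $strict_re  = qr{id="([^"]*)".*name="([^"]*)"/>};
my $relaxed_re = qr{id\s*=\s*"([^"]*)".+?name\s*=\s*"([^"]*)"\s*/>};

my @strict  = $loose =~ $strict_re;    # no match: empty list
my @relaxed = $loose =~ $relaxed_re;   # both values captured

print "strict captured ", scalar(@strict), " values\n";
print "relaxed captured: @relaxed\n";
```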

But even this version could be improved 
(e.g. it can't handle escaped double quotes (\") within the 
attribute values. I'm not sure, but I think these are not allowed; still, they 
could be used to trick the regex into doing the wrong thing).
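
Here is a sketch of what happens with such input. Both test strings are made
up. As far as I know, XML has no backslash escape; a double quote inside a
double-quoted attribute value is written as the entity &quot;, which the
regex captures without decoding it:

```perl
use strict;
use warnings;

my $re = qr{id\s*=\s*"([^"]*)"};

# Backslash-escaped quote (hypothetical input): the class stops at the
# first '"', so the value is cut off after the backslash.
my ($cut) = 'id="a\"b"' =~ $re;
print "captured: $cut\n";

# Entity-escaped quote: there is no literal '"' inside the value, so the
# whole value is captured, with the entity left as-is.
my ($raw) = 'id="a&quot;b"' =~ $re;
print "captured: $raw\n";
```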

Somebody please correct me if I'm wrong, thanks; I'm overworked (besides not 
being a guru).

hth, joe

