Rob Dixon wrote:
Vyacheslav Karamov wrote:
I've solved my problem with captures, but I don't understand how to get positions of matches:

my $regex = qr
{
(?i) # Case-insensitive
(
[\x{2022}\*]* # Any number of bullet or asterisk characters
[1-9]+ # One or more digits 1-9
\s* # Any number of spaces
(?:
\-|\,|through|and # Zero or more of: dash, comma, "through" or "and"
)*?
\s* # Any number of spaces
)+
}msx;

my $regex1 = qr
{
(?i) # Case-insensitive
(
(?:
figure | fig[s]?[\.]?? | table | box | chapter | diagram | scheme | chart | plate | appendix | part | section | footnote | [p]{1,2}\.?? | page
)
\s* # Any number of spaces
(?:
[0-9]+ # One or more digits 0-9
\s* # Any number of spaces
(?:
\-|\,|through|and|\s # Zero or more of: dash, comma, "through", "and" or whitespace
)*
\s* # Any number of spaces
)+
)
}msx;

my @vancouverCites =
(
"[4 5, Figure 3; 12 Chapter 4-5]",
"[8, Chapter 10]",
"[9 through 15, pp. 35-46]",
"[11, pp. 35 Through 46]",
"[see 1, 4]",
"[e.g. 2, 5]",
"[e.g. •2, ••5]",
"[e.g. *2, **5]",
"[for example 1,17]",
"[2, 9]",
);

foreach my $c (@vancouverCites)
{
$c =~ s/$regex1//g;
print "Text=\"$c\" ";
my @matches = $c =~ /$regex/g;
foreach my $arr (@matches)
{
print "Array = $arr " if defined $arr;
}
print " pos=$-[0] - $+[0]", "\n";
}


Output:

Text="[4 5, ; 12 ]" Array = 5 Array = 12 pos=12 - 11
Text="[8, ]" Array = 8 pos=5 - 2
Text="[9 through 15, ]" Array = 9 Array = 15 pos=16 - 13
Text="[11, ]" Array = 11 pos=6 - 3
Text="[see 1, 4]" Array = 1 Array = 4 pos=10 - 9
Text="[e.g. 2, 5]" Array = 2 Array = 5 pos=11 - 10
Text="[e.g. •2, ••5]" Array = •2 Array = ••5 pos=14 - 13
Text="[e.g. *2, **5]" Array = *2 Array = **5 pos=14 - 13
Text="[for example 1,17]" Array = 1 Array = 17 pos=18 - 17
Text="[2, 9]" Array = 2 Array = 9 pos=6 - 5

Why are $-[0] and $+[0] *so strange*?

First of all, your code is very difficult to read without proper indentation.
With the luxury of the /x modifier on your regular expressions you can make
things much more legible, like this:

  use charnames ':full';

  my $regex = qr {
    (
      [\N{BULLET}\*]* # Any number of bullet or asterisk characters
      [1-9]+ # One or more digits 1-9
      \s* # Any number of spaces
      (?:
        \-|\,|through|and # Zero or more of: dash, comma, "through" or "and"
      )*?
      \s* # Any number of spaces
    )+
  }msxi;

And the same applies to your loops - they would be very much more readable if
you indented your code properly.

Now the reason for the odd values in @- and @+ is your line

  my @matches = $c =~ /$regex/g;

which, because of the /g modifier in list context, matches the regular
expression as many times as possible. So by definition the last match attempt
is a failure, and the contents of @- and @+ are no longer valid. If you write a
loop like this instead, which processes each match individually, you will see
what you expect.

  foreach my $citation (@vancouverCites) {

    print qq(Text = "$citation"\n);

    while ($citation =~ /$regex/g) {
      print qq(pos = $-[0] to $+[0]\n);
    }
    print "\n";
  }
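
If you want to keep the positions rather than just print them, the same loop
can collect them. This is only a sketch: match_positions is a made-up helper
name, not something from your code, and the pattern in the usage lines is a
simplified stand-in for your $regex.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original code): returns one
# [start, end, text] triple for every match of $regex in $text.
sub match_positions {
    my ($text, $regex) = @_;
    my @found;

    # In scalar context, /g finds one match per iteration and resumes at
    # pos($text), so @- and @+ describe the match that has just succeeded.
    while ($text =~ /$regex/g) {
        push @found, [ $-[0], $+[0], substr($text, $-[0], $+[0] - $-[0]) ];
    }
    return @found;
}

# Simplified stand-in pattern: a run of digits 1-9.
for my $m (match_positions('[see 1, 4]', qr/[1-9]+/)) {
    printf "pos = %d to %d (%s)\n", @$m;
}
# prints:
#   pos = 5 to 6 (1)
#   pos = 8 to 9 (4)
```

The /g matters in the while condition: in scalar context it makes each test
continue from where the previous match ended. Without it, every test would
start again from the beginning of the string and find the same first match, so
the loop would never finish.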

I hope this helps,

Rob

Hi, Rob!

I'm confused about why your sample works with the /g option and doesn't work without it.

P.S. I don't know why, but Thunderbird removes code indentation.
