Re: regular expression help

Rob Dixon Mon, 24 Jul 2006 14:55:29 -0700

Jonathan Weber wrote:
>
> Hi. I have some HTML files with lines like the following:
>
> <a name="w12234"> </a> <h2>A Title</h2>
>
> I'm using a regular expression to find these and capture the name
> attribute ("w12234" in the example) and the contents of the h2 tag ("A
> Title").
>
> $_ =~ /<a name="(w\d+)">\s*<\/a>\s*<h2>(____+)<\/h2>/
>
> That's my regex, except I'm having trouble with the _____ part. No
> matter what I seem to try, it won't match incidences where there's a
> newline somewhere in the string. I tried all manner of things,
> including [.\n], which if I understand correctly should match
> *everything*.
>
> I'm doing this on Windows; does the carriage return/line feed business
> have anything to do with this?


Hi Jonathan.

Some points:

- Inside a character class the wildcard '.' is just a literal dot, so [.\n]
will match only a dot or a newline.
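
A quick way to convince yourself of this is the sketch below. The helper name in_class and the four test characters are made up for illustration only:

```perl
use strict;
use warnings;

# Inside a character class '.' is a literal dot, so [.\n] accepts
# exactly two characters: "." and "\n".
sub in_class {
    my ($char) = @_;
    return $char =~ /\A[.\n]\z/ ? 1 : 0;
}

for my $char ('.', "\n", 'a', ' ') {
    my $label = $char eq "\n" ? '\n' : $char;
    printf "%-5s %s\n", "'$label'", in_class($char) ? 'matches' : 'does not match';
}
```

Only the dot and the newline report a match; the letter and the space do not.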

- The /s modifier will force '.' to match absolutely anything, including a
newline. So you could write:

  $_ =~ /<a name="(w\d+)">\s*<\/a>\s*<h2>(.+)<\/h2>/s;

but that isn't what you want, as /.+/ is greedy and will eat up the rest of the
string as far as the last </h2> it finds. You could get away with /.+?/, but
nicer is /[^<]+/, which will match one or more of any character except for an
open angle bracket (newlines included, so it doesn't need /s).
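
To see the difference between the three, here is a small sketch run against a made-up string holding two headings (the string is an assumption for illustration, not taken from your files):

```perl
use strict;
use warnings;

# Hypothetical input with two headings, to show how far each pattern runs.
my $html = qq{<h2>First</h2>\n<h2>Second\nheading</h2>\n};

my ($greedy)  = $html =~ m#<h2>(.+)</h2>#s;     # runs on to the LAST </h2>
my ($lazy)    = $html =~ m#<h2>(.+?)</h2>#s;    # stops at the first </h2>
my ($negated) = $html =~ m#<h2>([^<]+)</h2>#;   # can never cross a '<'

print "greedy:  [$greedy]\n";
print "lazy:    [$lazy]\n";
print "negated: [$negated]\n";
```

The greedy capture swallows the first closing tag and most of the second heading, while the other two both stop at "First".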

- If you're matching against $_ then you can omit the '$_ =~' altogether:

  /<a name="(w\d+)">\s*<\/a>\s*<h2>(.+)<\/h2>/s;

does the same thing.

- A regex enclosed in slashes is an m// operator with the m left implied (i.e.
/regex/ is the same as m/regex/). Putting the m back lets you use whatever
delimiters you want, so you don't have to escape the contained slashes and the
pattern becomes more readable:

  m#<a name="(w\d+)">\s*</a>\s*<h2>([^<]+)</h2>#;
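
To check that the choice of delimiter is purely cosmetic, this sketch runs the same pattern three ways against the sample line from your message. The m{...} form is just one more possible delimiter, added here for comparison:

```perl
use strict;
use warnings;

my $line = '<a name="w12234"> </a> <h2>A Title</h2>';

# Same pattern, three delimiters; only the first needs escaped slashes.
my @slashes = $line =~ /<a name="(w\d+)">\s*<\/a>\s*<h2>([^<]+)<\/h2>/;
my @hashes  = $line =~ m#<a name="(w\d+)">\s*</a>\s*<h2>([^<]+)</h2>#;
my @braces  = $line =~ m{<a name="(w\d+)">\s*</a>\s*<h2>([^<]+)</h2>};

print join('|', @$_), "\n" for \@slashes, \@hashes, \@braces;
```

All three lines of output are identical: the name attribute and the title, separated by a bar.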

- Regexes aren't the best way of parsing HTML, unless the document is very
simple and predictable. Take a look at something like HTML::TreeBuilder if you're
doing this a lot on varying or non-trivial documents.

- This program does what you want:

  use strict;
  use warnings;

  my $string = <<HTML;
  <a name="w12234"> </a> <h2>A
  Title</h2>
  HTML

  # Only print the captures if the match succeeded
  if ($string =~ m#<a name="(w\d+)">\s*</a>\s*<h2>([^<]+)</h2>#) {
    print $1, "\n";
    print $2, "\n";
  }

OUTPUT

  w12234
  A
  Title
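
If a file holds several of these headings, the same pattern can be applied repeatedly with /g. What follows is only a sketch under some assumptions: the whole file has already been read into one scalar (a pattern applied line by line never sees a heading that spans two lines), the sub name headings and the sample text are invented for illustration, and collapsing whitespace in the title is my own choice. On the Windows question: \s and [^<] both match a carriage return, so CRLF line endings shouldn't stop the match, although the \r will end up in the captured title unless you tidy it.

```perl
use strict;
use warnings;

# Hypothetical document; the second heading is split across CRLF line endings.
my $html = join '',
    qq{<a name="w12234"> </a> <h2>A Title</h2>\n},
    qq{<a name="w12235"> </a>\r\n<h2>Another\r\nTitle</h2>\r\n};

# Returns a list of [name, title] pairs, one per heading found.
sub headings {
    my ($text) = @_;
    my @found;
    while ($text =~ m#<a name="(w\d+)">\s*</a>\s*<h2>([^<]+)</h2>#g) {
        my ($name, $title) = ($1, $2);
        $title =~ s/\s+/ /g;    # collapse CR, LF and runs of spaces
        push @found, [$name, $title];
    }
    return @found;
}

print "$_->[0]: $_->[1]\n" for headings($html);
```

This prints one "name: title" line per heading, with the line break inside the second title reduced to a single space.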


I hope this helps.

Rob

