Jonathan Weber wrote:
>
> Hi. I have some HTML files with lines like the following:
>
> <a name="w12234"> </a> <h2>A Title</h2>
>
> I'm using a regular expression to find these and capture the name
> attribute ("w12234" in the example) and the contents of the h2 tag ("A
> Title").
>
> $_ =~ /<a name="(w\d+)">\s*<\/a>\s*<h2>(____+)<\/h2>/
>
> That's my regex, except I'm having trouble with the _____ part. No
> matter what I seem to try, it won't match incidences where there's a
> newline somewhere in the string. I tried all manner of things,
> including [.\n], which if I understand correctly should match
> *everything*.
>
> I'm doing this on Windows; does the carriage return/line feed business
> have anything to do with this?
Hi Jonathan.

Some points:

- The character wildcard '.' is just a literal dot within a character
  class, so [.\n] will match only a dot or a newline.

- The /s modifier will force '.' to match absolutely anything, including
  a newline. So you could write:

    $_ =~ /<a name="(w\d+)">\s*<\/a>\s*<h2>(.+)<\/h2>/s;

  but that isn't what you want, as /.+/ will eat up all of the rest of
  the string until the last </h2> it finds. You could get away with
  /.+?/, but nicer is /[^<]+/, which will match one or more of any
  character except an open angle bracket (newlines included).

- If you're matching against $_ then you can omit it altogether:

    /<a name="(w\d+)">\s*<\/a>\s*<h2>(.+)<\/h2>/s;

  does the same thing.

- Enclosing a regex in slashes lets you leave off the m of the m//
  operator, which is what you have done (i.e. /regex/ is the same as
  m/regex/). Putting the m back lets you use whatever delimiters you
  want, so you don't have to escape the contained slashes and can make
  it more readable:

    m#<a name="(w\d+)">\s*</a>\s*<h2>([^<]+)</h2>#;

- Regexes aren't the best way of parsing HTML, unless the document is
  very simple and predictable. Take a look at something like
  HTML::TreeBuilder if you're doing this a lot on varying or non-trivial
  documents.

- This program does what you want:

    use strict;
    use warnings;

    my $string = <<HTML;
    <a name="w12234"> </a> <h2>A Title</h2>
    HTML

    # Only use $1 and $2 if the match actually succeeded
    if ($string =~ m#<a name="(w\d+)">\s*</a>\s*<h2>([^<]+)</h2>#) {
        print $1, "\n";
        print $2, "\n";
    }

  OUTPUT

    w12234
    A Title

I hope this helps.

Rob
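
On the Windows line-ending question, here is a minimal sketch showing that the negated character class is not bothered by carriage returns or line feeds: \s matches both "\r" and "\n", and [^<] matches anything that is not an open angle bracket. The helper name extract_heading and the sample strings are made up for illustration and are not from the original post.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): returns the anchor name
# and the heading text, or an empty list when the pattern does not match.
sub extract_heading {
    my ($html) = @_;
    if ($html =~ m#<a name="(w\d+)">\s*</a>\s*<h2>([^<]+)</h2>#) {
        return ($1, $2);
    }
    return;
}

# Made-up sample with Windows-style CRLF line endings, including one
# inside the heading text itself.
my $crlf = qq{<a name="w12234">\r\n</a>\r\n<h2>A\r\nTitle</h2>};

my ($name, $title) = extract_heading($crlf);
$title =~ s/\s+/ /g;    # collapse the embedded line break for display
print "$name\n$title\n";

# Greedy .+ with /s runs on to the LAST </h2> in the string, whereas
# [^<]+ stops at the first open angle bracket.
my $two = qq{<h2>One</h2>\n<h2>Two</h2>};
my ($greedy)  = $two =~ m#<h2>(.+)</h2>#s;
my ($negated) = $two =~ m#<h2>([^<]+)</h2>#;
print "greedy spans both headings\n" if $greedy =~ m#</h2>#;
print "negated class captured: $negated\n";
```

With the sample strings above, the first match should yield "w12234" and "A Title", and the negated class should capture just "One" while the greedy version swallows both headings.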