R (Chandra) Chandrasekhar wrote:
Dear Folks,
I am trying to construct a regular expression to extract strings having
the structure
<a href="http://...">
from HTML files, as part of learning regexes. I have used the script
below to do this:
----------
#!/usr/bin/perl
use warnings;
use diagnostics;
use strict;
# Detect and print out all instances of
# <a href="http://..">
# in any file that is input
my ($fh, $file);
# Optionally slurp the file as paragraphs or as a single record
# $/ = "";
# $/ = undef;
while (@ARGV)
{
$file = shift;
open $fh, '<', $file or die "Cannot open $file: $!\n";
while (<$fh>)
{
if (m|(<a\s*href="\s*http://.*?">)|sg) {print "$1\n";}
}
close $fh;
}
print "Finished\n";
----------
It seems to work OK on most links in the files I have tried out, but not
on one particular href in a valid HTML file, which I have extracted below:
----------
Retrieved 29 June 2006 from <tt><a href="
http://www.oxfordreference.com/views/ENTRY.html?subview=Main&entry=t124.e024%
0">
You are examining one line at a time, while that string spans
multiple lines, so no single line ever contains the whole tag.
Can someone please tell me what I have done wrong and what I must do to
extract the above href automatically?
Slurp the respective file into a scalar variable:
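{
local $/;   # undefine the input record separator: <$fh> now reads the whole file
my $content = <$fh>;
# /g finds every match, /i ignores case, /s lets "." match newlines
print "$1\n" while $content =~ m|(<a\s*href="\s*http://.*?">)|gis;
}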
(I also added the /i modifier.)
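
For completeness, here is a minimal sketch of how the slurping approach
might be folded back into a script that loops over the files named on the
command line. The pattern is the same one as above; the sub names
extract_anchor_tags and slurp_file are only illustrative choices, not
anything from the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return every <a href="http://..."> opening tag found in $content.
# In list context, m//g with one capture group returns all captures.
# /i ignores case, /s lets "." match newlines inside the tag.
sub extract_anchor_tags {
    my ($content) = @_;
    return $content =~ m|(<a\s*href="\s*http://.*?">)|gis;
}

# Read a whole file into one string (hypothetical helper).
sub slurp_file {
    my ($file) = @_;
    open my $fh, '<', $file or die "Cannot open $file: $!\n";
    local $/;    # undefined record separator: read to end of file
    my $content = <$fh>;
    close $fh;
    return $content;
}

# Print the matching tags of each file given on the command line.
for my $file (@ARGV) {
    print "$_\n" for extract_anchor_tags(slurp_file($file));
}
```

Because "local $/" is confined to slurp_file, the rest of the program
keeps the normal line-by-line behaviour of <$fh>.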
Furthermore, please read the FAQ entry "perldoc -q URL".
--
Gunnar Hjalmarsson
Email: http://www.gunnar.cc/cgi-bin/contact.pl