Re: Regular expression for extracting hrefs from HTML file

Gunnar Hjalmarsson Sun, 03 Feb 2008 18:03:13 -0800

R (Chandra) Chandrasekhar wrote:

Dear Folks,
I am trying to construct a regular expression to extract strings having the structure
<a href="http://...">
from HTML files, as part of learning regexes. I have used the script below to do this:
----------
#!/usr/bin/perl
use warnings;
use diagnostics;
use strict;

# Detect and print out all instances of
# <a href="http://..">
# in any file that is input

my ($fh, $file);

# Optionally read the file in paragraph mode or slurp it as a single record
# $/ = "";
# $/ = undef;

while (@ARGV)
    {
    $file = shift;
    open $fh, '<', $file or die "Cannot open $file: $!\n";
    while (<$fh>)
        {
        if (m|(<a\s*href="\s*http://.*?">)|sg) {print "$1\n";}
        }
    close $fh;
    }
print "Finished\n";
----------
It seems to work OK on most links in the files I have tried out, but not on one particular href in a valid HTML file, which I have extracted below:
----------
Retrieved 29 June 2006 from <tt><a href="
http://www.oxfordreference.com/views/ENTRY.html?subview=Main&entry=t124.e024%0">

You are examining one line at a time, while that string spans multiple lines.

Can someone please tell me what I have done wrong and what I must do to extract the above href automatically?


Slurp each file into a scalar variable:

  {
    local $/;
    my $content = <$fh>;
    print "$1\n" while $content =~ m|(<a\s*href="\s*http://.*?">)|gis;
  }

(I also added the /i modifier.)
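
To see the difference between the two approaches side by side, here is a minimal, self-contained sketch. It is only an illustration: the sample string in $html and the example.com URL are made up, and the input is read through an in-memory filehandle rather than a real file. The regex is the one from the script above, with /i added.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample input: the href value starts on the line after href="
my $html = qq{Retrieved from <tt><a href="\nhttp://www.example.com/page.html">\n};

# Same pattern as in the script; /s lets . match newlines, /i ignores case
my $re = qr|(<a\s*href="\s*http://.*?">)|is;

# Approach 1: one line at a time. No single line holds the whole tag.
open my $fh, '<', \$html or die "Cannot open in-memory file: $!\n";
my @by_line;
while (my $line = <$fh>) {
    push @by_line, $line =~ /$re/g;
}
close $fh;

# Approach 2: slurp. With $/ undefined, <$fh> returns the whole file at once.
open $fh, '<', \$html or die "Cannot open in-memory file: $!\n";
my $content = do { local $/; <$fh> };
close $fh;
my @slurped = $content =~ /$re/g;

printf "line by line: %d match(es)\n", scalar @by_line;
printf "slurped:      %d match(es)\n", scalar @slurped;
```

The /s modifier alone does not help in the first loop, because each line is matched separately and the pattern never sees the newline between href=" and the URL.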

Furthermore, please read the FAQ entry "perldoc -q URL".

--
Gunnar Hjalmarsson
Email: http://www.gunnar.cc/cgi-bin/contact.pl

