Dear Folks,

I am trying to construct a regular expression to extract strings having the structure

<a href="http://...">

from HTML files, as part of learning regexes. I have used the script below to do this:
----------
#!/usr/bin/perl
use warnings;
use diagnostics;
use strict;

# Detect and print out all instances of
# <a href="http://..">
# in any file that is input

my ($fh, $file);

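# Optionally read the file in paragraph mode or slurp it as a single record
# $/ = "";      # paragraph mode: records are separated by blank lines
# $/ = undef;   # slurp mode: the whole file is one record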

while (@ARGV)
    {
    $file = shift;
    open $fh, '<', $file or die "Cannot open $file: $!\n";
    while (<$fh>)
        {
        if (m|(<a\s*href="\s*http://.*?">)|sg) {print "$1\n";}
        }
    close $fh;
    }
print "Finished\n";
----------

It seems to work OK on most links in the files I have tried out, but not on one particular href in a valid HTML file, which I have extracted below:
----------
Retrieved 29 June 2006 from <tt><a href="
  http://www.oxfordreference.com/views/ENTRY.html?subview=Main&entry=t124.e024%
0">
  http://www.oxfordreference.com/views/ENTRY.html?subview=Main&amp;entry=t124.e024%
0</a></tt>
----------
(The HTML file from which this fragment was extracted was converted automatically from LaTeX, and it validates as HTML 4.0 Transitional. It also renders correctly in a browser.)

The octal dump for this input file is:
----------
0000000   R   e   t   r   i   e   v   e   d       2   9       J   u   n
0000020   e       2   0   0   6       f   r   o   m       <   t   t   >
0000040   <   a       h   r   e   f   =   "  \n           h   t   t   p
0000060   :   /   /   w   w   w   .   o   x   f   o   r   d   r   e   f
0000100   e   r   e   n   c   e   .   c   o   m   /   v   i   e   w   s
0000120   /   E   N   T   R   Y   .   h   t   m   l   ?   s   u   b   v
0000140   i   e   w   =   M   a   i   n   &   e   n   t   r   y   =   t
0000160   1   2   4   .   e   0   2   4   %  \n   0   "   >  \n
0000200   h   t   t   p   :   /   /   w   w   w   .   o   x   f   o   r
0000220   d   r   e   f   e   r   e   n   c   e   .   c   o   m   /   v
0000240   i   e   w   s   /   E   N   T   R   Y   .   h   t   m   l   ?
0000260   s   u   b   v   i   e   w   =   M   a   i   n   &   a   m   p
0000300   ;   e   n   t   r   y   =   t   1   2   4   .   e   0   2   4
0000320   %  \n   0   <   /   a   >   <   /   t   t   >
0000334
----------

Can someone please tell me what I have done wrong and what I must do to extract the above href automatically?
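
One thing I notice in the dump is that the href value contains \n characters, so the whole tag never appears in a single record when the file is read one line at a time. Below is a minimal sketch of what might work, assuming that is the cause: it reads each file in slurp mode and loops over every match with while and /g. The extract_links helper is a hypothetical name of my own, not something from a module, and a regex like this is only a rough approximation of HTML parsing (for example, it assumes href is the first attribute and is double-quoted).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (my own name, not from any module): return every
# <a href="http://..."> opening tag found in the string passed in.
sub extract_links {
    my ($html) = @_;
    my @links;
    # /s lets . match newlines, so a tag may span several lines;
    # while() with /g resumes after the previous match each time round.
    while ($html =~ m|(<a\s+href="\s*http://.*?">)|sg) {
        push @links, $1;
    }
    return @links;
}

for my $file (@ARGV) {
    open my $fh, '<', $file or die "Cannot open $file: $!\n";
    local $/ = undef;    # slurp mode: the whole file is one record
    my $html = <$fh>;
    close $fh;
    print "$_\n" for extract_links($html);
}
```

The matched tags are printed exactly as they appear in the file, including any newlines inside the attribute value.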

Many thanks in advance.

Chandra
03 Feb 08
