Dear Folks,
I am trying to construct a regular expression to extract strings having the
structure
<a href="http://...">
from HTML files, as part of learning regexes. I have used the script below to do
this:
----------
#!/usr/bin/perl
use warnings;
use diagnostics;
use strict;
# Detect and print out all instances of
# <a href="http://..">
# in any file that is input
my ($fh, $file);
# Optionally slurp the file as paragraphs or as a single record
# $/ = "";
# $/ = undef;
while (@ARGV)
{
    $file = shift;
    open $fh, '<', $file or die "Cannot open $file: $!\n";
    while (<$fh>)
    {
        if (m|(<a\s*href="\s*http://.*?">)|sg) {print "$1\n";}
    }
    close $fh;
}
print "Finished\n";
----------
It seems to work OK on most links in the files I have tried out, but not on one
particular href in a valid HTML file, which I have extracted below:
----------
Retrieved 29 June 2006 from <tt><a href="
http://www.oxfordreference.com/views/ENTRY.html?subview=Main&entry=t124.e024%
0">
http://www.oxfordreference.com/views/ENTRY.html?subview=Main&amp;entry=t124.e024%
0</a></tt>
----------
(The HTML file from which this fragment was extracted was converted
automatically from LaTeX to HTML and validates as HTML 4.0 Transitional. It
also renders correctly in a browser.)
The octal dump for this input file is:
----------
0000000 R e t r i e v e d 2 9 J u n
0000020 e 2 0 0 6 f r o m < t t >
0000040 < a h r e f = " \n h t t p
0000060 : / / w w w . o x f o r d r e f
0000100 e r e n c e . c o m / v i e w s
0000120 / E N T R Y . h t m l ? s u b v
0000140 i e w = M a i n & e n t r y = t
0000160 1 2 4 . e 0 2 4 % \n 0 " > \n
0000200 h t t p : / / w w w . o x f o r
0000220 d r e f e r e n c e . c o m / v
0000240 i e w s / E N T R Y . h t m l ?
0000260 s u b v i e w = M a i n & a m p
0000300 ; e n t r y = t 1 2 4 . e 0 2 4
0000320 % \n 0 < / a > < / t t >
0000334
----------
Can someone please tell me what I have done wrong and what I must do to extract
the above href automatically?
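My own guess, which I have not confirmed, is that the script reads one line at
a time, so the pattern never sees a tag that is split across several lines, as
this one is. Below is a minimal sketch of the direction I am considering. It
assumes it is acceptable to read each file whole, and extract_links is only a
hypothetical helper name, not something from the script above:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of the original script): return every
# <a href="http://..."> opening tag found in a string, even when the
# tag is split across lines.
sub extract_links {
    my ($html) = @_;
    # \s matches newlines, and [^"]* crosses line breaks too,
    # so the pattern does not depend on the /s modifier.
    # In list context, /g returns every captured match.
    return $html =~ m{(<a\s+href="\s*http://[^"]*">)}gi;
}

# Read each named file as a single record, so a tag that spans
# several lines is seen whole.
for my $file (@ARGV) {
    open my $fh, '<', $file or die "Cannot open $file: $!\n";
    local $/;                  # undef: slurp the entire file
    my $html = <$fh>;
    close $fh;
    print "$_\n" for extract_links($html);
}

# With no file arguments, try a sample mirroring the fragment above.
unless (@ARGV) {
    my $sample = qq{from <tt><a href="\nhttp://www.oxfordreference.com/views/ENTRY.html?subview=Main&entry=t124.e024%\n0">\n};
    print "$_\n" for extract_links($sample);
}
```

The sketch also collects matches in list context rather than testing with if,
on the assumption that a single line or file may contain more than one link.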
Many thanks in advance.
Chandra
03 Feb 08