On Apr 18, 2006, at 2:51 AM, sfantar wrote:
I am currently trying to write a script that gets the URL between the
double quotes in <a href="http...">. Unfortunately, I am not able to
extract the URLs with my regexp.
My regexp only matches the lines that contain the URLs.
Even after reading the docs about regexps, I can't get the URLs themselves.
Thanks for your help.
What regex are you using that isn't working? Assuming you don't want to
get fancy and use a module like HTML::TokeParser or something like
that, try this:
###################################################################
#!/usr/bin/perl
# Always use strict (and warnings)!
use strict;
use warnings;
my( @urls, $document );
$document = '
<html>
<head>
<title>This is an example HTML document</title>
</head>
<body>
<h1>Here is an example document</h1>
There are a few modules on <a href="http://www.cpan.org/">CPAN</a> that
can parse HTML documents, such as <a
href="http://search.cpan.org/~gaas/HTML-Parser-3.51/lib/HTML/TokeParser.pm">HTML::TokeParser</a>,
or <a
href="http://search.cpan.org/~ovid/HTML-TokeParser-Simple-3.15/lib/HTML/TokeParser/Simple.pm">HTML::TokeParser::Simple</a>.
</body>
</html>
';
# Split the document into separate lines and process each line
foreach my $line ( split(/\n/, $document) ) {
    # $line now holds one line of the input document.
    # Search the line for URLs and add them to the @urls array
    push(@urls, $1) while $line =~ /href="(.*?)"/gi;

    # The previous line deserves some explanation. What is happening is:
    # $1 (the text captured by the first set of parens) is 'push'ed onto
    # the @urls array, but that only happens 'while' the $line matches
    # the regex. One very important part of this statement is the 'g'
    # modifier on the end of the regex, which means each match resumes
    # where the previous one stopped, so every URL on the line is found
    # in turn. Without the 'g' modifier, every pass would start over at
    # the beginning of the line, so the loop would keep finding the
    # first URL and never finish. The 'i' (case-insensitive) modifier is
    # important because HTML does not require a specific case for the
    # 'href' part.
}
print join("\n", @urls);
__END__
##################################################################
Output:
http://www.cpan.org/
http://search.cpan.org/~gaas/HTML-Parser-3.51/lib/HTML/TokeParser.pm
http://search.cpan.org/~ovid/HTML-TokeParser-Simple-3.15/lib/HTML/TokeParser/Simple.pm
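
If you want to see what the 'g' modifier does in isolation, here is a small sketch. The sample line and the variable names are made up for the illustration; the regex is the same one used in the script above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical one-line sample containing two links (not from the original post).
my $line = '<a href="http://www.cpan.org/">CPAN</a> and '
         . '<A HREF="http://learn.perl.org/">learn.perl.org</A>';

# Without /g: one match attempt, so only the first URL is captured.
# ('if' is used here on purpose; 'while' without /g would loop forever,
# because every attempt restarts at the beginning of the string.)
my @first_only;
push @first_only, $1 if $line =~ /href="(.*?)"/i;

# With /g in a while loop: each pass resumes where the previous match ended.
my @all;
push @all, $1 while $line =~ /href="(.*?)"/gi;

# With /g in list context: all captures are returned in one go.
my @all_at_once = $line =~ /href="(.*?)"/gi;

print "without g:    @first_only\n";
print "with g:       @all\n";
print "list context: @all_at_once\n";
```

The first list ends up with one URL, the other two with both. The list-context form is handy when you just want everything the regex can find and don't need to do anything else per match.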
I hope that helps you.
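
P.S. If you do decide to go the module route later, here is a rough sketch with HTML::TokeParser. It assumes the module (part of the HTML-Parser distribution on CPAN) is installed; the sample HTML and the variable names are invented for the example. The advantage over the regex is that it copes with single-quoted attributes and with tags that have no href at all.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;   # assumption: installed from CPAN (HTML-Parser)

# Hypothetical sample: an href on the line after its tag name, a
# single-quoted upper-case HREF, and an anchor without any href.
my $html = <<'HTML';
<p>See <a
href="http://www.cpan.org/">CPAN</a> or
<A HREF='http://learn.perl.org/'>learn.perl.org</A>.
<a name="top">no href here</a></p>
HTML

# Passing a reference to a scalar makes the parser read from that string.
my $parser = HTML::TokeParser->new(\$html)
    or die "Cannot create parser: $!";

my @links;
# get_tag('a') skips ahead to the next <a ...> start tag and returns an
# array ref: [ tag name, attribute hash, attribute order, original text ].
while ( my $tag = $parser->get_tag('a') ) {
    my $href = $tag->[1]{href};          # attribute names are lower-cased
    push @links, $href if defined $href; # ignore anchors without an href
}

print join("\n", @links), "\n";
```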
--
Joshua Colson <[EMAIL PROTECTED]>