On Apr 18, 2006, at 2:51 AM, sfantar wrote:

I am currently trying to write a script that gets the URL between the double quotes in <a href="http...">. Unfortunately, I am not able to get the URLs with my regexp.
My regexp only matches the lines containing the URLs.
Even after reading the docs about regexps, I can't get the URLs themselves.

Thanks for your help.

What regex are you using that isn't working? Assuming you don't want to get fancy and use a module like HTML::TokeParser or something like that, try this:

###################################################################

#!/usr/bin/perl

# Always use strict (and warnings)!
use strict;
use warnings;

my( @urls, $document );

$document = '
<html>
<head>
<title>This is an example HTML document</title>
</head>
<body>
<h1>Here is an example document</h1>
There are a few modules on <a href="http://www.cpan.org/">CPAN</a> that can parse HTML documents, such as <a href="http://search.cpan.org/~gaas/HTML-Parser-3.51/lib/HTML/TokeParser.pm">HTML::TokeParser</a>, or <a href="http://search.cpan.org/~ovid/HTML-TokeParser-Simple-3.15/lib/HTML/TokeParser/Simple.pm">HTML::TokeParser::Simple</a>.
</body>
</html>
';

# Split the document into separate lines and process each line
foreach my $line ( split(/\n/, $document) ) {
  # $line now holds one line of the input document.
  # Search the line for URLs and add them to the @urls array
  push(@urls, $1) while $line =~ /href="(.*?)"/gi;
  # The previous line deserves some explanation. What is happening is:
  # $1 (the pattern matched between the first set of parens) is 'push'ed onto
  # the @urls array, but that only happens 'while' the $line matches the
  # regex. One very important part of this statement is the 'g' modifier on
  # the end of the regex, which means match as many times as you can on the
  # line, resuming each time where the previous match ended. Without the 'g'
  # modifier, every test would start again from the beginning of the line,
  # so the 'while' would never finish on a line that contains a URL. The 'i'
  # (case-insensitive) modifier is important because HTML does not require a
  # specific case for the 'href' part.
}

print join("\n", @urls);

__END__

##################################################################

Output:
http://www.cpan.org/
http://search.cpan.org/~gaas/HTML-Parser-3.51/lib/HTML/TokeParser.pm
http://search.cpan.org/~ovid/HTML-TokeParser-Simple-3.15/lib/HTML/TokeParser/Simple.pm
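
If the HTML you are dealing with is less tidy, the same 'while ... /g' idea can be applied to the whole document at once instead of line by line. The following is only a rough sketch under some assumptions: the helper name extract_hrefs and the sample input are made up for illustration, and it assumes every href value is quoted with either double or single quotes. It is still a regex and not a real HTML parser, so unquoted attributes or hrefs inside comments will trip it up.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of the script above): returns every
# quoted href value found in an HTML string, in document order.
sub extract_hrefs {
    my ($html) = @_;
    my @found;
    # (["']) captures the opening quote and \1 requires the same closing
    # quote, so $2 holds the URL itself. 'g' resumes after each match,
    # 'i' ignores case, and 's' lets '.' match a newline as well.
    while ( $html =~ /href\s*=\s*(["'])(.*?)\1/gis ) {
        push @found, $2;
    }
    return @found;
}

# Made-up sample: one tag is split across lines and uses single quotes.
my $sample = <<'END_HTML';
<a href="http://www.cpan.org/">CPAN</a>
<A
   HREF = 'http://search.cpan.org/'>search</A>
END_HTML

print "$_\n" for extract_hrefs($sample);
```

Because the pattern runs against the whole string, a tag that is wrapped over several lines (like the second one in the sample) is still found, which the line-by-line version could miss.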

I hope that helps you.

--
Joshua Colson <[EMAIL PROTECTED]>

