Re: How to get the argument of the ?

Joshua Colson Tue, 18 Apr 2006 09:09:37 -0700

On Apr 18, 2006, at 2:51 AM, sfantar wrote:

I am at the moment trying to write a script which gets the URL between the quotes in <a href="http...">. Unfortunately, I am not able to get them with my regexp.
My regexp only matches the lines containing the URLs.
After reading the docs about regexps, I still can't get the URLs themselves.
Thanks for your help.

What regex are you using that isn't working? Assuming you don't want to get fancy and use a module like HTML::TokeParser or something like that, try this:


###################################################################

#!/usr/bin/perl

# Always use strict (and warnings)!
use strict;
use warnings;

my( @urls, $document );

$document = '
<html>
<head>
<title>This is an example HTML document</title>
</head>
<body>
<h1>Here is an example document</h1>

There are a few modules on <a href="http://www.cpan.org/">CPAN</a> that
can parse HTML documents, such as
<a href="http://search.cpan.org/~gaas/HTML-Parser-3.51/lib/HTML/TokeParser.pm">HTML::TokeParser</a>,
or
<a href="http://search.cpan.org/~ovid/HTML-TokeParser-Simple-3.15/lib/HTML/TokeParser/Simple.pm">HTML::TokeParser::Simple</a>.

</body>
</html>
';

# Split the document into separate lines and process each line
foreach my $line ( split(/\n/, $document) ) {
  # $line now holds one line of the input document.
  # Search the line for URLs and add them to the @urls array
  push(@urls, $1) while $line =~ /href="(.*?)"/gi;
  # The previous line deserves some explanation. What is happening is:
  # $1 (the pattern matched between the first set of parens) is 'push'ed
  # onto the @urls array, but that only happens 'while' the $line matches
  # the regex. One very important part of this statement is the 'g'
  # modifier on the end of the regex, which means each test resumes where
  # the previous match ended, so every URL on the line is found in turn.
  # Without the 'g' modifier, the match would restart from the beginning
  # of the line every time, so the loop would keep finding the first URL
  # and never finish. The 'i' (case-insensitive) modifier is important
  # because HTML does not require a specific case for the 'href' part.
}

print join("\n", @urls);

__END__

##################################################################

Output:
http://www.cpan.org/
http://search.cpan.org/~gaas/HTML-Parser-3.51/lib/HTML/TokeParser.pm
http://search.cpan.org/~ovid/HTML-TokeParser-Simple-3.15/lib/HTML/TokeParser/Simple.pm
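
To make the effect of the two modifiers a bit more concrete, here is a small sketch that runs the same regex in list context, where a match returns its captures directly. The sample HTML and the variable names are made up for illustration; they are not part of the script above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample input, not taken from the script above.
my $html = '<p><a href="http://www.cpan.org/">CPAN</a> and '
         . '<A HREF="http://learn.perl.org/">learn.perl.org</A></p>';

# In list context, a /g match returns every capture of the group at once.
my @all = $html =~ /href="(.*?)"/gi;

# Without /g, a list-context match returns only the first capture.
my @first = $html =~ /href="(.*?)"/i;

# Without /i, the upper-case HREF attribute is not matched at all.
my @lower_only = $html =~ /href="(.*?)"/g;

print "all:        @all\n";
print "first:      @first\n";
print "lower only: @lower_only\n";
```

If the whole document is already in one string, the list-context form also saves you the split into lines, though the line-by-line loop is probably easier to follow when you are starting out.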


I hope that helps you.
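
If you later decide to go the module route after all, something along these lines should work with HTML::TokeParser. Treat it as a rough sketch: it assumes the HTML-Parser distribution is installed from CPAN, and the sample HTML and variable names are invented for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;    # part of the HTML-Parser distribution on CPAN

# Hypothetical sample input; note the single quotes and the upper-case tag,
# both of which the simple regex approach would have trouble with.
my $html = q{<p><a href='http://www.cpan.org/'>CPAN</a>,
<A HREF="http://learn.perl.org/">learn.perl.org</A>,
<a name="top">an anchor without href</a></p>};

# Passing a reference to a string makes the parser read from that string.
my $parser = HTML::TokeParser->new( \$html )
    or die "Could not create parser";

my @links;

# get_tag('a') skips ahead to the next <a ...> start tag and returns an
# array ref whose second element is a hash of the tag's attributes.
while ( my $tag = $parser->get_tag('a') ) {
    my $href = $tag->[1]{href};
    push @links, $href if defined $href;
}

print join( "\n", @links ), "\n";
```

The parser lower-cases tag and attribute names by default and accepts either quote style, so you do not have to handle those cases in the pattern yourself.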

--
Joshua Colson <[EMAIL PROTECTED]>


