I'm putting together a regex to pull all of the URLs out of a web page: not
the whole href attribute, just the URL value inside it.
Here's what I've come up with:
preg_match_all('/<.*href\s*=\s*(\"|\')?(.*?)(\s|\"|\'|>)/i', $html, $matches);
foreach ($matches[2] as $m) {
    print "<P>$m\n";
}
Regex masters, please tell me if I'm missing something. It's working well
so far, but I'm still learning Perl-compatible regular expressions and would
appreciate any input.
What's a good way to exclude things like javascript: URLs and other
non-HTTP schemes? What I'm really after is only the http URLs: no ftp, mms,
or anything like that.
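One possible way to do that filtering, as a rough sketch only: run a second
pass over the captured values and keep just the ones that start with http://
or https://. The function name filter_http_urls and the sample list below are
made up for illustration. Note that this assumes only absolute URLs are
wanted, so relative links such as /page.html are dropped as well.

```php
<?php
// Hypothetical helper (not part of the regex above): keep only the values
// that begin with an http:// or https:// scheme, case-insensitively.
function filter_http_urls(array $urls): array
{
    // preg_grep() returns the matching entries with their original keys,
    // so array_values() is used to reindex the result from 0.
    return array_values(preg_grep('#^https?://#i', $urls));
}

// Sample input standing in for $matches[2]; these values are invented.
$candidates = [
    'http://www.php.net/',
    'javascript:void(0)',
    'ftp://ftp.example.com/file.zip',
    'mms://media.example.com/stream',
    'HTTPS://example.com/page?x=1',
    '/relative/path.html',
];

$httpUrls = filter_http_urls($candidates);
foreach ($httpUrls as $u) {
    print "$u\n";
}
```

With the sample list, only the http://www.php.net/ and
HTTPS://example.com/page?x=1 entries survive; in the real script,
$matches[2] would be passed in place of $candidates.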
If it's right, then hopefully someone can use it!
Matt Friedman
--
PHP General Mailing List (http://www.php.net/)