I'm putting together a regex to pull all of the URLs out of a web page. Not
the whole tag, just the URL in its href attribute.

Here's what I've come up with:

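// [^>]* keeps the match inside a single tag, so several links on one line are all found
preg_match_all('/<[^>]*\shref\s*=\s*("|\')?(.*?)(\s|"|\'|>)/i', $html,
$matches);
// Escape each URL before printing it back into HTML
foreach ($matches[2] as $m) print "<P>" . htmlspecialchars($m) . "\n";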

Regex masters, please tell me if I'm missing something. It's working
well so far, but I'm still learning Perl-compatible regexes and I'd
appreciate any input.

What's a good way to exclude things like javascript: URLs and other non-URL
info? What I'm really after is just the http URLs: no ftp, mms, or anything
like that.
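
For the http-only filtering, here is one possible sketch rather than a tested solution. The function name extract_http_urls is made up for illustration, and it assumes that "http URLs" means absolute URLs starting with http:// or https://. The value is captured per quoting style, then anything that doesn't start with one of those schemes is dropped:

```php
<?php
// Hypothetical helper (name and behavior are assumptions, not from the post):
// returns the absolute http/https URLs found in href attributes.
function extract_http_urls($html) {
    // (?| ... ) is a PCRE branch-reset group, so whichever alternative
    // matches (double-quoted, single-quoted, or unquoted), the value
    // always lands in capture group 1.
    $pattern = '/<[^>]*\shref\s*=\s*(?|"([^"]*)"|\'([^\']*)\'|([^\s>]+))/i';
    preg_match_all($pattern, $html, $matches);

    $urls = array();
    foreach ($matches[1] as $url) {
        $url = trim($url);
        // Keep only http:// and https:// URLs; this drops javascript:,
        // ftp:, mms:, mailto: and also relative links.
        if (preg_match('#^https?://#i', $url)) {
            $urls[] = $url;
        }
    }
    return $urls;
}
```

Note that this also throws away relative links such as /about.html. If those should count as http URLs, they would need to be resolved against the page's base URL first, which this sketch doesn't attempt.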

If it's right, then hopefully someone can use it!

Matt Friedman


-- 
PHP General Mailing List (http://www.php.net/)