On May 3, 2010, at 10:13 AM, andrew cooke wrote:

FYI, Fourthought's PyXML has a module called uri.py that contains
regexes for URL validation. I've run over a million URLs (harvested from
the Internet) through their code. I can't say I checked each and every
result, but I never saw anything that would lead me to believe it was
misbehaving.

It might be interesting to compare the results of running a large list
of URLs through your code and theirs.

Good luck
Philip

It's getting a set of URLs that's the main problem.  I've tested it
with URL examples in RFC 3696, and with a few extra ones that test
particular issues, but when I looked around I couldn't find any
public, obvious list of URLs for general testing.  Could I use your
list?

Also, same for emails...

If I still had a list of URLs you'd be welcome to it. The list was generated as part of a spidering project that's long gone.

If all you want to do is generate a list of URLs and email addresses, you could cobble together a robots.txt-respectful spider without too much trouble. As with so many things, it's just an SMOP [1]. =)

[1] - http://en.wikipedia.org/wiki/Small_matter_of_programming
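
For what it's worth, here is a rough sketch of the per-page part of such a spider, using only the standard library. It is a minimal illustration and not code from the spidering project mentioned above: the names harvest, crawlable, LinkCollector and the "example-spider" user agent are made up for the example, the email pattern is deliberately loose, and fetching pages and robots.txt over the network is left out.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser

# Deliberately loose pattern (an assumption): it only harvests candidates
# to feed to a validator, it is not a validator itself.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


class LinkCollector(HTMLParser):
    """Collect absolute URLs and email addresses from one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = set()
        self.emails = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        if href.lower().startswith("mailto:"):
            # Drop the "mailto:" prefix and any "?subject=..." suffix.
            self.emails.add(href[7:].split("?")[0])
        else:
            # Resolve relative links against the page's own URL.
            self.urls.add(urljoin(self.base_url, href))

    def handle_data(self, data):
        # Also pick up addresses that appear in plain text.
        self.emails.update(EMAIL_RE.findall(data))


def harvest(base_url, html):
    """Return (urls, emails) found in the given page source."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    collector.close()
    return collector.urls, collector.emails


def crawlable(urls, robots_lines, agent="example-spider"):
    """Keep only the URLs that the given robots.txt lines allow for agent.

    Assumes every URL belongs to the site the robots.txt came from.
    """
    robots = RobotFileParser()
    robots.parse(robots_lines)
    return {url for url in urls if robots.can_fetch(agent, url)}
```

Every URL that harvest() finds is usable as test data, whether or not the spider may fetch it; crawlable() only decides which pages to visit next. A real spider would also need a fetch loop, a politeness delay, and a separate robots.txt per host.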

bye
Philip


--
http://mail.python.org/mailman/listinfo/python-list
