webspider, regexp not working, why?

notnorwegian Fri, 23 May 2008 09:46:16 -0700

url = re.compile(r"^((ht|f)tp(s?)\:\/\/|~/|/)?([\w]+:\w+@)?([a-zA-Z]{1}([\w\-]+\.)+([\w]{2,5}))(:[\d]{1,5})?((/?\w+/)+|/?)(\w+\.[\w]{3,4})?((\?\w+=\w+)?(&\w+=\w+)*)?")


Why isn't this url regexp catching something like:

<link rel="alternate" type="application/rss+xml" title="Python Screencasts"
        href="http://www.showmedo.com/latestVideoFeed/rss2.0?tag=python" />

import urllib  # Python 2: urllib.urlopen

site = urllib.urlopen("http://www.python.org")
for row in site:
    # try the url regexp on every row of the page
    obj = url.search(row)
    if obj is not None:
        print "url: ", obj.group()

I know the regexp works because it can catch
www.hello.com in a txt-file, and I can catch emails on websites with
another regexp.

search and match yield the same results.

But when you put something like href= in front of it, it doesn't work.

I see now that it has to match at the beginning of the row or something,
because:
hi www.google.com
doesnt match but
www.google.com  hi
matches.


I thought a regexp would search a row/file and report an occurrence
when it finds one, so a regexp of "lo" would match in lopez.
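
The behaviour described above can be reproduced with a much smaller pattern. This is only a sketch: the two patterns below are simplified stand-ins, not the full url regexp, and the helper name first_match is made up for illustration. It suggests that the leading ^ is what ties the match to the start of the row, since the same pattern without ^ is found anywhere in the row.

```python
import re

# Simplified stand-in patterns (assumption: NOT the full url regexp above).
anchored = re.compile(r"^www\.\w+\.\w+")    # leading ^ : start of the row only
unanchored = re.compile(r"www\.\w+\.\w+")   # no ^ : anywhere in the row


def first_match(pattern, text):
    """Hypothetical helper: return the matched text, or None if no match."""
    m = pattern.search(text)
    return m.group() if m is not None else None


rows = ["www.google.com  hi", "hi www.google.com"]

# One (row, anchored result, unanchored result) tuple per row.
results = [(row, first_match(anchored, row), first_match(unanchored, row))
           for row in rows]

for row, with_caret, without_caret in results:
    print("%r: with ^ -> %r, without ^ -> %r" % (row, with_caret, without_caret))

# A plain "lo" has no anchor, so search() finds it inside longer words too.
lo_in_lopez = first_match(re.compile("lo"), "lopez")
lo_in_hello = first_match(re.compile("lo"), "hello")
```

With the ^ in place, search() only reports a match for the row that starts with www.google.com; without it, both rows match.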
--
http://mail.python.org/mailman/listinfo/python-list
