On 21-08-2010 14:46, mdipierro wrote:
> what do you find that is strange?

This is the result with the last letter removed, so all links should give an error, but they differ with the 2 methods, and some of them produce 200, while they are definitely wrong:

404 500 http://127.0.0.1:8000/welcome/default/user/logi
404 500 http://127.0.0.1:8000/welcome/default/user/registe
404 500 http://127.0.0.1:8000/welcome/default/user/request_reset_passwor
200 500 http://127.0.0.1:8000/welcome/default
400 500 http://127.0.0.1:8000/welcome/default/inde
200 500 http://127.0.0.1:8000/admin/default/design/welcom
200 500 http://127.0.0.1:8000/admin/default/edit/welcome/controllers/default.p
200 500 http://127.0.0.1:8000/admin/default/edit/welcome/views/default/index.htm
200 500 http://127.0.0.1:8000/admin/default/edit/welcome/views/layout.htm
200 500 http://127.0.0.1:8000/admin/default/edit/welcome/static/base.cs
200 500 http://127.0.0.1:8000/admin/default/edit/welcome/models/db.p
200 500 http://127.0.0.1:8000/admin/default/edit/welcome/models/menu.p
400 500 http://127.0.0.1:8000/welcome/appadmin/inde
200 500 http://127.0.0.1:8000/admin/default/inde
400 400 http://127.0.0.1:8000/examples/default/inde
200 -1 http://web2py.co
400 400 http://web2py.com/boo
400 500 http://127.0.0.1:8000/welcome/default/inde
200 500 http://127.0.0.1:8000/welcome/default
200 500 http://127.0.0.1:8000/admin/default/peek/welcome/controllers/default.p
200 500 http://127.0.0.1:8000/admin/default/peek/welcome/views/default/index.htm
200 -1 http://www.web2py.co

This is the normal result:

200 500 http://127.0.0.1:8000/welcome/default/user/login
200 500 http://127.0.0.1:8000/welcome/default/user/register
200 500 http://127.0.0.1:8000/welcome/default/user/request_reset_password
200 500 http://127.0.0.1:8000/welcome/default
200 500 http://127.0.0.1:8000/welcome/default/index
200 500 http://127.0.0.1:8000/admin/default/design/welcome
200 500 http://127.0.0.1:8000/admin/default/edit/welcome/controllers/default.py
200 500 http://127.0.0.1:8000/admin/default/edit/welcome/views/default/index.html
200 500 http://127.0.0.1:8000/admin/default/edit/welcome/views/layout.html
200 500 http://127.0.0.1:8000/admin/default/edit/welcome/static/base.css
200 500 http://127.0.0.1:8000/admin/default/edit/welcome/models/db.py
200 500 http://127.0.0.1:8000/admin/default/edit/welcome/models/menu.py
200 500 http://127.0.0.1:8000/welcome/appadmin/index
200 500 http://127.0.0.1:8000/admin/default/index
200 200 http://127.0.0.1:8000/examples/default/index
200 200 http://web2py.com
200 500 http://web2py.com/book
200 500 http://127.0.0.1:8000/welcome/default/index
400 500 http://127.0.0.1:8000/welcome/default/index#
200 500 http://127.0.0.1:8000/admin/default/peek/welcome/controllers/default.py
200 500 http://127.0.0.1:8000/admin/default/peek/welcome/views/default/index.html
200 200 http://www.web2py.com

So when is a URL valid?

thanks,
Stef

> On Aug 21, 7:32 am, Stef Mientki <stef.mien...@gmail.com> wrote:
>>> Graphical representation of links or pages that don't get linked to.
>> I tried to test the links (with 2 algorithms, code below) in a generated
>> webpage, but the results I get are very weird.
>> Probably one of you knows a better way?
>>
>> cheers,
>> Stef
>>
>> from BeautifulSoup import BeautifulSoup
>> from urllib import urlopen
>> from httplib import HTTP
>> from urlparse import urlparse
>>
>> def Check_URL_1 ( URL ) :
>>     # Method 1: fetch the URL with urllib and look at the status code
>>     try:
>>         fh = urlopen ( URL )
>>         return fh.code == 200
>>     except :
>>         return False
>>
>> def Check_URL_2 ( URL ) :
>>     # Method 2: send a HEAD request with httplib
>>     p = urlparse ( URL )
>>     h = HTTP ( p[1] )
>>     h.putrequest ( 'HEAD', p[2] )
>>     h.endheaders()
>>     if h.getreply()[0] == 200:
>>         return True
>>     else:
>>         return False
>>
>> def Verify_Links ( URL ) :
>>     Parts = URL.split('/')
>>     Site = '/'.join ( Parts [:3] )       # scheme and host
>>     Current = '/'.join ( Parts [:-1] )   # URL without its last segment
>>
>>     fh = urlopen ( URL )
>>     lines = fh.read ()
>>     fh.close()
>>
>>     Soup = BeautifulSoup ( lines )
>>     hrefs = Soup.findAll ( 'a' )
>>
>>     for href in hrefs :
>>         href = href [ 'href' ] #[:-1] ## <== remove "#" to generate all errors
>>
>>         if href.startswith ( '/' ) :
>>             href = Site + href
>>         elif href.startswith ('#' ) :
>>             href = URL + href
>>         elif href.startswith ( 'http' ) :
>>             pass
>>         else :
>>             # relative link: a '/' is needed between the two parts
>>             href = Current + '/' + href
>>
>>         print Check_URL_1 ( href ), Check_URL_2 ( href ), href
>>
>> URL = 'http://127.0.0.1:8000/welcome/default/index'
>> Verify_Links ( URL )
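
One way the two checks could be made easier to compare is to have both return the raw status code instead of True/False. The following is only a sketch, written for the Python 3 standard library rather than the Python 2 modules quoted above. The function names (resolve_link, status_via_get, status_via_head) and the -1 value for "no HTTP reply at all" are assumptions made for this example, not part of web2py or of the original script.

```python
import http.client
import urllib.error
import urllib.request
from urllib.parse import urljoin, urlsplit


def resolve_link(page_url, href):
    """Turn an href found on page_url into an absolute URL."""
    # urljoin handles '/abs', 'relative', '#fragment' and full URLs alike.
    return urljoin(page_url, href)


def status_via_get(url, timeout=10):
    """Status code of a GET request, or -1 if no HTTP reply was received."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as reply:
            return reply.status
    except urllib.error.HTTPError as err:
        # In Python 3, 4xx and 5xx replies are raised; the code is on the error.
        return err.code
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        return -1


def status_via_head(url, timeout=10):
    """Status code of a HEAD request, or -1 if no HTTP reply was received."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        return -1
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    try:
        # Send only path and query; the '#fragment' part stays on the client.
        target = parts.path or "/"
        if parts.query:
            target += "?" + parts.query
        conn.request("HEAD", target)
        return conn.getresponse().status
    except (http.client.HTTPException, OSError):
        return -1
    finally:
        conn.close()


def report(page_url, hrefs):
    """Print both status codes for every href taken from page_url."""
    for href in hrefs:
        url = resolve_link(page_url, href)
        print(status_via_get(url), status_via_head(url), url)
```

Calling report('http://127.0.0.1:8000/welcome/default/index', ['/admin/default/index', 'user/login']) would print one line per link in the same layout as the tables above. Even then the two columns can legitimately differ: urlopen follows redirects and reports the status of the final page, while the plain HEAD request reports the first reply, and a server is free to answer HEAD differently from GET.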