rpupkin77 wrote:
Hi,
I have written this script to run as a cron job that loops through a
text file containing a list of URLs. It works fine for most of the
links, but a number of the URLs are subdomains (they are government
sites), such as http://basename.airforce.mil, and those links always
throw 400 errors even though the sites exist.
Have you looked at urllib/urllib2 (urllib.request in 3.0)
for checking links?
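A quick, untested sketch of what that could look like in Python 3
(urllib2 in 2.x is analogous; "links.txt" below is just a stand-in
for your URL list, one URL per line):

import urllib.request
import urllib.error

# "links.txt" stands in for your URL list, one URL per line.
with open("links.txt") as f:
    for url in (line.strip() for line in f if line.strip()):
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                print(url, resp.getcode())
        except urllib.error.HTTPError as e:
            # The server answered, but with an error status (your 400s).
            print(url, "HTTP error:", e.code)
        except urllib.error.URLError as e:
            # No usable response at all (DNS failure, refused connection, ...).
            print(url, "failed:", e.reason)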
If 'http://basename.airforce.mil' works typed into your browser,
this from the doc for urllib.request.Request might be relevant:
"headers should be a dictionary, and will be treated as if add_header()
was called with each key and value as arguments. This is often used to
“spoof” the User-Agent header, which is used by a browser to identify
itself – some HTTP servers only allow requests coming from common
browsers as opposed to scripts. For example, Mozilla Firefox may
identify itself as "Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127
Firefox/2.0.0.11", while urllib‘s default user agent string is
"Python-urllib/2.6" (on Python 2.6)."