On Fri, Jun 17, 2011 at 6:19 PM, gervaz <ger...@gmail.com> wrote:
> The fact is that I have a list of urls and I wanted to retrieve the
> minimum necessary information in order to understand if the link is a
> valid html page or e.g. a picture or something else. As far as I
> understood here http://www.w3.org/Protocols/rfc2616/rfc2616-sec9.html
> the HEAD command is the one that let you do this. But it seems it
> doesn't work.
It's not working because of a few issues. Twitter doesn't accept
requests that come without a Host: header, so you'll need to provide
that. Also, your "HTTP 1.0" is being sent as the body of the request,
which is unnecessary.

What you were getting was a 301 redirect, as you can confirm thus:

>>> r.getcode()
301
>>> r.getheaders()
[('Date', 'Fri, 17 Jun 2011 08:31:31 GMT'), ('Server', 'Apache'),
 ('Location', 'http://twitter.com/'), ('Cache-Control', 'max-age=300'),
 ('Expires', 'Fri, 17 Jun 2011 08:36:31 GMT'),
 ('Vary', 'Accept-Encoding'), ('Connection', 'close'),
 ('Content-Type', 'text/html; charset=iso-8859-1')]

(Note the Location header - the server's asking you to go to
twitter.com by name.)

>>> h.request("HEAD","/",None,{"Host":"twitter.com"})
>>> r=h.getresponse()

Now we have a request that the server's prepared to answer:

>>> r.getcode()
200

The headers are numerous, so I won't quote them here, but you get a
Content-Length which tells you the size of the page that you would
get, plus a few others that may be of interest. But note that there's
still no body on a HEAD request:

>>> r.read()
b''

If you want to check validity, the most important part is the status
code:

>>> h.request("HEAD","/aasdfadefa",None,{"Host":"twitter.com"})
>>> r=h.getresponse()
>>> r.getcode()
404

Twitter might be a bad example for this, though, as the above call
will succeed if there is a user of that name (for instance, replacing
"/aasdfadefa" with "/rosuav" changes the response to a 200).

You also have to contend with the possibility that the server won't
allow HEAD requests at all, in which case just fall back on GET.

Even then, none of this is certain. Some misconfigured servers
actually send a 200 response when a page doesn't exist. But you can
probably ignore those sorts of hassles, and just code to the standard.

Hope that helps!

Chris Angelico
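
For reference, here is a rough sketch in Python 3 of the whole check: send HEAD, fall back on GET if the server refuses it, then look at the status code and Content-Type. The names classify and probe are made up for illustration, and treating 405 or 501 as "HEAD not allowed" is an assumption about how a server signals that. http.client fills in the Host: header itself from the host name given to the connection, so it is not passed explicitly here.

```python
import http.client
from urllib.parse import urlsplit


def classify(status, content_type):
    """Turn a status code and Content-Type header into a rough label."""
    if 300 <= status < 400:
        return "redirect"
    if status != 200:
        return "error"
    # Drop parameters such as "; charset=utf-8" before comparing.
    media_type = (content_type or "").split(";")[0].strip().lower()
    if media_type in ("text/html", "application/xhtml+xml"):
        return "html"
    if media_type.startswith("image/"):
        return "image"
    return "other"


def probe(url, timeout=10):
    """Send HEAD (falling back on GET); return (status, label, location)."""
    parts = urlsplit(url)
    conn_class = (http.client.HTTPSConnection if parts.scheme == "https"
                  else http.client.HTTPConnection)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    for method in ("HEAD", "GET"):
        conn = conn_class(parts.hostname, parts.port, timeout=timeout)
        try:
            conn.request(method, path)
            response = conn.getresponse()
            status = response.status
            content_type = response.getheader("Content-Type")
            location = response.getheader("Location")
        finally:
            # The body is never read; only the headers are of interest.
            conn.close()
        # Assumption: 405 or 501 means the server refuses HEAD, so retry
        # once with GET.
        if method == "HEAD" and status in (405, 501):
            continue
        return status, classify(status, content_type), location
```

Calling probe("http://twitter.com/") returns a tuple of the status code, one of the labels above, and the Location header if there was one. A "redirect" result means the request should be repeated against that location, and a server that answers 200 for missing pages will still come back as "html".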