Duncan Booth <[EMAIL PROTECTED]> writes:
> Gabriel Zachmann wrote:
>
> > Here is a very simple Python script utilizing urllib: [...]
> >     url = "http://commons.wikimedia.org/wiki/Commons:Featured_pictures/chronological"
> >     print url
> >     print
> >     file = urllib.urlopen( url ) [...]
> >
> > However, when I execute it, I get an HTML error ("access denied").
> >
> > On the one hand, the funny thing though is that I can view the page
> > fine in my browser, and I can download it fine using curl. [...]
> >
> > On the other hand, it must have something to do with the URL because
> > urllib works fine with any other URL I have tried ...
>
> It looks like wikipedia checks the User-Agent header and refuses to send
> pages to browsers it doesn't like. Try: [...]
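
Duncan's suggestion is elided above, but a minimal sketch of the general approach he describes (sending a custom User-Agent header) might look like the following. It swaps in urllib2 for urllib, and the header string is only an illustrative placeholder, not anything Wikimedia requires:

    import urllib2

    url = "http://commons.wikimedia.org/wiki/Commons:Featured_pictures/chronological"

    # Build a request that identifies itself with an explicit User-Agent;
    # the value here is just an example string, not a recommendation.
    req = urllib2.Request(url, headers={"User-Agent": "example-script/0.1"})
    html = urllib2.urlopen(req).read()
    print html[:200]
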
If wikipedia is trying to discourage this kind of scraping, it's probably
not polite to do it. (I don't know what wikipedia's policies are, though.)

John