In article <[EMAIL PROTECTED]>,
 [EMAIL PROTECTED] wrote:

> Hi,
> I'm trying to get a Wikipedia page's source with urllib2:
>     usock = urllib2.urlopen("http://en.wikipedia.org/wiki/Albert_Einstein")
>     data = usock.read()
>     usock.close()
>     return data
> I get an exception because of an HTTP 403 error. Why? With my browser I
> can access it without any problem.
>
> Thanks,
> Shahar.

It appears that Wikipedia may inspect the contents of the User-Agent
HTTP header, and that it does not particularly like the string it
receives from Python's urllib.  I was able to make it work with urllib
via the following code:

import urllib

class CustomURLopener(urllib.FancyURLopener):
    version = 'Mozilla/5.0'

urllib._urlopener = CustomURLopener()

u = urllib.urlopen('http://en.wikipedia.org/wiki/Albert_Einstein')
data = u.read()

I'm assuming a similar trick could be used with urllib2, though I didn't
actually try it.  Another thing to watch out for is that some sites
will redirect a public URL X to an internal URL Y, and permit access to
Y only if the Referer field indicates the request came from somewhere
internal to the site.  I have seen both of these techniques used to foil
screen-scraping.
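For urllib2, the equivalent would be to attach the headers to a Request
object before opening it.  This is just a sketch (I haven't run it
against Wikipedia); the User-Agent and Referer values are illustrative,
carried over from the urllib example above:

```python
try:
    from urllib2 import Request, urlopen            # Python 2
except ImportError:
    from urllib.request import Request, urlopen     # renamed in Python 3

# Build a request carrying a browser-like User-Agent header.
req = Request('http://en.wikipedia.org/wiki/Albert_Einstein',
              headers={'User-Agent': 'Mozilla/5.0'})

# A Referer header can be added the same way, for sites that check it.
req.add_header('Referer', 'http://en.wikipedia.org/')

# data = urlopen(req).read()   # performs the actual fetch
```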

Cheers,
-M

-- 
Michael J. Fromberger             | Lecturer, Dept. of Computer Science
http://www.dartmouth.edu/~sting/  | Dartmouth College, Hanover, NH, USA
-- 
http://mail.python.org/mailman/listinfo/python-list