In article <[EMAIL PROTECTED]>, [EMAIL PROTECTED] wrote:
> Hi,
> I'm trying to get the Wikipedia page source with urllib2:
>
>     usock = urllib2.urlopen("http://en.wikipedia.org/wiki/Albert_Einstein")
>     data = usock.read();
>     usock.close();
>     return data
>
> I got an exception because of an HTTP 403 error. Why? With my browser I
> can access it without any problem.
>
> Thanks,
> Shahar.

It appears that Wikipedia may inspect the contents of the User-Agent
HTTP header, and that it does not particularly like the string it
receives from Python's urllib.  I was able to make it work with urllib
via the following code:

  import urllib

  class CustomURLopener(urllib.FancyURLopener):
      version = 'Mozilla/5.0'

  urllib._urlopener = CustomURLopener()

  u = urllib.urlopen('http://en.wikipedia.org/wiki/Albert_Einstein')
  data = u.read()

I'm assuming a similar trick could be used with urllib2, though I didn't
actually try it (a sketch is in the P.S. below).  Another thing to watch
out for is that some sites will redirect a public URL X to an internal
URL Y, and will permit access to Y only if the Referer field indicates
the request came from somewhere internal to the site.  I have seen both
of these techniques used to foil screen-scraping.

Cheers,
-M

-- 
Michael J. Fromberger             | Lecturer, Dept. of Computer Science
http://www.dartmouth.edu/~sting/  | Dartmouth College, Hanover, NH, USA