In article <[EMAIL PROTECTED]>,
 [EMAIL PROTECTED] wrote:

> Hi,
> I'm trying to get a Wikipedia page's source with urllib2:
>     usock = urllib2.urlopen("http://en.wikipedia.org/wiki/Albert_Einstein")
>     data = usock.read()
>     usock.close()
>     return data
> I get an exception because of an HTTP 403 error. Why? With my browser I
> can access it without any problem.
>
> Thanks,
> Shahar.

It appears that Wikipedia may inspect the contents of the User-Agent
HTTP header, and that it does not particularly like the string it
receives from Python's urllib.  I was able to make it work with urllib
via the following code:

import urllib

class CustomURLopener(urllib.FancyURLopener):
    version = 'Mozilla/5.0'

urllib._urlopener = CustomURLopener()

u = urllib.urlopen('http://en.wikipedia.org/wiki/Albert_Einstein')
data = u.read()

I'm assuming a similar trick could be used with urllib2, though I didn't
actually try it.  Another thing to watch out for is that some sites
will redirect a public URL X to an internal URL Y, and permit access to
Y only if the Referer field indicates the request came from somewhere
internal to the site.  I have seen both of these techniques used to foil
screen-scraping.
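For urllib2, the equivalent would be to attach the headers to a Request
object before opening it.  This is just a sketch (I haven't run it
against Wikipedia); the User-Agent and Referer values are illustrative,
carried over from the urllib example above:

```python
try:
    from urllib2 import Request, urlopen            # Python 2
except ImportError:
    from urllib.request import Request, urlopen     # renamed in Python 3

# Build a request carrying a browser-like User-Agent header.
req = Request('http://en.wikipedia.org/wiki/Albert_Einstein',
              headers={'User-Agent': 'Mozilla/5.0'})

# A Referer header can be added the same way, for sites that check it.
req.add_header('Referer', 'http://en.wikipedia.org/')

# data = urlopen(req).read()   # performs the actual fetch
```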

Cheers,
-M

-- 
Michael J. Fromberger             | Lecturer, Dept. of Computer Science
http://www.dartmouth.edu/~sting/  | Dartmouth College, Hanover, NH, USA
-- 
http://mail.python.org/mailman/listinfo/python-list