Jorgen Grahn <[EMAIL PROTECTED]> writes:

[...]

> I did it this way successfully once ... it's probably the wrong approach
> in some ways, but It Works For Me.
>
> - used httplib.HTTPConnection for the HTTP parts, building my own
>   requests with headers and all, calling h.send() and h.getresponse()
>   etc.
>
> - created my own cookie container class (because there was a session
>   involved, and logging in and such things, and all of it used cookies)
>
> - subclassed sgmllib.SGMLParser once for each kind of page I expected
>   to receive. This class knew how to pull the information from an HTML
>   document, provided it looked as I expected it to. Very tedious work.
>   It can be easier and safer to just use module re in some cases.
>
> Wrapped in classes this ended up as (fictive):
>
>   client = Client('somehost:80')
>   client.login('me', 'secret')
>   a, b = theAsAndBs(client, 'tomorrow', 'Wiltshire')
>   foo = theFoo(client, 'yesterday')
>
> I had to look deeply into the HTTP RFCs to do this, and also snoop the
> traffic for a "real" session to see what went on between server and
> client.
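For anyone following along, the sgmllib part of the approach quoted above
tends to look roughly like the sketch below. The 'price' cells, the sample
page and the class name are all invented; a real scraper needs one such
parser per kind of page it expects:

import sgmllib

class PriceParser(sgmllib.SGMLParser):
    # Collect the text of every <td class="price"> cell.
    def __init__(self):
        sgmllib.SGMLParser.__init__(self)
        self.in_price = False
        self.prices = []

    def start_td(self, attrs):        # called for each <td ...>
        if ('class', 'price') in attrs:
            self.in_price = True

    def end_td(self):                 # called for each </td>
        self.in_price = False

    def handle_data(self, data):      # text between tags
        if self.in_price:
            self.prices.append(data.strip())

html_page = """<table>
<tr><td class="price">12.50</td><td>widget</td></tr>
<tr><td class="price">3.20</td><td>gadget</td></tr>
</table>"""

parser = PriceParser()
parser.feed(html_page)
parser.close()
print parser.prices               # ['12.50', '3.20']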
I see little benefit and significant loss in using httplib instead of
urllib2, unless and until you get a particularly stubborn problem and
want to drop down a level to debug. It's easy to see and modify urllib2's
headers if you need to get low level.

One starting point for web scraping with Python:

http://wwwsearch.sourceforge.net/bits/GeneralFAQ.html

There are some modules you may find useful there, too.

Google Groups for urlencode. Or use my module ClientForm, if you prefer.
Experiment a little with an HTML form in a local file and (e.g.) the
'ethereal' sniffer to see what happens when you click submit (there's a
rough sketch of such a form POST at the end of this message).

The stdlib now has cookie support (in Python 2.4):

import cookielib, urllib2
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
r = opener.open("http://example.com/")
print r.read()

Unfortunately, it's true that network sniffing and a reasonable
smattering of knowledge about HTTP &c. do often turn out to be necessary
to scrape stuff. A few useful tips:

http://wwwsearch.sourceforge.net/ClientCookie/doc.html#debugging


John
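
For reference, a rough sketch of the sort of form POST mentioned above,
using urllib2 and urllib.urlencode (the URL and the field names are
invented; take the real ones from the <form> element of the page, or let
ClientForm work them out, and note this needs Python 2.4 for cookielib,
like the snippet above):

import urllib, urllib2, cookielib

# One CookieJar shared by every request in the session.
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))

# urlencode turns a dict into "user=me&password=secret"; passing it as
# the data argument makes urllib2 send a POST instead of a GET.
form_data = urllib.urlencode({'user': 'me', 'password': 'secret'})
response = opener.open("http://example.com/login", form_data)
print response.read()

# Later requests through the same opener automatically send back any
# session cookies the server set during login.
response = opener.open("http://example.com/members/report")
print response.read()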