Jorgen Grahn <[EMAIL PROTECTED]> writes:
[...]
> - subclassed sgmllib.SGMLParser once for each kind of page I expected to
>   receive. This class knew how to pull the information from a HTML document,
>   provided it looked as I expected it to. Very tedious work. It can be easier
>   and safer to just use module re in some cases.
[...]
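
For what it's worth, the one-parser-subclass-per-page approach quoted above might look something like the sketch below. It is only an illustration under my own assumptions: it uses html.parser.HTMLParser from the Python 3 standard library (sgmllib was removed in Python 3), and the LinkExtractor class and the sample page are made up for the example. The last few lines show the "just use re" alternative for the same job.

```python
import re
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Hypothetical parser: collects (href, text) for each <a href=...>."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> we are currently inside, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names already lower-cased.
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


# Made-up sample page, deliberately a bit untidy (mixed-case tags).
page = '<p>See <a href="/a">first</a> and <A HREF="/b">second <b>one</b></A>.</p>'

parser = LinkExtractor()
parser.feed(page)
parser.close()
print(parser.links)

# The regular-expression route: shorter, but it only copes with
# double-quoted href values and knows nothing about nesting.
hrefs = re.findall(r'<a\s+[^>]*href="([^"]*)"', page, flags=re.IGNORECASE)
print(hrefs)
```

The parser version survives the nested <b> inside the link text, which the regular expression never even looks at; the regular expression is fine as long as all you want is the URLs and the markup stays as regular as you expect.
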
BeautifulSoup is often recommended (I've never tried it myself). Remember that
HTML Tidy and its offshoots (e.g. tidylib, mxTidy) are also available for
cleaning up horrid HTML while you scrape.

Alternatively, some people swear by automating Internet Explorer; other people
(and not only the MS-haters) would rather be hit on the head with a ball-peen
hammer...

John