Jorgen Grahn <[EMAIL PROTECTED]> writes:
> - subclassed sgmllib.SGMLParser once for each kind of page I expected to
>   receive. This class knew how to pull the information from a HTML document,
>   provided it looked as I expected it to.  Very tedious work. It can be easier
>   and safer to just use module re in some cases.
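
For what it's worth, the two approaches look roughly like the sketch below.
It assumes a current Python, where sgmllib no longer exists, so it swaps in
html.parser.HTMLParser from the standard library instead. The names
LinkParser, extract_links and extract_hrefs_re are made up for illustration,
and pulling out links is only a stand-in for whatever a real page needs.

```python
import re
from html.parser import HTMLParser


class LinkParser(HTMLParser):
    """Collect (href, text) pairs for every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    """Parser-based approach: tolerant of case and nested inline tags."""
    parser = LinkParser()
    parser.feed(html)
    parser.close()
    return parser.links


def extract_hrefs_re(html):
    """Regex approach: shorter, but only handles double-quoted hrefs."""
    return re.findall(r'<a\s[^>]*href="([^"]*)"', html, re.IGNORECASE)
```

Something like extract_links(page) gives the link text as well as the target,
while extract_hrefs_re(page) is a one-liner that only returns the targets and
breaks as soon as the markup stops looking as expected (single quotes,
unquoted attributes and so on).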

BeautifulSoup is often recommended (never tried it myself).

Remember that HTML Tidy and its offshoots (e.g. tidylib, mxTidy) are
available for cleaning up horrid HTML while you scrape, too.

Alternatively, some people swear by automating Internet Explorer;
other people would rather be hit on the head with a ball-peen hammer
(not only the MS-haters)...

