Jorgen Grahn <[EMAIL PROTECTED]> writes:
[...]
> - subclassed sgmllib.SGMLParser once for each kind of page I expected to
>   receive. This class knew how to pull the information from a HTML document,
>   provided it looked as I expected it to.  Very tedious work. It can be easier
>   and safer to just use module re in some cases.
[...]
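
For reference, the subclass-a-parser approach quoted above might look
roughly like the sketch below. It rests on assumptions: sgmllib no longer
exists in Python 3, so the standard library's html.parser.HTMLParser stands
in for it, and the LinkExtractor class, the extract_links helper and the
idea of pulling out links are all made up for illustration, not taken from
the original code.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect (href, text) pairs from <a> tags (hypothetical example class)."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an <a href=...> element.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    """Return the (href, text) pairs found in an HTML string."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links
```

Calling extract_links() on a page's HTML returns a list of (href, text)
tuples. As with the original approach, it only works as long as the page
looks the way the class expects it to.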

BeautifulSoup is often recommended for this (I've never tried it myself).

Remember that HTML Tidy and its offshoots (e.g. tidylib, mxTidy) are
available for cleaning up horrid HTML while you scrape, too.

Alternatively, some people swear by automating Internet Explorer;
other people would rather be hit on the head with a ball-peen hammer
(not only the MS-haters)...


John
-- 
http://mail.python.org/mailman/listinfo/python-list
