Re: Parsing HTML, extracting text and changing attributes.

Jay Loden Mon, 18 Jun 2007 09:38:57 -0700

Neil Cerutti wrote:
> You could get good results, and save yourself some effort, using
> links or lynx with the command line options to dump page text to
> a file. Python would still be needed to automate calling links or
> lynx on all your documents.


The OP was looking for a way to parse out part of the file and apply classes
to certain types of tags. lynx or links wouldn't help with that, since their
dump output is plain text and the goal isn't to strip all the formatting.

Someone else mentioned lxml, but as I understand it, lxml's XML parser will
only work if the input is well-formed XHTML. Assuming it's not (real-world
HTML almost never is), BeautifulSoup will probably fare better.

http://www.crummy.com/software/BeautifulSoup/documentation.html
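
As a rough sketch of what that could look like: the snippet below parses
sloppy HTML, adds a class to every tag of one type, and also pulls out the
plain text. It assumes the current bs4 package (Beautiful Soup 4) rather than
the release that was current when this was posted, and the function name,
the sample markup and the "note" class are made up for illustration.

```python
from bs4 import BeautifulSoup


def add_class_and_extract_text(html, tag_name, class_name):
    """Add class_name to every <tag_name> element; return (new_html, text).

    Hypothetical helper for illustration, not part of Beautiful Soup itself.
    """
    # "html.parser" is the standard-library backend, so no extra parser
    # package is needed; it tolerates unclosed tags and unquoted attributes.
    soup = BeautifulSoup(html, "html.parser")

    for tag in soup.find_all(tag_name):
        # bs4 exposes "class" as a list of names (or nothing if it is unset).
        classes = list(tag.get("class", []))
        if class_name not in classes:
            tag["class"] = classes + [class_name]

    # str(soup) keeps the markup; get_text() drops it and keeps the text.
    return str(soup), soup.get_text(" ", strip=True)


# Deliberately broken input: unclosed <p> and <b>, unquoted attribute value.
sample = "<p>Hello <b>world<p class=x>Second"
new_html, text = add_class_and_extract_text(sample, "p", "note")
print(new_html)
print(text)
```

Unlike dumping through lynx, the markup survives, so the same soup object
can be written back out after the attributes have been changed.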

-Jay
-- 
http://mail.python.org/mailman/listinfo/python-list
