robin <[EMAIL PROTECTED]> wrote:
> hi,
> i remember seeing this simple python function which would take raw html
> and output the content (body?) of the page as plain text (no <..> tags
> etc)
> i have been looking at htmllib and htmlparser but this all seems too
> complicated for what i'm looking for. i just need the main text in the
> body of some arbitrary webpage to then do some natural-language
> processing with it...
> thanks for pointing me to some helpful resources!

import re
text = re.sub(r'(?s)<.+?>', '', html_text)
(this will keep HTML entities such as &amp;, though)
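
If you also want the entities decoded, here is a minimal sketch of the same regex approach with one extra step. It assumes Python 3.4 or later (for html.unescape); the function name strip_tags is made up for illustration. It is not a real HTML parser, so the contents of <script> and <style> elements will remain in the output.

```python
import re
from html import unescape  # assumption: Python 3.4+


def strip_tags(html_text):
    """Hypothetical helper: drop anything tag-like, then decode entities."""
    # (?s) makes '.' match newlines, so tags that span lines are removed too.
    no_tags = re.sub(r'(?s)<.+?>', '', html_text)
    # Turn entities such as &amp; or &lt; back into plain characters.
    return unescape(no_tags)


print(strip_tags('<p>Fish &amp; <b>chips</b></p>'))
```

Decoding happens after the tags are stripped, so an escaped "&lt;b&gt;" in the page text is kept as literal text rather than being removed as a tag.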

-- 
 -----------------------------------------------------------
| Radovan Garabík http://kassiopeia.juls.savba.sk/~garabik/ |
| __..--^^^--..__    garabik @ kassiopeia.juls.savba.sk     |
 -----------------------------------------------------------
Antivirus alert: file .signature infected by signature virus.
Hi! I'm a signature virus! Copy me into your signature file to help me spread!
-- 
http://mail.python.org/mailman/listinfo/python-list