robin <[EMAIL PROTECTED]> wrote:
> hi,
> i remember seeing this simple python function which would take raw html
> and output the content (body?) of the page as plain text (no <..> tags
> etc)
> i have been looking at htmllib and htmlparser but this all seems too
> complicated for what i'm looking for. i just need the main text in the
> body of some arbitrary webpage to then do some natural-language
> processing with it...
> thanks for pointing me to some helpful resources!

import re
text = re.sub(r'(?s)<.+?>', '', html_text)

(This strips the tags but leaves HTML entities such as &amp; undecoded.)

-- 
Radovan Garabík http://kassiopeia.juls.savba.sk/~garabik/
garabik @ kassiopeia.juls.savba.sk
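
If the leftover entities or the contents of <script> and <style> blocks get in the way of the language processing, something along these lines might work. It is only a rough sketch, assuming Python 3 and the standard-library html.parser and html modules; the names TextExtractor, html_to_text and strip_tags_regex are made up for illustration, and real-world pages may need more care than this.

```python
import re
from html import unescape
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True makes the parser decode entities like &amp;
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def html_to_text(html_text):
    """Return the visible text of html_text with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html_text)
    parser.close()
    return re.sub(r"\s+", " ", "".join(parser.chunks)).strip()


def strip_tags_regex(html_text):
    """The regex one-liner from above, plus entity decoding."""
    return unescape(re.sub(r"(?s)<.+?>", "", html_text))


if __name__ == "__main__":
    page = "<html><body><p>Fish &amp; chips</p></body></html>"
    print(html_to_text(page))
    print(strip_tags_regex(page))
```

The regex version is shorter, but it keeps any JavaScript or CSS that sits between the tags; the parser version drops those at the cost of a few more lines.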