On Oct 30, 6:44 pm, "一首诗" <[EMAIL PROTECTED]> wrote:
> Oh, I didn't make myself clear.
>
> What I mean is how to convert a piece of HTML to plain text but keep as
> much formatting as possible.
>
> For example, convert "&nbsp;" to a blank space and <br> to "\r\n".
>

Then you can explore the HTMLParser module,
http://docs.python.org/lib/module-HTMLParser.html, for example:

#!/usr/bin/env python
from HTMLParser import HTMLParser

parsedtext = ''

class Parser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        if tag == 'br':
            global parsedtext
            # Append a real CR LF pair; '\\r\\n' would add literal backslashes.
            parsedtext += '\r\n'

    def handle_data(self, data):
        global parsedtext
        parsedtext += data

    def handle_entityref(self, name):
        if name == 'nbsp':
            # Turn &nbsp; into an ordinary space instead of dropping it.
            global parsedtext
            parsedtext += ' '

x = Parser()
x.feed('An &nbsp; text<br>')
print parsedtext
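
On Python 3 the same idea might look roughly like the sketch below. It assumes the module's Python 3 name, html.parser, and uses names of my own choosing (TextExtractor, html_to_text) that are not part of any library. With convert_charrefs=True the parser decodes entities itself, so &nbsp; reaches handle_data() as the character U+00A0 and handle_entityref() is no longer needed.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text, turning <br> into CR LF and &nbsp; into a plain space."""

    def __init__(self):
        # convert_charrefs=True makes the parser decode entities such as
        # &nbsp; into characters before handle_data() sees them.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Also reached for <br/>, via the default handle_startendtag().
        if tag == 'br':
            self.parts.append('\r\n')

    def handle_data(self, data):
        # &nbsp; arrives as U+00A0; swap it for an ordinary space.
        self.parts.append(data.replace('\xa0', ' '))

    def text(self):
        return ''.join(self.parts)


def html_to_text(markup):
    # Hypothetical helper: parse one complete fragment and return its text.
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return parser.text()


print(repr(html_to_text('An &nbsp; text<br>')))
```

Keeping the collected pieces on the instance rather than in a module-level global means each parser object can be used independently.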


> Gary Herron wrote:
> > 一首诗 wrote:
> > > Is there any simple way to solve this problem?
>
> > Yes, strings have a replace method:
>
> > >>> s = "abc&nbsp;def"
> > >>> s.replace('&nbsp;',' ')
> > 'abc def'
>
> > Also, various modules meant to deal with the web, XML and the like
> > have functions for such operations.
> 
> > Gary Herron
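
If you only need to handle entities and there are many besides &nbsp;, the replace approach quoted above can be generalised. A minimal sketch, assuming Python 3 where the standard library provides html.unescape(); the function name entities_to_text is just an illustration:

```python
import html


def entities_to_text(s):
    # html.unescape() decodes named and numeric character references.
    decoded = html.unescape(s)
    # &nbsp; decodes to U+00A0 (no-break space); make it a plain space.
    return decoded.replace('\xa0', ' ')


print(entities_to_text('abc&nbsp;def &amp; more'))
```

This does not touch tags such as <br>, so it complements the parser approach rather than replacing it.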

-- 
http://mail.python.org/mailman/listinfo/python-list
