Not sure if this is sufficient for what you need, but how about

import re
re.sub(u'[\s\xa0]+', ' ', s)

That should replace all occurrences of one or more whitespace or \xa0
characters with a single space.

Remco

On Jan 19, 2008 12:38 PM, John Machin <[EMAIL PROTECTED]> wrote:
> I'm trying to recover the original data from some HTML written by a
> well-known application.
>
> Here are three original data items, in Python repr() format, with
> spaces changed to tildes for clarity:
>
> u'Saturday,~19~January~2008'
> u'Line1\nLine2\nLine3'
> u'foonly~frabjous\xa0farnarklingliness'
>
> Here is the HTML, with spaces changed to tildes, angle brackets
> changed to square brackets, omitting \r\n from the end of each line,
> and stripping a large number of attributes from the [td] tags.
>
> ~~[td]Saturday,~19
> ~~January~2008[/td]
> ~~[td]Line1[br]
> ~~~~Line2[br]
> ~~~~Line3[/td]
> ~~[td]foonly
> ~~frabjous&nbsp;farnarklingliness[/td]
>
> Here are the results of feeding it to ElementSoup:
>
> >>> import ElementSoup as ES
> >>> elem = ES.parse('ws_soup1.htm')
> >>> from pprint import pprint as pp
> >>> pp([(e.tag, e.text, e.tail) for e in elem.getiterator()])
> [snip]
> (u'td', u'Saturday, 19\n January 2008', u'\n'),
> (u'td', u'Line1', u'\n'),
> (u'br', None, u'\n Line2'),
> (u'br', None, u'\n Line3'),
> (u'td', u'foonly\n frabjous\xa0farnarklingliness', u'\n')]
>
> I'm happy enough with reassembling the second item. The problem is in
> reliably and correctly collapsing the whitespace in each of the above
> five elements. The standard Python idiom of u' '.join(text.split())
> won't work because the text is Unicode and u'\xa0' is whitespace
> and would be converted to a space.
>
> Should whitespace collapsing be done earlier? Note that BeautifulSoup
> leaves it as &nbsp; -- ES does the conversion to \xa0 ...
>
> Does anyone know of an html_collapse_whitespace() for Python? Am I
> missing something obvious?
>
> Thanks in advance,
> John
> --
> http://mail.python.org/mailman/listinfo/python-list
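
If the \xa0 needs to survive rather than be turned into a space, as the
question above seems to want, something along these lines might do. It is
only a sketch: collapse_html_whitespace is a made-up name, not an existing
library function, and it assumes that only ASCII whitespace (space, tab,
newline, carriage return, form feed, vertical tab) should be collapsed.

```python
import re

# ASCII whitespace only. The no-break space (U+00A0) is deliberately left
# out of the class. An explicit class is used instead of \s because \s on
# a Python 3 str pattern also matches U+00A0.
_ASCII_WS = re.compile(u'[ \t\n\r\f\v]+')


def collapse_html_whitespace(text):
    """Hypothetical helper: collapse each run of ASCII whitespace to one
    space, leaving no-break spaces untouched."""
    if text is None:
        # e.text and e.tail can be None, as in the [br] rows quoted above
        return None
    return _ASCII_WS.sub(u' ', text)


# Example call on text shaped like the third [td] above
cleaned = collapse_html_whitespace(u'foonly\n  frabjous\xa0farnarklingliness')
```

Leading and trailing spaces are not stripped here; whether to strip them
probably depends on how the text and tail pieces get joined back together.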