Re: Excess whitespace in my soup

2008-01-20 Thread John Machin
Remco Gerlich wrote: > Not sure if this is sufficient for what you need, but how about > > import re > re.sub(u'[\s\xa0]+', ' ', s) > > That should replace all occurrences of 1 or more whitespace or \xa0 > characters by a single space. > It does indeed, and so does re.sub(u'\s\+', ' ', s) beca

Re: Excess whitespace in my soup

2008-01-20 Thread John Machin
Stefan Behnel wrote: > John Machin wrote: > >> On Jan 19, 11:00 pm, Fredrik Lundh <[EMAIL PROTECTED]> wrote: >> >>> John Machin wrote: >>> I'm happy enough with reassembling the second item. The problem is in reliably and correctly collapsing the whitespace in each of the

Re: Excess whitespace in my soup

2008-01-19 Thread Stefan Behnel
John Machin wrote: > On Jan 19, 11:00 pm, Fredrik Lundh <[EMAIL PROTECTED]> wrote: >> John Machin wrote: >>> I'm happy enough with reassembling the second item. The problem is in >>> reliably and correctly collapsing the whitespace in each of the above >>> five elements. The standard Python idiom

Re: Excess whitespace in my soup

2008-01-19 Thread John Machin
On Jan 19, 11:00 pm, Fredrik Lundh <[EMAIL PROTECTED]> wrote: > John Machin wrote: > > I'm happy enough with reassembling the second item. The problem is in > > reliably and correctly collapsing the whitespace in each of the above > > five elements. The standard Python idiom of u' '.join(text.sp

Re: Excess whitespace in my soup

2008-01-19 Thread Remco Gerlich
Not sure if this is sufficient for what you need, but how about import re re.sub(u'[\s\xa0]+', ' ', s) That should replace all occurrences of 1 or more whitespace or \xa0 characters by a single space. Remco On Jan 19, 2008 12:38 PM, John Machin <[EMAIL PROTECTED]> wrote: > I'm trying to recove
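A minimal sketch of the substitution suggested in the message above, written for Python 3 (an assumption; the thread dates from 2008 and uses Python 2 u'' literals). The helper name collapse_whitespace is hypothetical. Note that this approach also turns \xa0 into an ordinary space, which may or may not be what the original poster wants.

```python
import re

# Pattern taken from the message above: a run of whitespace and/or NBSP.
_WS_OR_NBSP = re.compile(r'[\s\xa0]+')

def collapse_whitespace(s):
    """Replace each run of whitespace or \\xa0 characters with one space.

    Hypothetical helper name, used only for illustration.
    """
    return _WS_OR_NBSP.sub(' ', s)

print(repr(collapse_whitespace('Line1\n  Line2\xa0\xa0Line3')))
```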

Re: Excess whitespace in my soup

2008-01-19 Thread Fredrik Lundh
John Machin wrote: > I'm happy enough with reassembling the second item. The problem is in > reliably and correctly collapsing the whitespace in each of the above > five elements. The standard Python idiom of u' '.join(text.split()) > won't work because the text is Unicode and u'\xa0' is whitesp
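The behaviour described in the quoted text can be checked directly. The sketch below assumes Python 3, where str.split() with no arguments also treats \xa0 as whitespace. collapse_ascii_whitespace is a hypothetical helper that limits collapsing to the ASCII whitespace characters so that the no-break space survives; that preserving it is the goal is an assumption based on the third sample item in the original post.

```python
import re

text = 'foonly  frabjous\xa0farnarklingliness'

# split() with no arguments splits on every Unicode whitespace character,
# so the no-break space is lost when the pieces are rejoined.
naive = ' '.join(text.split())

# Only the six ASCII whitespace characters; \xa0 is deliberately excluded.
_ASCII_WS_CHARS = ' \t\n\r\f\v'
_ASCII_WS = re.compile('[' + _ASCII_WS_CHARS + ']+')

def collapse_ascii_whitespace(s):
    """Collapse runs of ASCII whitespace to one space and trim the ends.

    Hypothetical helper name, used only for illustration.
    """
    return _ASCII_WS.sub(' ', s).strip(_ASCII_WS_CHARS)

print(repr(naive))
print(repr(collapse_ascii_whitespace(text)))
```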

Excess whitespace in my soup

2008-01-19 Thread John Machin
I'm trying to recover the original data from some HTML written by a well-known application. Here are three original data items, in Python repr() format, with spaces changed to tildes for clarity:

u'Saturday,~19~January~2008'
u'Line1\nLine2\nLine3'
u'foonly~frabjous\xa0farnarklingliness'

Here is