Sakcee wrote:
> Hi
>
> In one of the data files that I have, I am seeing these characters
> \xed\xa0\xa0. They seem to break the xsl.
[...]
> is this a unicode utf-16 surrogate pair ?

Yes and no. This is the UTF-8 encoding of U+D820, which is a high
surrogate code point. So yes.

It's not yet a pair; there would have to be a second such code point
(a low surrogate) right after it. So no.

Furthermore, in UTF-8, you should never ever have encoded surrogate
codes; instead, whoever generated the UTF-8 should have combined the
two surrogate code points into a single coded character, and should
have encoded *that* character. So no, this byte sequence isn't even
valid UTF-8.

> for displaying it on xml/xsl, should I extract only \xa0?

You should tell your parser to reject the file as ill-formed.

> since this is higher than 00-7f range can i just strip it?

Depending on what you want to achieve: sure! It will modify the
meaning of the bytes, of course.

> under what condition the encoding software put this string in?

If it has a bug.

Regards,
Martin
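
P.S. To check the above at the interpreter, here is a minimal sketch. It assumes Python 3, whose utf-8 codec refuses encoded surrogates by default; the sample bytes around the offending sequence and all variable names are made up for illustration.

```python
# Hypothetical sample: the three offending bytes embedded in ASCII text.
data = b"abc\xed\xa0\xa0def"
bad = b"\xed\xa0\xa0"

# A strict UTF-8 decoder rejects the sequence as ill-formed.
try:
    data.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# The 'surrogatepass' error handler lets us inspect the code point anyway.
code_point = ord(bad.decode("utf-8", "surrogatepass"))
is_high_surrogate = 0xD800 <= code_point <= 0xDBFF

# Stripping: silently drop the undecodable bytes (this changes the data).
stripped = data.decode("utf-8", "ignore")

print(strict_ok, hex(code_point), is_high_surrogate, repr(stripped))
```

Whether to reject the input or strip the bytes is a policy decision; the sketch only shows that both are possible, not which one is right for your data.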