[EMAIL PROTECTED] wrote:

> I have a problem when I'm processing unicode strings. Is it possible
> to get the 8bit-string representation of any unicode string?
>
> Suppose I get a unicode string:
> a = u'\xc8\xce\xcf\xcd\xc6\xeb';
> then, by
> a.encode('latin-1');
> I can get the 8bit-string representation of it, that is, the physical
> storage format of this string.
>
> But for another kind of unicode string, say:
> b = u'\u4efb\u8d24\u9f50';
> I have to:
> b.encode('utf-8')
> to get the 8bit-string format of it.
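for reference, here is a minimal sketch of what those two calls produce, and of what happens if you try latin-1 on the second string. it assumes Python 3.3 or later (where the u prefix is accepted and encode() returns bytes); the names a_latin1, a_utf8, b_utf8 and b_fits_latin1 are mine, not from the original post.

```python
# the two strings from the question
a = u'\xc8\xce\xcf\xcd\xc6\xeb'
b = u'\u4efb\u8d24\u9f50'

# latin-1 maps code points 0-255 straight to byte values 0-255,
# so every character in a fits in exactly one byte
a_latin1 = a.encode('latin-1')

# the characters in b lie far above 255, so latin-1 cannot encode them
try:
    b.encode('latin-1')
    b_fits_latin1 = True
except UnicodeEncodeError:
    b_fits_latin1 = False

# utf-8 can encode both strings, using 1 to 4 bytes per character
a_utf8 = a.encode('utf-8')
b_utf8 = b.encode('utf-8')

print(len(a), len(a_latin1), len(a_utf8))
print(len(b), b_fits_latin1, len(b_utf8))
```

note that the same string a comes out as different byte sequences depending on the encoding, so neither result is "the" physical storage format of the string.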
latin-1 and utf-8 are two different 8-bit representations (encodings)
of Unicode. latin-1 only covers the first 256 code points, which is
why it works for a but not for b; utf-8 can encode every code point.

> Since these unicode strings are given by an external library function,
> I don't know which kind a unicode string belongs to before I get it at
> runtime. So, I wonder if there is a unified way to get the 8bit-string
> representation, say, byte-by-byte, of any unicode string?

since the Unicode character set contains 1.1 million code points, and
a single byte can contain 256 different values, it should be fairly
obvious that there's no "8 bit byte by byte" representation of a
Unicode string. you need to decide what 8-bit encoding to use, and
stick to that.

</F>
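one way to "decide and stick to it" could look like the sketch below. the choice of utf-8 and the helper names to_bytes and from_bytes are assumptions made for illustration, not something the external library prescribes; the code assumes Python 3.3 or later.

```python
# hypothetical choice: use utf-8 for everything that leaves the program
ENCODING = 'utf-8'

def to_bytes(text):
    """Encode any unicode string with the one encoding we picked."""
    return text.encode(ENCODING)

def from_bytes(data):
    """Decode bytes that were produced by to_bytes()."""
    return data.decode(ENCODING)

# the Unicode code space runs from U+0000 to U+10FFFF
code_points = 0x10FFFF + 1   # roughly 1.1 million
byte_values = 2 ** 8         # 256 values per byte

# both strings from the question survive a round trip unchanged
samples = [u'\xc8\xce\xcf\xcd\xc6\xeb', u'\u4efb\u8d24\u9f50']
round_trips = [from_bytes(to_bytes(s)) == s for s in samples]

print(code_points, byte_values, round_trips)
```

since utf-8 can represent every code point, the caller never has to know in advance which "kind" of string the library returned. any other encoding that covers the characters you expect would work as well, as long as both the writing and the reading side agree on it.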