On Tue, 12 Oct 2010 06:28:23 -0700 (PDT) jmfauth <wxjmfa...@gmail.com> wrote:
> I hope my understanding is correct and I'm not dreaming.
>
> When an endianess is not specified, (BE, LE, unmarked forms),
> the Unicode Consortium specifies, the default byte serialization
> should be big-endian.
> [...]
>
> It appears Python is just working in the opposite way.
> [...]
>
> >>> repr(u'abc'.encode('utf-16')[2:]) == repr(u'abc'.encode('utf-16-le'))
> True

Python uses the host's endianness by default. So, on a little-endian
machine, utf-16 and utf-32 will use little-endian encoding.

While decoding, though, the BOM is read by both of these codecs, so
there should be no interoperability problems:

>>> '\xff\xfea\x00b\x00c\x00'.decode('utf-16')
u'abc'
>>> '\xfe\xff\x00a\x00b\x00c'.decode('utf-16')
u'abc'

(do note, though, that the explicit utf*-be and utf*-le variants do not
add a BOM)

Regards

Antoine.
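
The three points made in the reply (host byte order on encode, BOM honoured on decode, no BOM from the explicit variants) can be checked with a short script. This is only a sketch: it assumes Python 3 syntax (bytes literals, str.encode), whereas the session above is Python 2, and the helper name split_bom is hypothetical, introduced purely for illustration.

```python
import codecs
import sys


def split_bom(text):
    """Encode text with the generic 'utf-16' codec and return (bom, payload).

    Hypothetical helper: the first two bytes are the BOM written by the
    codec, the rest is the encoded text.
    """
    data = text.encode('utf-16')
    return data[:2], data[2:]


# Work out which BOM and which explicit codec match the host byte order.
if sys.byteorder == 'little':
    host_bom, host_codec = codecs.BOM_UTF16_LE, 'utf-16-le'
else:
    host_bom, host_codec = codecs.BOM_UTF16_BE, 'utf-16-be'

bom, payload = split_bom('abc')

# Encoding: the generic codec writes a BOM and uses the host's endianness.
assert bom == host_bom
assert payload == 'abc'.encode(host_codec)

# Decoding: the BOM decides the byte order, whatever the host is.
assert b'\xff\xfea\x00b\x00c\x00'.decode('utf-16') == 'abc'
assert b'\xfe\xff\x00a\x00b\x00c'.decode('utf-16') == 'abc'

# The explicit variants do not add a BOM.
assert 'abc'.encode('utf-16-le') == b'a\x00b\x00c\x00'
assert 'abc'.encode('utf-16-be') == b'\x00a\x00b\x00c'
```

The first pair of assertions is the same comparison as the quoted one-liner, with the little-endian assumption replaced by a sys.byteorder check so that it also holds on a big-endian host.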