I hope my understanding is correct and I'm not dreaming. When an endianness is not specified (the unmarked forms, as opposed to the BE/LE variants), the Unicode Consortium specifies that the default byte serialization should be big-endian.
See http://www.unicode.org/faq//utf_bom.html, in particular "Q: Which of the UTFs do I need to support?" and "Q: Why do some of the UTFs have a BE or LE in their label, such as UTF-16LE?" (plus the technical papers). Python appears to work the opposite way:

>>> sys.version
2.7 (r27:82525, Jul 4 2010, 09:01:59) [MSC v.1500 32 bit (Intel)]
>>> repr(u'abc'.encode('utf-16-le'))
'a\x00b\x00c\x00'
>>> repr(u'abc'.encode('utf-16-be'))
'\x00a\x00b\x00c'
>>> repr(u'abc'.encode('utf-16'))
'\xff\xfea\x00b\x00c\x00'
>>> repr(u'abc'.encode('utf-16')[2:]) == repr(u'abc'.encode('utf-16-be'))
False
>>> repr(u'abc'.encode('utf-16')[2:]) == repr(u'abc'.encode('utf-16-le'))
True

The same holds for utf-32, and for utf-16/utf-32 in Python 3.1.2. I tried to find a precise discussion of this subject and failed. Any thoughts?
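For what it's worth, a small sketch of what seems to be going on (Python 3 syntax): the plain "utf-16" codec prepends a BOM and then uses the machine's native byte order (exposed as sys.byteorder), while the "-le"/"-be" variants write no BOM and use the stated order unconditionally. On an Intel (little-endian) box like the one in the transcript above, that produces exactly the FF FE prefix and LE payload shown:

>>> import sys
>>>
>>> # "utf-16" = BOM + native byte order; "-le"/"-be" = no BOM, fixed order.
>>> data = 'abc'.encode('utf-16')
>>> bom, payload = data[:2], data[2:]
>>>
>>> if sys.byteorder == 'little':
...     assert bom == b'\xff\xfe'                      # little-endian BOM
...     assert payload == 'abc'.encode('utf-16-le')
... else:
...     assert bom == b'\xfe\xff'                      # big-endian BOM
...     assert payload == 'abc'.encode('utf-16-be')
...

So the unmarked codec is not fixed to either order; whether emitting native order behind a BOM is conformant is exactly the question.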