On Thu, 28 May 2009 08:50:00 -0700, Andrew Fong wrote:

> I need to ...
>
> 1) Truncate long unicode (UTF-8) strings based on their length in BYTES.

Out of curiosity, why do you need to do this?

> For example, u'\u4000\u4001\u4002 abc' has a length of 7 but takes up 13
> bytes.

No, that's wrong. The number of bytes depends on the encoding, it's not a property of the unicode string itself.

>>> s = u'\u4000\u4001\u4002 abc'
>>> len(s)  # characters
7
>>> len(s.encode('utf-8'))  # bytes
13
>>> len(s.encode('utf-16'))  # bytes
16
>>> len(s.encode('U32'))  # bytes
32

> Since u'\u4000' takes up 3 bytes

But it doesn't. The *encoded* unicode character *may* take up three bytes, or four, or possibly more, depending on what encoding you use.

--
Steven
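
If the byte limit really does apply to the UTF-8 encoding, one possible way to truncate without cutting a character in half is to slice the encoded bytes and let the decoder drop any incomplete trailing sequence. This is only a minimal sketch, assuming Python 3 and UTF-8; `truncate_utf8` is a hypothetical helper name and does not come from the thread.

```python
def truncate_utf8(text, max_bytes):
    """Return the longest prefix of text whose UTF-8 encoding fits in max_bytes.

    Hypothetical helper for illustration only.
    """
    if max_bytes < 0:
        raise ValueError("max_bytes must be non-negative")
    encoded = text.encode("utf-8")
    # Slicing may cut through a multi-byte character; errors="ignore"
    # discards the incomplete bytes left at the end of the slice.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")


s = "\u4000\u4001\u4002 abc"
short = truncate_utf8(s, 7)  # 7 bytes = two full characters + 1 stray byte
print(len(short), len(short.encode("utf-8")))
```

With a limit of 7 bytes, the third character would need bytes 7 through 9, so it is dropped whole and the result encodes to 6 bytes. Note that the 16 and 32 reported for 'utf-16' and 'U32' in the session include a byte order mark (2 and 4 bytes respectively); the BOM-less 'utf-16-le' and 'utf-32-le' codecs give 14 and 28 for the same string. A sketch like this works per code point, so it can still separate a base character from a following combining mark.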