I need to ...

1) Truncate long unicode (UTF-8) strings based on their length in BYTES. For example, u'\u4000\u4001\u4002 abc' has a length of 7 characters but takes up 13 bytes when encoded as UTF-8. Since u'\u4000' takes up 3 bytes, I want truncate(u'\u4000\u4001\u4002 abc', 3) == u'\u4000', as compared to u'\u4000\u4001\u4002 abc'[:3] == u'\u4000\u4001\u4002'.
2) I don't want to accidentally chop any unicode characters in half. If the byte truncation length would cut a multi-byte character in two, I just want to drop the whole character rather than leave orphaned bytes. So truncate(u'\u4000\u4001\u4002 abc', 4) == u'\u4000', as opposed to getting a UnicodeDecodeError.

I'm using Python 2.6, so I have access to things like bytearray. Is there a built-in way to do something like this already, or do I just have to iterate over the unicode string?

--
Andrew
--
http://mail.python.org/mailman/listinfo/python-list
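
For reference, the behaviour asked for in points 1) and 2) can be sketched as below. This is only a minimal sketch, not a confirmed built-in: it assumes the target encoding is UTF-8, that max_bytes is an integer, and that the input is a valid unicode string, so the only undecodable bytes after slicing are an incomplete character at the very end. The parameter names text and max_bytes are made up for illustration.

```python
def truncate(text, max_bytes):
    """Return the longest prefix of `text` whose UTF-8 form fits in `max_bytes`."""
    # Assumption: a zero or negative budget means "keep nothing".
    if max_bytes <= 0:
        return u''
    encoded = text.encode('utf-8')
    # Slicing the byte string may cut the last character in the middle.
    # Decoding with the 'ignore' error handler drops that incomplete
    # trailing sequence instead of raising UnicodeDecodeError.
    # The handler is passed positionally so the call also works on
    # Python versions where decode() takes no keyword arguments.
    return encoded[:max_bytes].decode('utf-8', 'ignore')


sample = u'\u4000\u4001\u4002 abc'
print(repr(truncate(sample, 3)))
print(repr(truncate(sample, 4)))
```

With the sample string from the question, truncate(sample, 3) and truncate(sample, 4) both give u'\u4000'. Note that 'ignore' would also silently discard any other invalid bytes, which is why the sketch relies on the input having been encoded from a valid unicode string in the first place.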