Andrew Fong wrote:
> I need to ...
>
> 1) Truncate long unicode (UTF-8) strings based on their length in
> BYTES. For example, u'\u4000\u4001\u4002 abc' has a length of 7 but
> takes up 13 bytes. Since u'\u4000' takes up 3 bytes, I want
> truncate(u'\u4000\u4001\u4002 abc',3) == u'\u4000' -- as compared to
> u'\u4000\u4001\u4002 abc'[:3] == u'\u4000\u4001\u4002'.
>
> 2) I don't want to accidentally chop any unicode characters in half.
> If the byte truncate length would normally cut a unicode character in
> 2, then I just want to drop the whole character, not leave an orphaned
> byte. So truncate(u'\u4000\u4001\u4002 abc',4) == u'\u4000' ... as
> opposed to getting UnicodeDecodeError.
>
> I'm using Python2.6, so I have access to things like bytearray. Are
> there any built-in ways to do something like this already? Or do I
> just have to iterate over the unicode string?

How about

>>> u"äöü".encode("utf8")[:5].decode("utf8", "ignore")
u'\xe4\xf6'
>>> print _
äö

Peter
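
For reference, the one-liner above can be wrapped into the truncate function asked about. This is only a minimal sketch: the name truncate_utf8 is made up for illustration, and the code is written in Python 3 syntax, where every str is unicode. It encodes to UTF-8, slices the bytes, and lets the "ignore" error handler drop whatever incomplete character is left at the cut.

```python
def truncate_utf8(text, max_bytes):
    """Return the longest prefix of text whose UTF-8 encoding fits in max_bytes.

    truncate_utf8 is a hypothetical helper name, not a built-in.
    """
    encoded = text.encode("utf-8")
    # Slicing the byte string may cut the last character in half;
    # errors="ignore" drops that incomplete trailing sequence on decode
    # instead of raising UnicodeDecodeError.
    return encoded[:max_bytes].decode("utf-8", "ignore")


if __name__ == "__main__":
    sample = "\u4000\u4001\u4002 abc"    # 7 characters, 13 bytes in UTF-8
    print(len(sample), len(sample.encode("utf-8")))
    print(repr(truncate_utf8(sample, 3)))  # exactly one 3-byte character fits
    print(repr(truncate_utf8(sample, 4)))  # the 4th byte is an orphan and is dropped
```

In Python 2.6 the same expression works on unicode objects, as the session above shows. Note that "ignore" is safe here only because the bytes come straight from encode(), so the only invalid sequence can be the one at the cut.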