Most pythonic way to truncate unicode?

Andrew Fong Thu, 28 May 2009 08:51:23 -0700

I need to ...

1) Truncate long unicode strings based on their length in BYTES when
encoded as UTF-8. For example, u'\u4000\u4001\u4002 abc' has a length
of 7 but takes up 13 bytes. Since u'\u4000' takes up 3 bytes, I want
truncate(u'\u4000\u4001\u4002 abc', 3) == u'\u4000' -- as compared to
u'\u4000\u4001\u4002 abc'[:3] == u'\u4000\u4001\u4002'.


2) I don't want to accidentally chop any unicode characters in half.
If the byte limit would otherwise cut a unicode character in two, then
I just want to drop the whole character, not leave an orphaned byte.
So truncate(u'\u4000\u4001\u4002 abc', 4) == u'\u4000' ... as opposed
to getting a UnicodeDecodeError.
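
To make the two requirements concrete, here is a small illustration of
the problem itself. The variable names are only for this example; it
shows the character count versus the UTF-8 byte count, and what happens
when the encoded bytes are cut at byte 4 and decoded strictly.

```python
s = u'\u4000\u4001\u4002 abc'
encoded = s.encode('utf-8')

char_count = len(s)        # number of characters in the unicode string
byte_count = len(encoded)  # number of bytes once encoded as UTF-8

# Cutting at byte 4 keeps all of u'\u4000' plus only the first byte of
# u'\u4001', so a strict decode of the slice is expected to fail.
try:
    encoded[:4].decode('utf-8')
    strict_decode_failed = False
except UnicodeDecodeError:
    strict_decode_failed = True
```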

I'm using Python 2.6, so I have access to things like bytearray. Are
there any built-in ways to do something like this already? Or do I
just have to iterate over the unicode string?
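
One possible approach, offered as a sketch rather than a confirmed
answer: encode, slice the bytes, and decode with the 'ignore' error
handler so a trailing partial character is dropped. The name truncate
and its signature follow the examples above; the encoding parameter is
an added assumption, and max_bytes is assumed to be non-negative.

```python
def truncate(text, max_bytes, encoding='utf-8'):
    """Return the longest prefix of text that fits in max_bytes when encoded.

    Sketch only: assumes max_bytes >= 0 and that text encodes cleanly.
    """
    encoded = text.encode(encoding)
    # The input was valid before slicing, so the only thing 'ignore' can
    # discard here is an incomplete character at the end of the slice.
    return encoded[:max_bytes].decode(encoding, 'ignore')


s = u'\u4000\u4001\u4002 abc'
fits_exactly = truncate(s, 3)   # limit lands on a character boundary
cut_mid_char = truncate(s, 4)   # limit lands inside u'\u4001'
```

This avoids iterating over the unicode string, at the cost of encoding
the whole string even when only a short prefix is kept.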

-- Andrew
