Andrew Fong <FongAndrew <at> gmail.com> writes:
> I need to ...
> 1) Truncate long unicode (UTF-8) strings based on their length in
> BYTES.
> 2) I don't want to accidentally chop any unicode characters in half.
> If the byte truncate length would normally cut a unicode character in
> 2, then I just want to drop the whole character, not leave an orphaned
> byte.
> I'm using Python2.6, so I have access to things like bytearray.

Using bytearray saves you from using ord() but runs the risk of
accidental mutation.

> Are
> there any built-in ways to do something like this already? Or do I
> just have to iterate over the unicode string?

Converting each character to utf8 and checking the total number of
bytes so far? Ooooh, sloooowwwwww!

The whole concept of "truncating unicode" (you mean "truncating utf8")
seems rather unpythonic to me.

Another alternative is to iterate backwards over the utf8 string
looking for a character-starting byte. It leads to a candidate for
Unpythonic Code of the Year:

def utf8trunc(u8s, maxlen):
    assert maxlen >= 1
    alen = len(u8s)
    if alen <= maxlen:
        return u8s
    # Scan backwards from the last byte that would be kept.
    pos = maxlen - 1
    while pos >= 0:
        val = ord(u8s[pos])
        if val & 0xC0 != 0x80:
            # found an initial byte (anything that is not 10xxxxxx)
            break
        pos -= 1
    else:
        # no initial byte found
        raise ValueError("malformed UTF-8 [1]")
    if maxlen - pos > 4:
        raise ValueError("malformed UTF-8 [2]")
    if val & 0x80:
        # lead byte 110xxxxx -> 2 bytes, 1110xxxx -> 3, 11110xxx -> 4
        charlen = (2, 2, 3, 4)[(val >> 4) & 3]
    else:
        charlen = 1
    nextpos = pos + charlen
    assert nextpos >= maxlen
    if nextpos == maxlen:
        # the last character fits exactly, so keep it
        return u8s[:nextpos]
    # the last character would be cut in half, so drop all of it
    return u8s[:pos]

if __name__ == "__main__":
    tests = [u"", u"\u0000", u"\u007f", u"\u0080", u"\u07ff", u"\u0800", u"\uffff"]
    for testx in tests:
        test = u"abcde" + testx + u"pqrst"
        u8 = test.encode('utf8')
        print repr(test), repr(u8), len(u8)
        for mlen in range(4, 8 + len(testx.encode('utf8'))):
            u8t = utf8trunc(u8, mlen)
            print "   ", mlen, len(u8t), repr(u8t)

Tested to the extent shown. Doesn't pretend to check for all cases of
UTF-8 malformation, just easy ones :-)

Cheers,
John

--
http://mail.python.org/mailman/listinfo/python-list
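
P.S. For comparison, the slow approach mentioned above (convert each character to utf8 and keep a running byte count) might look roughly like the sketch below. The name utf8trunc_slow is made up for illustration, and the sketch assumes its argument is text (unicode in Python 2, str in Python 3), not an already-encoded byte string.

```python
def utf8trunc_slow(text, maxlen):
    """Encode one character at a time and stop before exceeding maxlen bytes.

    Hypothetical helper, not part of any library. `text` is assumed to be
    a text string; the result is a UTF-8 encoded byte string.
    """
    pieces = []
    total = 0
    for ch in text:
        encoded = ch.encode('utf-8')
        if total + len(encoded) > maxlen:
            # This character would overshoot the byte budget: drop it whole.
            break
        pieces.append(encoded)
        total += len(encoded)
    return b''.join(pieces)


if __name__ == "__main__":
    sample = u"abcde\u0800pqrst"   # \u0800 takes three bytes in UTF-8
    for mlen in range(4, 11):
        print(repr(utf8trunc_slow(sample, mlen)))
```

Because it never slices inside an encoded character, it can serve as a slow reference when checking a faster byte-level routine; for well-formed input with maxlen >= 1 the two should agree.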