willie <[EMAIL PROTECTED]> writes:
> # U+270C
> # 11100010 10011100 10001100
> buf = "\xE2\x9C\x8C"
> u = buf.decode('UTF-8')
> # ... later ...
> u.bytes() -> 3
>
> (goes through each code point and calculates
> the number of bytes that make up the character
> according to the encoding)

Duncan Booth explains why that doesn't work.  But I don't see any big
problem with a byte count function that lets you specify an encoding:

    u = buf.decode('UTF-8')
    # ... later ...
    u.bytes('UTF-8') -> 3
    u.bytes('UCS-4') -> 4

That avoids creating a new encoded string in memory, and for some
encodings (fixed-width ones such as UCS-4), avoids having to scan the
unicode string to add up the lengths.
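
As a rough sketch of what such a byte count could look like, here is a version written as a plain function in present-day Python 3, where len() on a str counts code points. The names encoded_length and utf8_length are hypothetical and not part of any existing API, and the codec names 'utf-32-le' and 'utf-32-be' stand in for UCS-4 because they are fixed-width and add no byte order mark. Only those two cases skip the scan; everything else falls back to actually encoding.

```python
def utf8_length(text):
    """Count the bytes `text` would need in UTF-8 without encoding it.

    Assumes `text` contains no lone surrogates, which UTF-8 cannot encode.
    """
    total = 0
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            total += 1      # ASCII range
        elif cp < 0x800:
            total += 2
        elif cp < 0x10000:
            total += 3      # rest of the Basic Multilingual Plane
        else:
            total += 4      # supplementary planes
    return total


def encoded_length(text, encoding):
    """Hypothetical stand-in for the proposed u.bytes(encoding)."""
    name = encoding.lower().replace("_", "-")
    if name in ("utf-32-le", "utf-32-be"):
        # Fixed width, no BOM: no scan of the string is needed.
        return 4 * len(text)
    if name in ("utf-8", "utf8"):
        # Variable width: one pass over the string, but no new bytes object.
        return utf8_length(text)
    # Fallback for other codecs: encode and measure (this does allocate).
    return len(text.encode(encoding))


u = b"\xE2\x9C\x8C".decode("UTF-8")   # U+270C
print(encoded_length(u, "UTF-8"))
print(encoded_length(u, "utf-32-le"))
```

Note that the plain 'utf-32' codec prepends a 4-byte byte order mark when encoding, so a fixed-width shortcut for it would have to add 4 to the total.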