Re: unicode, bytes redux

Paul Rubin Mon, 25 Sep 2006 00:50:54 -0700

willie <[EMAIL PROTECTED]> writes:
> # U+270C
> # 11100010 10011100 10001100
> buf = "\xE2\x9C\x8C"
> u = buf.decode('UTF-8')
> # ... later ...
> u.bytes() -> 3
> 
> (goes through each code point and calculates
> the number of bytes that make up the character
> according to the encoding)


Duncan Booth explains why that doesn't work.  But I don't see any big
problem with a byte count function that lets you specify an encoding:

     u = buf.decode('UTF-8')
     # ... later ...
     u.bytes('UTF-8') -> 3
     u.bytes('UCS-4') -> 4

That avoids creating a new encoded string in memory, and for
fixed-width encodings such as UCS-4, avoids having to scan the Unicode
string to add up the lengths.
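
A rough sketch of what such a function might look like, written as a
standalone helper in Python 3 (where str is already a sequence of code
points). The name encoded_length is made up for illustration; no
u.bytes() method actually exists. It assumes "UCS-4" means 4 bytes per
code point with no byte order mark, and falls back to encoding for
anything it does not special-case:

```python
def encoded_length(text, encoding):
    """Hypothetical helper: bytes needed to store `text` in `encoding`."""
    name = encoding.upper().replace("_", "-")

    if name in ("UCS-4", "UTF-32"):
        # Fixed width: 4 bytes per code point, no scan of the string.
        # Assumption: no byte order mark is counted.
        return 4 * len(text)

    if name in ("UTF-8", "UTF8"):
        # Variable width: scan once and add up the per-code-point sizes,
        # without building the encoded string.
        total = 0
        for ch in text:
            cp = ord(ch)
            if cp < 0x80:
                total += 1
            elif cp < 0x800:
                total += 2
            elif cp < 0x10000:
                total += 3
            else:
                total += 4
        return total

    # Any other encoding: fall back to encoding and measuring, which does
    # create the temporary encoded string this sketch otherwise avoids.
    return len(text.encode(encoding))


u = b"\xE2\x9C\x8C".decode("UTF-8")   # U+270C
print(encoded_length(u, "UTF-8"))     # 3
print(encoded_length(u, "UCS-4"))     # 4
```

Note that Python's own "utf-32" codec prepends a BOM when encoding, so
len(u.encode("utf-32")) would come out 4 bytes larger than the fixed-width
figure above; "utf-32-be" or "utf-32-le" gives the BOM-less size.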
-- 
http://mail.python.org/mailman/listinfo/python-list
