Steven D'Aprano wrote:

> On Mon, 25 Sep 2006 00:45:29 -0700, Paul Rubin wrote:
>
>> willie <[EMAIL PROTECTED]> writes:
>>> # U+270C
>>> # 11100010 10011100 10001100
>>> buf = "\xE2\x9C\x8C"
>>> u = buf.decode('UTF-8')
>>> # ... later ...
>>> u.bytes() -> 3
>>>
>>> (goes through each code point and calculates
>>> the number of bytes that make up the character
>>> according to the encoding)
>>
>> Duncan Booth explains why that doesn't work. But I don't see any big
>> problem with a byte count function that lets you specify an encoding:
>>
>> u = buf.decode('UTF-8')
>> # ... later ...
>> u.bytes('UTF-8') -> 3
>> u.bytes('UCS-4') -> 4
>>
>> That avoids creating a new encoded string in memory, and for some
>> encodings, avoids having to scan the unicode string to add up the
>> lengths.
>
> Unless I'm misunderstanding something, your bytes code would have to
> perform exactly the same algorithmic calculations as converting the
> encoded string in the first place, except it doesn't need to store the
> newly encoded string, merely the number of bytes of each character.
>
> Here is a bit of pseudo-code that might do what you want:
>
> def bytes(unistring, encoding):
>     length = 0
>     for c in unistring:
>         length += len(c.encode(encoding))
>     return length
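For a stateless encoding like UTF-8 that per-character sum does agree
with encoding the whole string in one go. A quick sanity check, assuming
nothing beyond the standard codecs (s is just an example string ending
in willie's U+270C):

    >>> s = u"abc\u270c"
    >>> sum(len(c.encode("utf-8")) for c in s)
    6
    >>> len(s.encode("utf-8"))
    6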
That wouldn't work for stateful encodings:

    >>> len(u"abc".encode("utf-16"))
    8
    >>> bytes(u"abc", "utf-16")
    12

Use a stateful encoder instead:

    import codecs

    def bytes(unistring, encoding):
        length = 0
        # one encoder instance keeps its state (e.g. whether the
        # BOM has already been emitted) across characters
        enc = codecs.getincrementalencoder(encoding)()
        for c in unistring:
            length += len(enc.encode(c))
        return length

Servus,
   Walter
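P.S.: A quick look at what the incremental encoder actually returns, to
show where the difference comes from. This assumes Python 2.5, where
codecs.getincrementalencoder() first appeared, and a UTF-16 codec that
writes a two-byte BOM before the first character:

    >>> import codecs
    >>> enc = codecs.getincrementalencoder("utf-16")()
    >>> len(enc.encode(u"a"))
    4
    >>> len(enc.encode(u"b"))
    2
    >>> len(enc.encode(u"c"))
    2

The first call accounts for the BOM plus the two bytes for u"a", so the
total is 8, matching len(u"abc".encode("utf-16")). Three separate calls
to c.encode("utf-16") each prepend their own BOM, which is why the naive
per-character version counts 12.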