On Mon, 25 Sep 2006 00:45:29 -0700, Paul Rubin wrote:

> willie <[EMAIL PROTECTED]> writes:
>> # U+270C
>> # 11100010 10011100 10001100
>> buf = "\xE2\x9C\x8C"
>> u = buf.decode('UTF-8')
>> # ... later ...
>> u.bytes() -> 3
>>
>> (goes through each code point and calculates
>> the number of bytes that make up the character
>> according to the encoding)
>
> Duncan Booth explains why that doesn't work.  But I don't see any big
> problem with a byte count function that lets you specify an encoding:
>
>     u = buf.decode('UTF-8')
>     # ... later ...
>     u.bytes('UTF-8') -> 3
>     u.bytes('UCS-4') -> 4
>
> That avoids creating a new encoded string in memory, and for some
> encodings, avoids having to scan the unicode string to add up the
> lengths.
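To make the quoted suggestion concrete, here is a rough sketch of what such a byte count might compute, written as a free function for a current Python 3 (where str is the unicode type). The name byte_count is made up for illustration; no such method exists on Python strings. I use UTF-32-BE as a stand-in for BOM-less UCS-4, and I assume the text contains no lone surrogates (a real encode() would reject those). Only two encodings get a shortcut; everything else falls back to encoding and measuring.

```python
def byte_count(text, encoding):
    """Hypothetical helper: number of bytes `text` would occupy in `encoding`.

    Assumes `text` is well-formed (no lone surrogates).
    """
    enc = encoding.lower().replace("_", "-")

    if enc in ("utf-32-be", "utf-32-le"):
        # Fixed width, no BOM: 4 bytes per code point, no scan needed.
        return 4 * len(text)

    if enc in ("utf-8", "utf8"):
        # Variable width: scan the code points and add up their lengths.
        total = 0
        for ch in text:
            cp = ord(ch)
            if cp < 0x80:
                total += 1
            elif cp < 0x800:
                total += 2
            elif cp < 0x10000:
                total += 3
            else:
                total += 4
        return total

    # Any other encoding: just encode it and measure the result.
    return len(text.encode(encoding))


buf = b"\xE2\x9C\x8C"                # U+270C encoded as UTF-8
u = buf.decode("UTF-8")
print(byte_count(u, "UTF-8"))        # 3
print(byte_count(u, "UTF-32-BE"))    # 4
```

Note that the UTF-8 branch restates rules the codec already implements, and only the fixed-width case gets away without a scan.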

Unless I'm misunderstanding something, your bytes code would have to
perform exactly the same algorithmic calculations as converting the
encoded string in the first place, except it doesn't need to store the
newly encoded string, merely the number of bytes of each character.

Here is a bit of pseudo-code that might do what you want:

    def bytes(unistring, encoding):
        length = 0
        for c in unistring:
            length += len(c.encode(encoding))
        return length

At the cost of some speed, you can avoid storing the entire encoded
string in memory, which might be what you want if you are dealing with
truly enormous unicode strings.

Alternatively, instead of calling encode() on each character, you can
write a function (presumably in C for speed) that does the exact same
thing as encode, but without storing the encoded characters, merely
adding their lengths. Now you have code duplication, which is usually a
bad idea. If for no other reason, some poor schmuck has to maintain them
both! (And I bet it won't be Willie, for all his enthusiasm for the
idea.)

This whole question seems to me like an awful example of premature
optimization. Your computer has probably got well in excess of 100MB,
and you're worried about duplicating a few hundred or thousand (or even
hundred thousand) bytes for a few milliseconds (just long enough to grab
the length)?

--
Steven D'Aprano

--
http://mail.python.org/mailman/listinfo/python-list