willie wrote:

> Is it too ridiculous to suggest that it'd be nice
> if the unicode object were to remember the
> encoding of the string it was decoded from?
> So that it's feasible to calculate the number
> of bytes that make up the unicode code points.
>
> # U+270C
> # 11100010 10011100 10001100
> buf = "\xE2\x9C\x8C"
>
> u = buf.decode('UTF-8')
>
> # ... later ...
>
> u.bytes() -> 3
>
> (goes through each code point and calculates
> the number of bytes that make up the character
> according to the encoding)

what about:

    buf = "\xE2\x9C\x8C"
    u = buf.decode("utf-8")
    bytes = len(buf)  # remember the byte count yourself

    # ... later ...

    print bytes  # -> 3

or even

    class utf8string(unicode):
        def __new__(cls, data):
            # decode the UTF-8 byte string into a unicode object
            return unicode.__new__(cls, data, "utf-8")
        def __init__(self, data):
            # data is still the original byte string here
            self.bytes = len(data)

    buf = "\xE2\x9C\x8C"

    u = utf8string(buf)

    # ... later ...

    print repr(u)
    print u.bytes  # -> 3

</F>
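
If the original byte string is no longer around, the per-code-point count described in the quoted message can also be had by re-encoding on demand. This is only a minimal sketch: the helper names `encoded_length` and `bytes_per_code_point` are made up for illustration, and it assumes the caller knows which encoding to count in, since the unicode object itself does not remember one. It is written to run on Python 3 as well as the 2.x used in this thread.

```python
def encoded_length(text, encoding="utf-8"):
    """Return the number of bytes `text` occupies when encoded."""
    return len(text.encode(encoding))


def bytes_per_code_point(text, encoding="utf-8"):
    """Return a list with the encoded size of each character in `text`."""
    return [len(ch.encode(encoding)) for ch in text]


u = u"\u270c"  # U+270C, the character from the quoted example

print(encoded_length(u))               # 3 bytes in UTF-8
print(encoded_length(u, "utf-16-le"))  # 2 bytes in UTF-16 (no BOM)
print(bytes_per_code_point(u"a\u270c"))
```

Note that the per-character helper only makes sense for codecs that do not emit a prefix on every call; plain "utf-16", for instance, writes a byte order mark each time it encodes, which would inflate the counts.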