Re: unicode, bytes redux

Fredrik Lundh Mon, 25 Sep 2006 10:56:01 -0700

willie wrote:

> Is it too ridiculous to suggest that it'd be nice
> if the unicode object were to remember the
> encoding of the string it was decoded from?
> So that it's feasible to calculate the number
> of bytes that make up the unicode code points.
> 
> # U+270C
> # 11100010 10011100 10001100
> buf = "\xE2\x9C\x8C"
> 
> u = buf.decode('UTF-8')
> 
> # ... later ...
> 
> u.bytes() -> 3
> 
> (goes through each code point and calculates
> the number of bytes that make up the character
> according to the encoding)
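
For reference, the per-code-point calculation described in the quoted text could be sketched roughly like this. This is an illustration only, written for Python 3 rather than the Python 2 used in this thread; the function name utf8_byte_count is made up here, and it hard-codes UTF-8 rather than looking up a remembered encoding.

```python
def utf8_byte_count(text):
    """Return how many bytes `text` would occupy when encoded as UTF-8."""
    total = 0
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:          # U+0000..U+007F: 1 byte
            total += 1
        elif cp < 0x800:       # U+0080..U+07FF: 2 bytes
            total += 2
        elif cp < 0x10000:     # U+0800..U+FFFF: 3 bytes
            total += 3
        else:                  # U+10000..U+10FFFF: 4 bytes
            total += 4
    return total


u = b"\xE2\x9C\x8C".decode("utf-8")    # U+270C
print(utf8_byte_count(u))              # 3
print(len(u.encode("utf-8")))          # 3, same answer by simply re-encoding
```

For well-formed text, re-encoding and taking len() gives the same number without any per-character logic.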


what about:

     buf = "\xE2\x9C\x8C"
     u = buf.decode("utf-8")
     bytes = len(buf)

     # ... later ...

     print bytes  # -> 3

or even

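     class utf8string(unicode):
         def __new__(cls, data):
             # decode the UTF-8 byte string into the unicode value
             return unicode.__new__(cls, data, "utf-8")
         def __init__(self, data):
             # remember how many bytes the encoded string had
             self.bytes = len(data)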

     buf = "\xE2\x9C\x8C"

     u = utf8string(buf)

     # ... later ...

     print repr(u)
     print u.bytes  # -> 3
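
The subclass idea carries over to Python 3 with small changes. The following is a rough sketch under that assumption, not code from this thread: the class name Utf8String and the attribute byte_length are hypothetical, and the length is stored in __new__ instead of __init__ only to keep the example short.

```python
class Utf8String(str):
    """A str that remembers the length of the UTF-8 bytes it was decoded from."""

    def __new__(cls, data):
        # str.__new__ accepts (bytes, encoding) and decodes the bytes
        obj = super().__new__(cls, data, "utf-8")
        # hypothetical attribute: size of the original encoded data
        obj.byte_length = len(data)
        return obj


buf = b"\xE2\x9C\x8C"

u = Utf8String(buf)

# ... later ...

print(repr(u))
print(u.byte_length)   # 3
```

Note that slicing or concatenating such an object gives back a plain str, so the stored count only describes the original, unmodified string.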

</F>

-- 
http://mail.python.org/mailman/listinfo/python-list
