On Tue, May 12, 2009 at 3:10 PM, <rump...@web.de> wrote:
>> You can certainly have a string type that uses byte arrays in UTF-8
>> encoding internally, but your string functions should be aware of that
>> and treat it as a unicode string. The len function and index operators
>> should count characters, not bytes. Add a byte array data type for
>> byte arrays instead.
>>
> It's not easy. I think Python3's byte arrays have an "upper" method
> (and a string literal syntax b"abc") which is quite alarming to me
> that they chose the wrong default.
I suppose that is to make it possible to use the 'bytes' data type for
text strings if you really want to (and for backwards-compatibility).
Default text strings should use Unicode (as in Python 3), and that
should be supported by the language.

> Eventually the "rope" data structure (that the compiler uses heavily)
> will become a proper part of the library: By "rope" I mean an
> immutable string implemented as a tree, so concatenation is O(1). For
> immutable strings there is no ``[]=`` operation, so using UTF-8 and
> converting it to a 32bit char works better.

Consider a string class that keeps track of its own encoding and can
change it on the fly as needed.

-- 
mar...@librador.com
http://www.librador.com
-- 
http://mail.python.org/mailman/listinfo/python-list
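
A rough Python 3 sketch of the "string class that keeps track of its own
encoding" idea, for illustration only. The name FlexString and all of its
methods are hypothetical, not an existing library, and "characters" here
means Unicode code points, which is only an approximation of what a user
sees as a character. The sketch decodes on every access, so it shows the
interface rather than an efficient implementation.

```python
class FlexString:
    """Hypothetical text type that remembers how its bytes are encoded."""

    def __init__(self, data: bytes, encoding: str = "utf-8"):
        data.decode(encoding)  # validate eagerly; raises UnicodeDecodeError on bad input
        self._data = data
        self._encoding = encoding

    @classmethod
    def from_text(cls, text: str, encoding: str = "utf-8") -> "FlexString":
        return cls(text.encode(encoding), encoding)

    @property
    def encoding(self) -> str:
        return self._encoding

    def recode(self, encoding: str) -> None:
        """Change the internal representation in place; the text stays the same."""
        self._data = str(self).encode(encoding)
        self._encoding = encoding

    def __str__(self) -> str:
        return self._data.decode(self._encoding)

    def __len__(self) -> int:
        return len(str(self))  # counts code points, not bytes

    def __getitem__(self, index: int) -> str:
        return str(self)[index]  # indexes code points, not bytes

    def byte_length(self) -> int:
        return len(self._data)


s = FlexString.from_text("na\u00efve")  # five code points, one of them non-ASCII
print(len(s), s.byte_length())          # 5 6  (UTF-8 needs two bytes for U+00EF)
s.recode("utf-32-le")                   # fixed width: four bytes per code point
print(len(s), s.byte_length())          # 5 20
print(s[2] == "\u00ef")                 # True
```

The length and index results stay the same across recode(); only the
byte count changes, which is the separation between text and bytes that
the thread is arguing about.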