On Thu, 28 Mar 2013 23:11:55 +1100, Neil Hodgson wrote:

> Ian Foote:
>
>> Specifically, indexing a variable-length encoding like utf-8 is not as
>> efficient as indexing a fixed-length encoding.
>
> Many common string operations do not require indexing by character
> which reduces the impact of this inefficiency.
Which common string operations do you have in mind?

Specifically in Python's case, the most obvious counter-example is the
length of a string. But that's only because Python strings are immutable
objects and include a field that records the length. So once the string
is created, checking its length takes constant time.

Some string operations need to inspect every character, e.g.
str.upper(). Even for them, the increased complexity of a variable-width
encoding costs. It's not sufficient to walk the string inspecting a
fixed 1, 2 or 4 bytes per character. You have to walk the string
grabbing 1 byte at a time, and then decide whether you need another 1, 2
or 3 bytes. Even though it's still O(N), the added bit-masking and
branching of a variable-width encoding adds to the overall cost.

Any string method that takes a starting offset has to walk the string
byte-by-byte just to find that offset. I've even seen languages put
responsibility for dealing with that onto the programmer: the "start
offset" is given in *bytes*, not characters. I don't remember what
language this was... it might have been Haskell? Whatever it was, it
horrified me.

> UTF-8 seems like a reasonable choice for an internal representation to
> me.

It's not. Haskell's text library, for example, uses a variable-width
encoding internally (UTF-16, which shares the problem), and its
documentation warns that this makes indexing O(N) instead of O(1),
precisely because of the need to walk the string inspecting the
individual code units.

Remember, when your string primitives are O(N), it is very easy to write
code that becomes O(N**2). Using UTF-8 internally is just begging for
user-written code to be O(N**2).

> One benefit of UTF-8 over Python's flexible representation is that it
> is, on average, more compact over a wide set of samples.

Sure. And over a different set of samples, it is less compact. If you
write a lot of Latin-1 text, Python will use one byte per character,
while UTF-8 will use two bytes for every character above U+007F.

-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list
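P.S. To make the byte-at-a-time walk concrete, here is a minimal sketch
of counting characters in UTF-8: grab a lead byte, then decide from its
high bits how many continuation bytes follow. This is an illustration of
the technique, not CPython's implementation, and it assumes well-formed
UTF-8 input.

```python
def utf8_char_count(data: bytes) -> int:
    """Count characters by walking UTF-8 bytes one lead byte at a time.

    Assumes valid UTF-8: every byte we land on is a lead byte.
    """
    count = 0
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:             # 0xxxxxxx: ASCII, 1 byte
            i += 1
        elif lead >> 5 == 0b110:    # 110xxxxx: 2-byte sequence
            i += 2
        elif lead >> 4 == 0b1110:   # 1110xxxx: 3-byte sequence
            i += 3
        else:                       # 11110xxx: 4-byte sequence
            i += 4
        count += 1
    return count
```

For "aé€" this walks 6 bytes (1 + 2 + 3) to report 3 characters; a
fixed-width representation just divides the byte length by the width.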
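P.P.S. The "methods that take a starting offset" point can be sketched
the same way. The function below is hypothetical, not any real
library's API: it finds the index-th character of a UTF-8 buffer, and it
can only do so by scanning every lead byte before it.

```python
def char_at(data: bytes, index: int) -> str:
    """Return the index-th character of well-formed UTF-8 bytes.

    O(index) per call: there is no way to jump straight to a character
    position without decoding every lead byte before it.
    """
    def seq_len(lead: int) -> int:
        if lead < 0x80:             # 1-byte sequence (ASCII)
            return 1
        if lead >> 5 == 0b110:      # 2-byte sequence
            return 2
        if lead >> 4 == 0b1110:     # 3-byte sequence
            return 3
        return 4                    # 4-byte sequence

    i = 0
    for _ in range(index):          # skip `index` whole characters
        i += seq_len(data[i])
    size = seq_len(data[i])
    return data[i:i + size].decode("utf-8")
```

Extracting all N characters this way costs 1 + 2 + ... + N byte
inspections: the O(N) primitive turns a linear-looking loop into
O(N**2), which is exactly the trap described above.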
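P.P.P.S. The compactness trade-off in the last paragraph is easy to
check from Python itself, using a Latin-1-heavy sample:

```python
# Every character of this sample fits in one byte under Latin-1 (and
# under Python's flexible representation, PEP 393), but "é" needs two
# bytes in UTF-8.
s = "café" * 100                      # 400 characters

print(len(s.encode("latin-1")))       # 400 bytes: one byte per character
print(len(s.encode("utf-8")))         # 500 bytes: each "é" costs two

# Pure ASCII, by contrast, is one byte per character in both.
t = "cafe" * 100
print(len(t.encode("utf-8")))         # 400 bytes
```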