On 03/18/2016 02:26 AM, Jussi Piitulainen wrote:
> I think Julia's way of dealing with its strings-as-UTF-8 [2] is more
> promising. Indexing is by bytes (1-based in Julia) but the value at a
> valid index is the whole UTF-8 character at that point, and an invalid
> index raises an exception.
This seems to me to be a leaky abstraction. Julia's approach is
interesting, but it strikes me as somewhat broken: it pretends to offer
O(1) indexing, but finding the nth character is in reality still O(n),
because you have to iterate through the bytes until you find, say, the
nth index that doesn't raise an exception (a sketch of what that scan
looks like is at the end of this message). Except for dealing with the
ASCII subset of UTF-8, I can't really see any time when grabbing
whatever resides at the nth byte of a UTF-8 string would be useful.

> I work with text all the time, but I don't think I ever _need_ arbitrary
> access to an nth character. What I require is access to the start and
> end of a string, searching, and splitting. These all seem compatible
> with using UTF-8 representations. Same with iterating over the string
> (forward or backward).

Indeed, this is the argument from the web site
http://utf8everywhere.org. Their argument is that individual Unicode
code points often don't make sense by themselves, so there's no point
in chopping up a Unicode string. Many Unicode strings only make sense
if you start at the beginning and read and interpret the code points
as you go. Hence UTF-8's requirement that you always start at the
beginning if you want to find the nth code point is not a burden.

I guess whether or not you need to find the nth character depends on
the strength of the language's string functions. If I searched a
string for a particular delimiter, I could see it being useful to grab
whatever is just past the delimiter, for example. Though Python's
split() method eliminates the need to do that by hand.
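For what it's worth, this is the kind of by-hand indexing I mean, next
to the split() version (the sample string is just something I made up):

    line = "name=Guido"
    # By hand: find the delimiter, then grab whatever is past it.
    i = line.index("=")
    value = line[i + 1:]           # 'Guido'
    # With split() there's no index arithmetic at all.
    key, value = line.split("=", 1)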
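And here is the byte-scanning sketch I promised above. This is Python
pretending to be Julia, so take it with a grain of salt: char_at() and
nth_codepoint() are names I invented, not Julia's actual API, but the
semantics match what Jussi described. Indexing by byte either yields
the whole character starting there or raises, and finding the nth
character still means a linear scan from the start:

    def char_at(data, i):
        """Return the whole UTF-8 character starting at byte index i,
        or raise ValueError if i lands inside a multi-byte character."""
        lead = data[i]
        if lead & 0xC0 == 0x80:          # continuation byte
            raise ValueError("invalid character index %d" % i)
        # The leading byte encodes the sequence length:
        # 0xxxxxxx -> 1 byte, 110xxxxx -> 2, 1110xxxx -> 3, 11110xxx -> 4.
        if lead < 0x80:
            length = 1
        elif lead < 0xE0:
            length = 2
        elif lead < 0xF0:
            length = 3
        else:
            length = 4
        return data[i:i + length].decode("utf-8")

    def nth_codepoint(data, n):
        """Find the nth character (0-based) by scanning from the start.
        This is the O(n) walk: skip continuation bytes, count the rest."""
        seen = -1
        for i in range(len(data)):
            if data[i] & 0xC0 != 0x80:   # start of a character
                seen += 1
                if seen == n:
                    return char_at(data, i)
        raise IndexError(n)

    s = "héllo".encode("utf-8")          # b'h\xc3\xa9llo'
    print(char_at(s, 1))                 # 'é' -- the whole character
    print(nth_codepoint(s, 3))           # 'l'
    try:
        char_at(s, 2)                    # byte 2 is inside 'é'
    except ValueError as e:
        print(e)                         # invalid character index 2

The point of the exercise: char_at() really is O(1), but that only
helps if you already have a valid byte index in hand. Getting from
"the nth character" to a byte index is still the linear scan in
nth_codepoint().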