Ian Kelly writes:

> On Thu, Mar 17, 2016 at 1:21 PM, Rick Johnson
> <rantingrickjohn...@gmail.com> wrote:
>> In the event that i change my mind about Unicode, and/or for
>> the sake of others, who may want to know, please provide a
>> list of languages that *YOU* think handle Unicode better than
>> Python, starting with the best first. Thanks.
>
> jmf has been asked this before, and as I recall he seems to feel that
> UTF-8 should be used for all purposes, ignoring the limitations of
> that encoding such as that indexing becomes a O(n) operation. He has
> pointed at Go as an example of a language wherein Unicode "just
> works", although I think that others do not necessarily agree [1].
...
> [1] https://coderwall.com/p/k7zvyg/dealing-with-unicode-in-go

I think Julia's way of dealing with its strings-as-UTF-8 [2] is more
promising. Indexing is by bytes (1-based in Julia) but the value at a
valid index is the whole UTF-8 character at that point, and an invalid
index raises an exception.

The letters "ö" and "ä" are two bytes each in UTF-8.

   julia> s = "myöhä"
   "myöhä"

   julia> s[3]
   'ö'

   julia> s[4]
   ERROR: UnicodeError: invalid character index
    in next at ./unicode/utf8.jl:65
    in getindex at strings/basic.jl:37

   julia> s[5]
   'h'

Julia provides access to the next character at an index and the valid
index after that:

   julia> next(s, 3)
   ('ö',5)

The last valid index:

   julia> endof(s)
   6

Special syntax to index at the end of a string:

   julia> s[end - 1:end]
   "hä"

That's not quite right: end - 1 is a valid index here only because the
penultimate character happened to be one byte, so it worked. At least
incorrect indexing results in an exception rather than an incorrect
value. There is a proper method to get a previous valid index - I
should have used that.

Also, the length of a string is the number of characters rather than
bytes, decoupled from the indexing.

   julia> length("myöhä")
   5

I work with text all the time, but I don't think I ever _need_
arbitrary access to an nth character. What I require is access to the
start and end of a string, searching, and splitting. These all seem
compatible with using UTF-8 representations. Same with iterating over
the string (forward or backward).

Just in case: I've been quite happy with Unicode in Python 3. It's
just interesting to see a different way that also seems to work.

[2] http://docs.julialang.org/en/release-0.4/manual/strings/
-- 
https://mail.python.org/mailman/listinfo/python-list
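
For readers on this list who would rather poke at the idea in Python,
here is a rough sketch of the byte-indexed behaviour shown in the Julia
session above. It is only an illustration under stated assumptions: the
class name Utf8Str and its methods next and endof are hypothetical names
chosen to mirror the Julia 0.4 calls, not anything from Python's
standard library or from Julia's actual implementation.

```python
class Utf8Str:
    """Hypothetical byte-indexed UTF-8 string (1-based, like the Julia example)."""

    def __init__(self, text):
        self.data = text.encode("utf-8")

    def _is_start(self, i):
        # Continuation bytes in UTF-8 have the bit pattern 10xxxxxx.
        return 1 <= i <= len(self.data) and (self.data[i - 1] & 0xC0) != 0x80

    def next(self, i):
        """Return (character at byte index i, next valid byte index)."""
        if not self._is_start(i):
            raise IndexError("invalid character index %d" % i)
        j = i + 1
        while j <= len(self.data) and (self.data[j - 1] & 0xC0) == 0x80:
            j += 1
        return self.data[i - 1:j - 1].decode("utf-8"), j

    def __getitem__(self, i):
        # The whole character at a valid index; IndexError otherwise.
        return self.next(i)[0]

    def endof(self):
        """Last valid byte index (0 for an empty string)."""
        i = len(self.data)
        while i > 0 and not self._is_start(i):
            i -= 1
        return i

    def __len__(self):
        # Number of characters, decoupled from the byte indexing.
        return sum(1 for b in self.data if (b & 0xC0) != 0x80)

    def __iter__(self):
        # Forward iteration never needs an arbitrary nth character.
        i = 1
        while i <= len(self.data):
            ch, i = self.next(i)
            yield ch


s = Utf8Str("myöhä")
print(s[3], s[5], s.next(3), s.endof(), len(s))  # ö h ('ö', 5) 6 5
```

As in the Julia session, s[4] raises an exception here (IndexError)
because byte 4 is the second byte of "ö". Lookup at a known byte index
is constant time, while counting characters with len is a linear scan.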