On Fri, Mar 18, 2016 at 8:56 AM, Random832 <random...@fastmail.com> wrote:
> On Fri, Mar 18, 2016, at 03:00, Ian Kelly wrote:
>> jmf has been asked this before, and as I recall he seems to feel that
>> UTF-8 should be used for all purposes, ignoring the limitations of
>> that encoding such as that indexing becomes a O(n) operation.
>
> Just to play devil's advocate, here, why is it so bad for indexing to be
> O(n)? Some simple caching is all that's needed to prevent it from making
> iteration O(n^2), if that's what you're worried about.

What kind of caching do you have in mind? If you're just going to index
the string, then that's at least an extra byte per character, which
mostly kills the memory savings that are usually the goal of using UTF-8
in the first place.

It's not the only drawback, either. If you want to know anything about
the characters in the string that you're looking at, you need to know
their codepoints. If the string is simple UCS-2, that's easy: just take
the two bytes and read them as a 16-bit integer (assuming that the
endianness of the string matches the machine). If the string is UTF-8,
then it has to be decoded, so you need to figure out exactly how many
bytes are in this particular character, then determine which bits of
those bytes you need, and then mash those bits together to form the
actual integer codepoint.

Now think about doing that over and over again in the context of a
lexicographical sort.
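
To make the difference concrete, here is a rough sketch in Python of
the two cases described above. The function names are made up for
illustration (they are not from any library), and the UTF-8 decoder
assumes well-formed input and does no validation, so treat it as a
picture of the work involved rather than a real implementation.

```python
def ucs2_codepoint_at(buf, index, byteorder="little"):
    # Fixed width: character k lives at bytes 2k and 2k+1, so this is O(1).
    return int.from_bytes(buf[2 * index:2 * index + 2], byteorder)


def utf8_decode_at(buf, i):
    # Decode the character whose lead byte is buf[i].
    # Returns (codepoint, number_of_bytes). Assumes well-formed UTF-8.
    b0 = buf[i]
    if b0 < 0x80:                # 0xxxxxxx: plain ASCII, one byte
        return b0, 1
    elif b0 >> 5 == 0b110:       # 110xxxxx: two-byte sequence
        n, cp = 2, b0 & 0x1F
    elif b0 >> 4 == 0b1110:      # 1110xxxx: three-byte sequence
        n, cp = 3, b0 & 0x0F
    elif b0 >> 3 == 0b11110:     # 11110xxx: four-byte sequence
        n, cp = 4, b0 & 0x07
    else:
        raise ValueError("byte at offset %d is not a lead byte" % i)
    # Each continuation byte (10xxxxxx) contributes its low six bits.
    for b in buf[i + 1:i + n]:
        cp = (cp << 6) | (b & 0x3F)
    return cp, n


def utf8_codepoint_at(buf, index):
    # Variable width: with no side table, finding character k means
    # walking over every character before it, so this is O(n).
    i = 0
    for _ in range(index):
        _, n = utf8_decode_at(buf, i)
        i += n
    return utf8_decode_at(buf, i)[0]
```

Calling utf8_codepoint_at(buf, k) for every k in a loop is exactly the
O(n^2) iteration mentioned in the quoted text. Avoiding that means
remembering byte offsets somewhere, which is where the extra memory
goes.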