Steve D'Aprano <steve+pyt...@pearwood.info>:

> On Fri, 14 Jul 2017 11:31 pm, Marko Rauhamaa wrote:
>> Of course, UTF-8 in a bytes object doesn't make the situation any
>> better, but does it make it any worse?
>
> Sure it does. You want the human reader to be able to predict the
> number of graphemes ("characters") by sight. Okay, here's a string in
> UTF-8, in bytes:
>
>    e288b4c39fcf89e289a0d096e280b0e282ac78e2889e
>
> How do you expect the human reader to predict the number of graphemes
> from a UTF-8 hex string?
>
> For the record, that's 44 hex digits or 22 bytes, to encode 9
> graphemes. That's an average of 2.44 bytes per grapheme. Would you
> expect the average programmer to be able to predict where the
> grapheme breaks are?
>
>> As it stands, we have
>>
>>    è --[encode>-- Unicode --[reencode>-- UTF-8
>
> I can't even work out what you're trying to say here.

I can tell, yet that doesn't prevent you from dismissing what I'm
saying.
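For what it's worth, a plain Python 3 session is enough to see what
that hex dump contains; a minimal sketch, using nothing beyond
bytes.fromhex(), .decode() and .encode() (each of the nine code points
here also happens to be a single grapheme):

    >>> raw = bytes.fromhex("e288b4c39fcf89e289a0d096e280b0e282ac78e2889e")
    >>> text = raw.decode("utf-8")      # bytes --[decode]--> str
    >>> text
    '∴ßω≠Ж‰€x∞'
    >>> len(raw), len(text)             # 22 bytes, 9 code points
    (22, 9)
    >>> "è".encode("utf-8")             # è --[encode]--> UTF-8 bytes
    b'\xc3\xa8'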
>> Why is one encoding format better than the other?
>
> It depends on what you're trying to do.
>
> If you want to minimize storage and transmission costs, and don't
> care about random access into the string, then UTF-8 is likely the
> best encoding, since it uses as little as one byte per code point,
> and in practice with real-world text (at least for Europeans) it is
> rarely more expensive than the alternatives.

Python 3's strings don't give me any better random access than UTF-8.
Storage and transmission costs are not an issue. It's only that
storage and transmission are still defined in terms of bytes.
Python 3's strings force you to encode/decode between strings and
bytes for a yet-to-be-specified advantage.

> It also has the advantage of being backwards compatible with ASCII,
> so legacy applications that assume all characters are a single byte
> will work if you use UTF-8 and limit yourself to the ASCII-compatible
> subset of Unicode.

UTF-8 is perfectly backward-compatible with ASCII.

> The disadvantage is that each code point can be one, two, three or
> four bytes wide, and naively shuffling bytes around will invariably
> give you invalid UTF-8 and cause data loss. So UTF-8 is not so good
> as the in-memory representation of text strings.

The in-memory representation is not an issue. It's the abstract
semantics that are the issue.

At the abstract level, we have the text in a human language. Neither
strings nor UTF-8 provides that, so we have to settle for something
cruder. I have yet to hear why a string does a better job than UTF-8.

> If you have lots of memory, then UTF-32 is the best for in-memory
> representation, because it's a fixed-width encoding and parsing it is
> simple. Every code point is just four bytes and you can easily
> implement random access into the string.

The in-memory representation is not an issue. It's the abstract
semantics that are the issue.

> If you want a reasonable compromise, UTF-16 is quite decent. If
> you're willing to limit yourself to the first 2**16 code points of
> Unicode, you can even pretend that it's a fixed-width encoding like
> UTF-32.

UTF-16 (used by Windows and Java, for example) is even worse than
strings and UTF-8 because:

   è --[encode>-- Unicode --[reencode>-- UTF-16 --[reencode>-- bytes

(A short sketch at the end of this message illustrates the extra hop.)

> If you have to survive transmission through machines that require
> 7-bit clean bytes, then UTF-7 is the best encoding to use.

I don't know why that is coming into this discussion.

So no raison d'être has yet been offered for strings.


Marko
-- 
https://mail.python.org/mailman/listinfo/python-list
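A footnote on the UTF-16 remark above, as a minimal Python 3 sketch
(CPython 3.3 or later, where str indexes code points rather than code
units): as long as you stay inside the Basic Multilingual Plane a
UTF-16 code unit looks like a code point, but anything above U+FFFF
turns into a surrogate pair, so the "pretend it's fixed width"
shortcut breaks down.

    >>> "è".encode("utf-16-le")          # è --> Unicode --> UTF-16 --> bytes
    b'\xe8\x00'
    >>> s = "\U0001d11e"                 # U+1D11E, MUSICAL SYMBOL G CLEF (outside the BMP)
    >>> len(s)                           # one code point in a Python 3 str
    1
    >>> s.encode("utf-16-le")            # a surrogate pair: two UTF-16 code units
    b'4\xd8\x1e\xdd'
    >>> len(s.encode("utf-16-le")) // 2  # what a UTF-16-based API counts
    2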