> On Mon, Apr 02, 2018 at 09:39:05AM +0200, Andre Majorel wrote:
> > I wouldn't say that. UTF-8 breaks a number of assumptions. For
> > instance,
> > 1) every character has the same size,
> > 2) every byte sequence is a valid character,
> > 3) the equality or inequality of two characters comes down to
> >    the equality or inequality of the bytes they encode to.
I am sure you do not realize that none of these assumptions is really
met by any encoding, and that none of them actually brings you
anything. They are just rehashed poor arguments to rationalize a fear
of change by people afraid their long-earned knowledge will become
obsolete. I do not know if you fit in that category. Odds are you have
just been misinformed after having normal trouble during the
transition.

Darac Marjal (2018-04-03):
> If these things matter to you, it's better to convert from UTF-8 to
> Unicode, first.

"Convert to Unicode" does not mean anything. Unicode is not a format,
and therefore you cannot convert something to it.

Unicode is a catalog of "infolinguistic entities". I do not say
characters, because they are not all characters. Most of Unicode is
characters, but not all.

As it stands, the principle of Unicode is that any "pure" text can be
represented as a sequence of Unicode code points. For storing in a file
or sending over the network, this sequence of code points must be
converted into a sequence of octets. UTF-8 is by far the best choice
for that, because it has many interesting properties. Other encodings
suffer from being incomplete, being incompatible with ASCII, being
sensitive to endianness problems, being subject to corruption, or all
of the above.

> I tend to think of Unicode as an arbitrarily large code page. Each
> character maps to a number, but that number could be 1, 1000 or
> 500_000 (Unicode seems to be growing with no end in sight).

The twentieth century just called, it wants its "code page" idiom back.

> Internally, you might store those code points as Integers or Quad
> Words or whatever you like. Only once you're ready to transfer the
> text to another process (print on screen, save to a file, stream
> across a network), do you convert the Unicode back into UTF-8.

Internally, this is a reasonable choice in some cases, but not actually
that useful in most. Your reasoning rests on the assumption that
accessing a single Unicode code point in the string would be useful.
Most of the time, it is not. Remember that a Unicode code point is not
necessarily a character. You need to choose the data structure based on
the operations you intend to perform on the text. And actually, most of
the time, using an array of octets in UTF-8 is the best choice for the
internal representation too.

> Basically, you consider UTF-8 to be a transfer-only format (like
> Base64). If you want to do anything non-trivial with it, decode it
> into Unicode.

No, definitely not. If somebody wants to do anything non-trivial with
text, then either they already know what they are doing better than
this and do not need that advice, or they do not, and they will get it
wrong. Use a library, and use whatever text format that library uses.

The problem does not come from UTF-8 or Unicode or anything
computer-related; the problem comes from the nature of written human
text: writing systems are insanely complex.

Regards,

-- 
Nicolas George
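
PS: For what it is worth, a minimal sketch of the points above. Python 3
is used only as a convenient illustration (it is not part of the
discussion); it assumes Python's str holds code points and bytes holds
octets. The same observations hold in any language.

    # Illustration only: Python 3 assumed; str holds code points, bytes holds octets.
    import unicodedata

    # One user-perceived character, two possible code-point sequences:
    precomposed = "\u00e9"    # e-acute as the single code point U+00E9
    decomposed = "e\u0301"    # "e" followed by U+0301 COMBINING ACUTE ACCENT

    print(len(precomposed), len(decomposed))  # 1 2   -> a code point is not a character
    print(precomposed == decomposed)          # False -> code-point equality is not
                                              #          character equality
    print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True

    # Code points must be encoded to octets to be stored or transmitted;
    # UTF-8 is variable-width and ASCII-compatible:
    print(precomposed.encode("utf-8"))        # b'\xc3\xa9' -> two octets, one code point
    print("abc".encode("utf-8"))              # b'abc'      -> ASCII text is unchanged

    # And not every octet sequence is valid UTF-8:
    try:
        b"\xc3\x28".decode("utf-8")
    except UnicodeDecodeError as err:
        print("invalid UTF-8:", err)

Note that whether the internal representation is code points (Python's
str) or UTF-8 octets, the "one unit = one character" assumption fails
either way.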