Rustom Mody wrote:

> On Friday, March 6, 2015 at 10:50:35 AM UTC+5:30, Chris Angelico wrote:

[snip example of an analogous situation with NULs]

> Strawman.

Sigh. If I had a dollar for every time somebody cried "Strawman!" when
what they really should say is "Yes, that's a good argument, I'm afraid I
can't argue against it, at least not without considerable thought", I'd
be a wealthy man...

> Lets please stick to UTF-16 shall we?
>
> Now tell me:
> - Is it broken or not?

The UTF-16 standard is not broken. It is a perfectly adequate
variable-width encoding, and considerably better than most other
variable-width encodings.

However, many implementations of UTF-16 are faulty, and assume a
fixed-width. *That* is broken, not UTF-16.

(The difference between specification and implementation is critical.)

> - Is it widely used or not?

It's quite widely used.

> - Should programmers be careful of it or not?

Programmers should be aware whether or not any specific language uses
UTF-16 and whether the implementation is buggy. That will help them
decide whether or not to use that language.

> - Should programmers be warned about it or not?

I'm in favour of people having more knowledge rather than less. I don't
believe that ignorance is bliss, except perhaps in the case that a giant
asteroid the size of Texas is heading straight for us.

Programmers should be aware of the limitations or bugs in any UTF-16
implementation they are likely to run into. Hence my general
recommendation:

- For transmission over networks or storage on permanent media (e.g. the
  content of text files), use UTF-8. It is well-implemented by nearly all
  languages that support Unicode, as far as I know.
- If you are designing your own language, your implementation of Unicode
  strings should use something like Python's FSR, or UTF-8 with tweaks to
  make string indexing O(1) rather than O(N), or correctly-implemented
  UTF-16, or even UTF-32 if you have the memory. (Choices, choices.) If,
  in 2015, you design your Unicode implementation as if UTF-16 is a fixed
  2-byte per code point format, you fail.
- If you are using an existing language, be aware of any bugs and
  limitations in its Unicode implementation. You may or may not be able
  to work around them, but at least you can decide whether or not you
  wish to try.
- If you are writing your own file system layer, it's 2015 fer fecks
  sake, file names should be Unicode strings, not bytes! (That's one part
  of the Unix model that needs to die.) You can use UTF-8 or UTF-16 in
  the file system, whichever you please, but again remember that both
  are variable-width formats.

--
Steven
--
https://mail.python.org/mailman/listinfo/python-list
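
For illustration, the "fixed 2-byte per code point" mistake can be
sketched in a few lines of Python 3. This is only a minimal sketch: the
helper name utf16_code_units and the three sample characters are made up
for the example. It counts the 16-bit code units and UTF-8 bytes each
character needs, and shows that a character outside the Basic
Multilingual Plane takes two UTF-16 code units (a surrogate pair), so
"number of code units" and "number of code points" disagree.

```python
def utf16_code_units(text: str) -> int:
    """Return the number of 16-bit code units in the UTF-16 encoding of text.

    Hypothetical helper for this illustration only.
    """
    # "utf-16-le" is used so that no byte-order mark is prepended.
    return len(text.encode("utf-16-le")) // 2


# Sample characters chosen for the example (an assumption, not from the post):
samples = [
    "a",           # U+0061, ASCII
    "\u20ac",      # U+20AC EURO SIGN, inside the Basic Multilingual Plane
    "\U0001F600",  # U+1F600, outside the BMP, needs a surrogate pair
]

for ch in samples:
    print(
        f"U+{ord(ch):04X}:",
        f"code points={len(ch)}",
        f"UTF-16 code units={utf16_code_units(ch)}",
        f"UTF-8 bytes={len(ch.encode('utf-8'))}",
    )

# An implementation that treats one code unit as one code point would
# report a length of 2 for this single-character string.
astral = "\U0001F600"
print("surrogate pair:", astral.encode("utf-16-be").hex())
print("code units:", utf16_code_units(astral), "code points:", len(astral))
```

Python 3's own len() counts code points, which is why it serves as the
reference value here; the code-unit count is what an implementation
assuming fixed-width UTF-16 would report instead.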