On Sat, Mar 7, 2015 at 3:20 AM, Rustom Mody <rustompm...@gmail.com> wrote:
> C's string is not bug-prone its plain buggy as it cannot represent strings
> with nulls.
>
> I would not go that far for UTF-16.
> It is bug-inviting but it can also be implemented correctly

C's standard library string handling functions are restricted in that they handle an alphabet of 255 byte values - every byte except NUL. They do not handle Unicode, they do not handle NUL, that is simply how they are. But I never said I was talking about the C standard library. If you type a text string into a GUI entry field, or encode it quoted-printable and pass it to a web server, or whatever, you shouldn't know or care about what language the program is written in; and if that program barfs on a NUL, that's a limitation. That limitation might be caused by its naive use of strcpy() when it should have used memcpy(), but that's not your problem. It's exactly the same here: if your program chokes on an SMP character, I don't care what your program was written in or what library functions your program called on. All I care about is that your program - repeated for emphasis, *your* program - failed on that input. It's up to you to choose your underlying functions appropriately.
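Just as a quick sketch of what "handling it" looks like in Python 3 (the exact characters here don't matter, and any language with a real text type should behave much the same):

s = "abc\x00def\U0001f4a9"    # embedded NUL plus an SMP (astral) character

print(len(s))              # 8 - the NUL and the astral char are just code points
print(hex(ord(s[-1])))     # 0x1f4a9 - the SMP character comes back out intact
assert s.encode("utf-8").decode("utf-8") == s   # and it round-trips cleanly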
>> - If you are designing your own language, your implementation of Unicode
>> strings should use something like Python's FSR, or UTF-8 with tweaks to
>> make string indexing O(1) rather than O(N), or correctly-implemented
>> UTF-16, or even UTF-32 if you have the memory. (Choices, choices.)
>
> FSR is possible in python for very specific pythonic reasons
> - dynamicness
> - immutable strings
>
> Drop either and FSR is impossible

I don't know what you mean by "dynamicness". What you do need is a Unicode string type, such that the application program isn't aware of the underlying bytes, but simply treats the string as a sequence of code points. The immutability isn't technically a requirement, but it does make the FSR much more manageable; in a language with mutable strings, it's probably more efficient to use UTF-32 for simplicity, but it's up to the language designer to figure that out. (It might be best to use something like the FSR, but where strings are never narrowed after being widened, so it'd be possible for an ASCII-only string to be stored as UTF-32. That has consequences for comparisons, but might give a reasonable hybrid of storage and mutation performance.)

> _tkinter.TclError: character U+1f4a9 is above the range (U+0000-U+FFFF)
> allowed by Tcl
>
> So who/what is broken?

The exception is pretty clear on that point. Tcl can't handle SMP characters. So it's Tcl that's broken. Unless there's evidence to the contrary, that's what I would expect to be the case.

> Correct.
> Windows is broken for using UTF-16
> Linux is broken for conflating UTF-8 and byte string.
>
> Lot of breakage out here dont you think?
> May be related to the equation
>
> UTF-16 = UCS-2 + Duct-tape

UTF-16 is an encoding that was designed to be backward-compatible with UCS-2, just as UTF-8 was designed to be compatible with ASCII. Call it what you will, but backward compatibility is pretty important. Look at something like 3DES: if you use the same key three times, it's compatible with plain DES.

Linux isn't "broken" for conflating UTF-8 and byte strings. Linux is flawed in that it defines file names to be byte strings, which means that every file system could differ in what it actually uses as the encoding. Since file names exist for the benefit of humans, they should be treated as text, so we should work with them as text. But for reasons of backward compatibility, Linux hasn't yet changed.
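Incidentally, that's more or less the dance CPython itself does today: it maps those byte-string file names to text and back using the file system encoding plus the surrogateescape error handler, so even undecodable bytes survive the round trip. Roughly like this (the exact result depends on your file system encoding, and the file name here is made up):

import os

raw = b"caf\xe9.txt"             # a Latin-1 file name; not valid UTF-8
name = os.fsdecode(raw)          # with UTF-8 + surrogateescape: 'caf\udce9.txt'
print(repr(name))
print(os.fsencode(name) == raw)  # True - the original bytes round-trip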
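And since the next point hinges on what UTF-16 actually does with an SMP character, here's the same character Tcl choked on, viewed from Python (the hex strings are just little-endian byte dumps):

ch = "\U0001f4a9"

print(len(ch))                       # 1 - one code point, as far as Python cares
print(ch.encode("utf-16-le").hex())  # 3dd8a9dc - a surrogate pair, two 16-bit units
print(ch.encode("utf-32-le").hex())  # a9f40100 - a single 32-bit unit

Anything that indexes that surrogate pair as if it were UCS-2 sees two "characters" where there is one, which is exactly the mistake described below.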
Windows isn't broken for using UTF-16. I think it's a poor trade-off, given that so many file names are ASCII-only; and, of course, if any program treats a Windows file name as UCS-2, then that program is broken. But UTF-16 is not itself broken, any more than UTF-7 is. And UTF-7 is a lot harder to work with.

ChrisA
--
https://mail.python.org/mailman/listinfo/python-list