On Friday, March 6, 2015 at 3:24:48 PM UTC+5:30, Chris Angelico wrote:
> On Fri, Mar 6, 2015 at 8:02 PM, Rustom Mody wrote:
> >> Broken systems can be shown up by anything. Suppose you have a program
> >> that breaks when it gets a NUL character (not unknown in C code); is
> >> the fault with the Unicode consortium for allocating something at
> >> codepoint 0, or the code that can't cope with a perfectly normal
> >> character?
> >
> > Strawman.
>
> Not really, no. I know of lots of programs that can't handle embedded
> NULs, and which fail in various ways when given them (the most common
> is simple truncation, but it's by far not the only way).

Ah well, if you insist on pursuing the nul-char example...
No, the Unicode consortium (or its ASCII equivalent) is not wrong for
allocating codepoint 0, nor is the code that "can't cope with a perfectly
normal character" at fault. The fault lies with C, for having a data
structure called "string" with a 'hole' in it.
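To make that 'hole' concrete, here is a rough sketch from the Python side
(my own illustration, nothing from Chris's post): the same seven bytes are
perfectly ordinary data to Python, but the moment they are read back as a
NUL-terminated C string the tail silently vanishes.

    # Python's bytes/str carry an explicit length, so NUL is just another
    # element; a C "string" has no length, only a terminator.
    import ctypes

    data = b"abc\x00def"
    print(len(data))                    # 7 -- Python sees all seven bytes
    print(ctypes.c_char_p(data).value)  # b'abc' -- the C-string view stops at NUL

    # CPython itself guards this hole whenever a path is handed down to C:
    try:
        open("abc\x00def")
    except (ValueError, TypeError) as err:  # "embedded null byte" on current 3.x
        print(err)

A length-counted string type has no such hole, which is the whole point:
the problem lives in the container, not in codepoint 0.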
> And it's
> exactly the same: a program that purports to handle arbitrary Unicode
> text should be able to handle arbitrary Unicode text, not "Unicode
> text as long as it contains only codepoints within the range X-Y". It
> doesn't matter whether the code chokes on U+0000, U+005C, U+FFFC, or
> U+1F4A3 - if your code blows up, it's a failure in your code.
>
> > Lets please stick to UTF-16 shall we?
> >
> > Now tell me:
> > - Is it broken or not?
> > - Is it widely used or not?
> > - Should programmers be careful of it or not?
> > - Should programmers be warned about it or not?
>
> No, UTF-16 is not itself broken. (It would be if we expected
> codepoints >0x10FFFF, and it's because of UTF-16 that that's the cap
> on Unicode, but it's looking unlikely that we'll be needing any more
> than that anyway.) What's broken is code that tries to treat UTF-16 as
> if it's UCS-2, and then breaks on surrogate pairs.
>
> Yes, it's widely used. Programmers should probably be warned about it,
> but only because its tradeoffs are generally poorer than UTF-8's. If
> you use it correctly, there's no problem.
>
> > Also:
> > Can a programmer who is away from UTF-16 in one part of the system (say by
> > using python3)
> > assume he is safe all over?
>
> I don't know what you mean here. Do you mean that your Python 3
> program is "at risk" in some way because there might be some other
> program that misuses UTF-16?

Yes: some other program/library/API etc. connected to the Python one.

> Well, sure. And there might be some other
> program that misuses buffer sizes, SQL queries, or shell invocations,
> and makes your overall system vulnerable to buffer overruns or
> injection attacks. These are significantly more likely AND more
> serious than UTF-16 misuses. And you still have not proven anything
> about SMP characters being a problem, but only that code can be
> broken. Broken code is still broken code, no matter what your actual
> brokenness.

Roy Smith's example (and many other links I've cited) proves exactly that:
an SMP character broke the code.

Note: I have no objection to people supporting full Unicode 7.
I'm just saying it may be significantly harder than just "Use Python 3 and
you are done" (a small sketch of why is below).
--
https://mail.python.org/mailman/listinfo/python-list
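Here is that sketch of the UTF-16/UCS-2 mismatch (my illustration only,
not Roy Smith's code or anything from the cited links): Python 3 sees one
character, UTF-16 sees a surrogate pair, and any component that cuts text
on code-unit boundaries can manufacture a lone surrogate that won't decode.

    # U+1F4A3 is a single code point to Python 3 but a surrogate pair in UTF-16.
    ch = "\U0001F4A3"                      # an SMP character

    print(len(ch))                         # 1 -- code points
    units = ch.encode("utf-16-le")
    print(len(units) // 2)                 # 2 -- UTF-16 code units

    # A UCS-2-minded consumer that grabs "the first character" as one 16-bit
    # unit is left holding an unpaired surrogate:
    first_unit = units[:2]
    try:
        first_unit.decode("utf-16-le")
    except UnicodeDecodeError as err:
        print("broke on the lone surrogate:", err)

Python 3 puts no foot wrong here; the breakage only appears once the text
crosses into something that still thinks in UCS-2, which is exactly why
"use Python 3" alone does not close the issue.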