Rustom Mody wrote:

> On Thursday, March 5, 2015 at 7:36:32 PM UTC+5:30, Steven D'Aprano wrote:
[...]
>> Chris is suggesting that going from BMP to all of Unicode is not the
>> hard part. Going from ASCII to the BMP part of Unicode is the hard
>> part. If you can do that, you can go the rest of the way easily.
>
> Depends where the going is starting from.
> I specifically named Java, Javascript, Windows... among others.
> Here's some quotes from the supplementary chars doc of Java:
> http://www.oracle.com/technetwork/articles/javase/supplementary-142654.html
>
> | Supplementary characters are characters in the Unicode standard whose
> | code points are above U+FFFF, and which therefore cannot be described
> | as single 16-bit entities such as the char data type in the Java
> | programming language. Such characters are generally rare, but some
> | are used, for example, as part of Chinese and Japanese personal
> | names, and so support for them is commonly required for government
> | applications in East Asian countries...
>
> | The introduction of supplementary characters unfortunately makes the
> | character model quite a bit more complicated.
>
> | Unicode was originally designed as a fixed-width 16-bit character
> | encoding. The primitive data type char in the Java programming
> | language was intended to take advantage of this design by providing a
> | simple data type that could hold any character.... Version 5.0 of the
> | J2SE is required to support version 4.0 of the Unicode standard, so
> | it has to support supplementary characters.
>
> My conclusion: Early adopters of Unicode -- Windows and Java -- were
> punished for their early adoption. You can blame the Unicode
> consortium, you can blame the babel of human languages, particularly
> that some use characters and some only (the equivalent of) what we
> call words.
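[Since this is python-list, a short Python 3 sketch may help make the
quoted Java documentation concrete. The particular code point used below
is just an arbitrary example of a supplementary character:]

```python
# A concrete look at a supplementary character (Python 3.3+).
# U+1D518 MATHEMATICAL FRAKTUR CAPITAL U lies above U+FFFF, so it
# cannot fit in a single 16-bit code unit such as Java's char.
ch = "\U0001D518"

print(hex(ord(ch)))        # 0x1d518 -- above 0xffff
print(len(ch))             # 1 -- one code point in Python 3.3+

# In UTF-16 the same character takes two 16-bit code units (a
# surrogate pair), which is exactly what trips up UCS-2-era code
# that assumes one char == one character.
utf16 = ch.encode("utf-16-be")
units = [utf16[i:i+2].hex() for i in range(0, len(utf16), 2)]
print(units)               # ['d835', 'dd18']
```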
I see you are blaming everyone except the people actually to blame.

It is 2015. Unicode 2.0 introduced the SMPs in 1996, almost twenty years
ago, the same year as the 1.0 release of Java. Java has had eight major
new releases since then. Oracle, and Sun before them, are/were serious,
tier-1, world-class major IT companies. Why haven't they done something
about introducing proper support for Unicode in Java? It's not hard --
if Python can do it using nothing but volunteers, Oracle can do it. They
could even do it in a backwards-compatible way, by leaving the existing
APIs in place and adding new APIs.

As for Microsoft, as a member of the Unicode Consortium they have no
excuse. But I think you exaggerate the lack of support for SMPs in
Windows. Some parts of Windows have no SMP support, but they tend to be
the oldest and least important (to Microsoft) parts, like the command
prompt.

Does anyone have Powershell and care to see how well it supports the
SMPs?

This Stackoverflow question suggests that post-Windows 2000, the Windows
file system has proper support for code points in the supplementary
planes:

http://stackoverflow.com/questions/7870014/how-does-windows-wchar-t-handle-unicode-characters-outside-the-basic-multilingua

Or maybe not.

> Or you can skip the blame-game and simply note the fact that large
> segments of extant code-bases are currently in bug-prone or plain
> buggy state.
>
> This includes not just bug-prone system code such as Java and Windows
> but seemingly working code such as python 3.

What Unicode bugs do you think Python 3.3 and above have?

>> I mostly agree with Chris. Supporting *just* the BMP is non-trivial
>> in UTF-8 and UTF-32, since that goes against the grain of the system.
>> You would have to program in artificial restrictions that otherwise
>> don't exist.
>
> Yes UTF-8 and UTF-32 make most of the objections to unicode 7.0
> irrelevant.

Glad you agree about that much at least.

[...]
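[As a point of comparison, here is a quick sketch of how Python 3.3+,
with its PEP 393 flexible string representation, treats an astral
character, and how the three standard encoding forms size it relative to
BMP characters. The sample characters are arbitrary:]

```python
# Python 3.3+ strings are sequences of code points, so an astral (SMP)
# character behaves exactly like a BMP one: no surrogates leak through.
s = "a\U0001F600b"            # 'a', U+1F600 GRINNING FACE, 'b'
print(len(s))                 # 3 -- one per code point
print(s[1] == "\U0001F600")   # True -- indexing is per code point

# Code-unit counts for a few characters in each encoding form.
# Only the astral character needs two UTF-16 units (a surrogate pair);
# UTF-32 is uniformly one unit; UTF-8 varies from one to four bytes.
for c in ("A", "\u00e9", "\u20ac", "\U0001F600"):
    print(hex(ord(c)),
          len(c.encode("utf-8")),            # bytes in UTF-8
          len(c.encode("utf-16-le")) // 2,   # 16-bit units in UTF-16
          len(c.encode("utf-32-le")) // 4)   # 32-bit units in UTF-32
```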
>> Conclusion: faulty implementations of UTF-16 which incorrectly handle
>> surrogate pairs should be replaced by non-faulty implementations, or
>> changed to UTF-8 or UTF-32; incomplete Unicode implementations which
>> assume that Unicode is 16-bit only (e.g. UCS-2) are obsolete and
>> should be upgraded.
>
> Imagine for a moment a thought experiment -- we are not on a python
> but a java forum and please rewrite the above para.

There is no need to re-write it. If Java's only implementation of
Unicode assumes that code points are 16 bits only, then Java needs a new
Unicode implementation. (I assume that the existing one cannot be
changed for backwards-compatibility reasons.)

> Are you addressing the vanilla java programmer? Language implementer?
> Designer? The Java-funders -- earlier Sun, now Oracle?

The last three should be considered the same people. The vanilla Java
programmer is not responsible for the short-comings of Java's
implementation.

[...]

>> > In practice, standards change.
>> > However if a standard changes so frequently that users have to play
>> > catch-up and keep asking "Which version?", they are justified in
>> > asking "Are the standard-makers doing due diligence?"
>>
>> Since Unicode has stability guarantees, and the encodings have not
>> changed in twenty years and will not change in the future, this
>> argument is bogus. Updating to a new version of the standard means,
>> to a first approximation, merely allocating some new code points
>> which had previously been undefined but are now defined.
>>
>> (Code points can be flagged deprecated, but they will never be
>> removed.)
>
> It's not about new code points; it's about "Fits in 2 bytes" to "Does
> not fit in 2 bytes".

I quote you again: "if a standard changes so frequently..."

The move to more than 16 bits happened once. It happened almost 20 years
ago. In what way does this count as frequent changes?

> If you call that argument bogus I call you a non computer scientist.
I am not a computer scientist, and the argument remains bogus. Unicode
does not change "frequently", and changes are backward-compatible.

> [Essentially this is my issue with the consortium -- it seems to be
> working like a bunch of linguists, not computer scientists]

That's rather like complaining that some computer game looks like it was
designed by games players instead of theoreticians. "Why, people have
FUN playing this, almost like it was designed by professionals who think
about gaming!!!"

Unicode is a standard intended for the handling of human languages. It
is intended as a real-life working standard, not some theoretical toy
for academics to experiment with. It is designed to be used, not to have
papers written about it. The character set part of it has effectively
been designed by linguists, and that is a good thing. But the encoding
side of things has been designed by practising computer programmers such
as Rob Pike and Ken Thompson. You might have heard of them.

> Here is Roy Smith's post that first started me thinking that something
> may be wrong with SMP:
> https://groups.google.com/d/msg/comp.lang.python/loYWMJnPtos/GHMC0cX_hfgJ

There are plenty of things wrong with some implementations of Unicode,
those that assume all code points are two bytes. There may be a few
things wrong with the current Unicode standard, such as missing
characters, characters given the wrong name, and so forth. But there's
nothing wrong with the design of the SMP. It allows the great majority
of text, probably 99% or more, to use two bytes (UTF-16) or no more than
three bytes (UTF-8), while only relatively specialised uses need four
bytes for some code points.

> Some parts are here, some earlier, and some from my memory.
> If details are wrong please correct:
> - 200 million records
> - Containing 4 strings with SMP characters
> - System made with python and mysql. SMP works with python, breaks
>   mysql.
> So the whole system broke due to those 4 in 200,000,000 records

No, it broke because MySQL has buggy Unicode handling. Bugs are not
unusual. I used to have a version of Apple's Hypercard which would lock
up the whole operating system if you tried to display the string "0^0"
in a message dialog. Given that classic Mac OS was not a proper
multi-tasking OS like Unix or OS X or even Windows, this was a real
pain.

My conclusion from that is that that version of Hypercard was buggy.
What is your conclusion?

> I know enough (or not enough) of unicode to be chary of statistical
> conclusions from the above.
> My conclusion is essentially an 'existence-proof':
>
> SMP-chars can break systems.

Oh come on. How about this instead?

X can break systems, for every conceivable value of X.

> The breakage is costly-fied by the combination
> - layman statistical assumptions
> - BMP → SMP exercises different code-paths
>
> It is necessary but not sufficient to test print "hello world" in
> ASCII, BMP, SMP. You also have to write the hello world in the
> database -- mysql
> Read it from the webform -- javascript
> etc etc

Yes. This is called "integration testing". That's what professionals do.

> You could also choose to do with "astral crap" (Roy's words) what we
> all do with crap -- throw it out as early as possible.

And when Roy's customers demand that his product support emoji, or
complain that they cannot spell their own name because of his parochial
and ignorant idea of "crap", perhaps he will consider doing what he
should have done from the beginning: stop using MySQL, which is a joke
of a database[1], and use Postgres which does not have this problem.

[1] So I have been told.


-- 
Steven
-- 
https://mail.python.org/mailman/listinfo/python-list