Rustom Mody wrote:

> On Wednesday, March 4, 2015 at 10:25:24 AM UTC+5:30, Chris Angelico wrote:
>> On Wed, Mar 4, 2015 at 3:45 PM, Rustom Mody wrote:
>> >
>> > It lists some examples of software that somehow break/goof going from
>> > BMP-only unicode to 7.0 unicode.
>> >
>> > IOW the suggestion is that the two-way classification
>> > - ASCII
>> > - Unicode
>> >
>> > is less useful and accurate than the 3-way
>> >
>> > - ASCII
>> > - BMP
>> > - Unicode
>>
>> How is that more useful? Aside from storage optimizations (in which
>> the significant breaks would be Latin-1, UCS-2, and UCS-4), the BMP is
>> not significantly different from the rest of Unicode.
>
> Sorry... Don't understand.

Chris is suggesting that going from BMP to all of Unicode is not the hard
part. Going from ASCII to the BMP part of Unicode is the hard part. If you
can do that, you can go the rest of the way easily.

I mostly agree with Chris. Supporting *just* the BMP is non-trivial in UTF-8
and UTF-32, since that goes against the grain of the system. You would have
to program in artificial restrictions that otherwise don't exist.

UTF-16 is different, and that's probably why you think supporting all of
Unicode is hard. With UTF-16, there really is an obvious distinction between
the BMP and the supplementary planes (the SMPs, loosely speaking): that's
where you jump from a single 2-byte unit to a pair of 2-byte units. But that
distinction doesn't exist in UTF-8 or UTF-32:

- In UTF-8, about 99.8% of the BMP requires multiple bytes. Whether you
  support the SMPs or not doesn't change the fact that you have to deal with
  multi-byte characters.

- In UTF-32, everything is fixed-width whether it is in the BMP or not.

In both cases, supporting the SMPs is no harder than supporting the BMP.
It's only UTF-16 that makes the SMPs seem hard.

Conclusion: faulty implementations of UTF-16 which incorrectly handle
surrogate pairs should be replaced by non-faulty implementations, or changed
to UTF-8 or UTF-32; incomplete Unicode implementations which assume that
Unicode is 16-bit only (e.g. UCS-2) are obsolete and should be upgraded.

Wrong conclusion: SMPs are unnecessary and unneeded, and we need a new
standard that is just like obsolete Unicode version 1.

Unicode version 1 is obsolete for a reason. 16 bits is not enough for even
existing languages, let alone all the code points and characters that are
used in human communication.

>> Also, the expansion from 16-bit was back in Unicode 2.0, not 7.0. Why
>> do you keep talking about 7.0 as if it's a recent change?
>
> It is 2015 as of now. 7.0 is the current standard.
>
> The need for the adjective 'current' should be pondered upon.

What's your point?
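
(As an aside on the width argument earlier in this post: it is easy to check from Python 3. The snippet below is only a rough sketch. The names `ENCODINGS` and `widths` are made up for illustration, and the three sample characters are arbitrary picks of mine, one from ASCII, one from the rest of the BMP, and one from outside it.)

```python
# Sketch only: ENCODINGS and widths are hypothetical names for illustration.
ENCODINGS = ("utf-8", "utf-16-le", "utf-32-le")  # -le variants avoid a BOM


def widths(ch):
    """Return the encoded size in bytes of one code point in each encoding."""
    return tuple(len(ch.encode(enc)) for enc in ENCODINGS)


print(widths("a"))           # ASCII:           (1, 2, 4)
print(widths("\u20ac"))      # BMP, EURO SIGN:  (3, 2, 4)
print(widths("\U0001F600"))  # outside the BMP: (4, 4, 4)

# Count the BMP code points that need more than one byte in UTF-8.
# "surrogatepass" lets the lone surrogates U+D800..U+DFFF be encoded,
# so that every one of the 65536 BMP code points is counted.
multi_byte = sum(
    1
    for cp in range(0x10000)
    if len(chr(cp).encode("utf-8", "surrogatepass")) > 1
)
print(multi_byte, multi_byte / 0x10000)  # 65408 of 65536, roughly 0.998
```

Only the UTF-16 column changes width at the edge of the BMP. UTF-8 is already multi-byte for everything past ASCII, and UTF-32 never changes at all.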

The UTF encodings have not changed since they were first introduced. They
have been stable for at least twenty years: UTF-8 has existed since 1993,
and UTF-16 since 1996.

Since version 2.0 of Unicode in 1996, the standard has made "stability
guarantees" that no code points will be renamed or removed. Consequently,
there has only been one version which removed characters, version 1.1.
Since then, new versions of the standard have only added characters, never
moved, renamed or deleted them.

http://unicode.org/policies/stability_policy.html

Some highlights in Unicode history:

Unicode 1.0 (1991): initial version, defined 7161 code points.

In January 1993, Rob Pike and Ken Thompson announced the design and working
implementation of the UTF-8 encoding.

1.1 (1993): defined 34233 characters, finalised Han Unification. Removed
some characters from the 1.0 set. This is the first and only time any code
points have been removed.

2.0 (1996): First version to include code points in the supplementary
planes. Defined 38950 code points. Introduced the UTF-16 encoding.

3.1 (2001): Defined 94205 code points, including 42711 additional Han
ideographs, bringing the total number of CJK code points alone to 71793,
too many to fit in 16 bits.

2006: The People's Republic of China mandates support for the GB-18030
character set for all software products sold in the PRC. GB-18030 supports
the entire Unicode range, including the SMPs. Since this date, all software
sold in China must support the SMPs.

6.0 (2010): The first emoji or emoticons were added to Unicode.

7.0 (2014): 113021 code points defined in total.

> In practice, standards change.
> However if a standard changes so frequently that users have to play
> catching cook and keep asking: "Which version?" they are justified in
> asking "Are the standard-makers doing due diligence?"

Since Unicode has stability guarantees, and the encodings have not changed
in twenty years and will not change in the future, this argument is bogus.
Updating to a new version of the standard means, to a first approximation,
merely allocating some new code points which had previously been undefined
but are now defined.

(Code points can be flagged deprecated, but they will never be removed.)

-- 
Steven