OK, this has turned into a long essay, so unless questions are really addressed to me explicitly, I will try to avoid writing anything else on this subject.
Here's my Jeremiad on Unicode. Take it for what it's worth.

"Johny Mattsson (EPA)" wrote:
| If we settle on wchar_t being 16bits, then we will still be forced to do
| UTF-7/8/16 to properly handle a random Unicode (or ISO/IEC 10646) string,
| since we must deal with that charming thing known as "surrogate pairs" (see
| section 3.7 of the Unicode standard v3.0). This again breaks the "one
| wchar_t == one character". When being forced to deal with Unicode, I much
| prefer working with 32bits, since that guarantees that I get a fixed length
| for each character. Admittedly, it is space inefficient to the Nth degree,
| but speedwise it is better.

ISO/IEC 10646-1 doesn't have any code points allocated above the low 16 bits; it's the same as the Unicode 1.1 standard. Unicode 3.0 throws a whole lot of dead languages into the mix, or it tries to allocate separate code points for non-existent character sets whose glyphs should be, according to the Unicode philosophy that resulted in the controversial CJK unification, unified with existing glyphs within the character set. Unicode, after all, is a character set standard, not a font encoding standard.

Unicode 3.x has not been ratified as an ISO/IEC standard, and it may never be. So Unicode 3.x incursions above 16 bits are not really a valid argument until Unicode 3.x is standardized in some way other than by the administrative fiat of the Unicode Consortium publishing a new version to sell more books and justify its continued existence to the people funding it.

--

Historically, I've really had a love/hate relationship with Unicode.

When Unicode was originally designed, it was intentionally designed to exclude fixed-cell rendering technologies: if the font was pre-rendered, you could not render characters with ligatures intact. Personally, I blame this on the fact that Taligent, the real driving force behind the first Unicode standard, was an IBM and Apple joint venture, and owed its pocketbook to rendering technologies like Display PostScript, which were direct competitors with X Windows... and X Windows uses fixed-cell rendering technology, even when it's using TrueType fonts.

So when Unicode first came out, the "private use" areas were not large enough, nor sufficiently near to or interleaved with, those of ligatured languages, like Tamil and Devanagari, or even Arabic and Hebrew. There was a fundamental assumption that the rendering technology would be disjoint from the encoding technology, and that the cost, due to the arrangement of the "private use" areas, was to be borne in the rendering engine. And rendering engines where that was not possible (e.g. X Windows) would just have to paint pixels and eat the overhead in the applications (and they did; you can install "xtamil" from ports and see how it works).

The Japanese *hate* Unicode. The primary reason for this hate is, to be blunt, that Unicode is not a superset of JIS-208 or JIS-208 + JIS-212; the secondary reason is that the Japanese are nearly as protectionist as the French, and the CJK unification used the Chinese dictionary order. There is a good reason for this, however: Chinese dictionary order is capable of classifying Japanese ideograms. A simplification of this is that Chinese dictionary classification is in "stroke, radical" order; thus it is capable of classifying ideograms that "look like" they are Chinese ideograms.
The Japanese classification system is not capable of doing this, and the Japanese have two widely recognized classification systems for lexical ordering internal to Japan, so it's not even possible to pick a "right order" if you were to say "all the Japanese characters, *then* all the Chinese characters". In practice, this is a subject for academics who care about the number of angels which can dance on the head of a pin.

But it has a slightly deeper protectionist agenda, as well. The Japanese computer market, for a very long time, was not a commoditized market. Perhaps the largest market share went to the NEC PC-98 (indeed, there's explicit support in FreeBSD for this machine). In such a market, it's possible to create products which are non-commodity, and end up "owning" customers. In addition, things like EUC encoding and XPG/4 are rarely supported by non-Japanese software titles, which protects the local software production market. MITI, in fact, has as one of its mandates the protection of a market for locally produced software.

Microsoft's introduction of Unicode, and the subsequent ability of third party software written solely to support Microsoft interfaces that used "oleString" and other wchar_t types natively, meant that there was immediate support for Japanese in these products. Microsoft broke down the wall that had been built in order to protect local markets.

So, getting back to the main line of discussion, with this background in hand:

| As for interoperability with Windows, it is clearly stated that the wchar_t
| is intended for internal usage only, and the various encoding schemes should
| be used when storing strings outside of a process. In reality this means
| that just about every Unicode capable application reads and writes in UTF-8
| or 7. This means that interoperability should not become an issue. If it
| really was expected to have been an issue, I'm sure the C++ standard would
| have mandated a specific width for wchar_t, which as far as I am aware they
| didn't.

Microsoft's OLESS (OLE Structured Storage), which is the storage format it uses for most Microsoft applications these days, has the capability of natively storing and retrieving OLE types, including "oleString". Basically, this means that there is no conversion of the textual data on its way in or out.

Your proposal, to take the phrase "internal use only" literally, is flawed. What it basically comes down to is the requirement for explicit extra work to be done in order to support both l10n (localization) and i18n (internationalization) in applications, rather than the applications implicitly supporting them... as Microsoft applications explicitly support them.

The net effect of doing this is that we will end up with a lot of code which, even if our hopes are realized and it is 8-bit clean, is missing a significant amount of the engineering work necessary for it to support languages not covered by the 8-bit ISO-8859-1 (Latin-1) character set -- or whichever 8-bit character set has been selected as the local primary default attribute assumed on otherwise unattributed text files.

Therefore, however the problem is handled, it is a good idea to make sure that default applications, written without the ability to explicitly convert between internal (processing) and external (storage) formats, still work for languages other than English.
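To make the internal (processing) versus external (storage) distinction concrete, here is a minimal sketch of the kind of explicit conversion step an application ends up carrying: take one 32 bit internal character and spit out UTF-8 bytes for storage. This is purely my illustration, not code lifted from any existing library, and it skips validation and UTF-7 entirely.

    #include <stdio.h>
    #include <stdint.h>

    /*
     * Illustration only: encode a single code point into UTF-8 for
     * external storage.  Returns the number of bytes written to "out",
     * which must have room for 4 bytes.  No validation of surrogates
     * or out-of-range values is done here.
     */
    static int
    cp_to_utf8(uint32_t cp, unsigned char *out)
    {
        if (cp < 0x80) {
            out[0] = (unsigned char)cp;
            return (1);
        }
        if (cp < 0x800) {
            out[0] = (unsigned char)(0xC0 | (cp >> 6));
            out[1] = (unsigned char)(0x80 | (cp & 0x3F));
            return (2);
        }
        if (cp < 0x10000) {
            out[0] = (unsigned char)(0xE0 | (cp >> 12));
            out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[2] = (unsigned char)(0x80 | (cp & 0x3F));
            return (3);
        }
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return (4);
    }

    int
    main(void)
    {
        unsigned char buf[4];
        int i, n;

        n = cp_to_utf8(0x00E9, buf);    /* LATIN SMALL LETTER E WITH ACUTE */
        printf("U+00E9 ->");
        for (i = 0; i < n; i++)
            printf(" 0x%02X", buf[i]);
        printf(" (%d bytes)\n", n);
        return (0);
    }

Feed it U+00E9 and you get 0xC3 0xA9: one character, two bytes, which is exactly the "Europeans pay double" point I come back to below.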
| So, in the light of this, what would be the most appropriate choice? I
| haven't yet had a chance to explore what locales we support, but I would
| lean toward saying wchar_t == 32 bits, since this is future proof. If we
| later down the track are forced to go from 16 -> 32 due us supporting more
| of the asian locales, I foresee this causing _major_ breakage.

This ignores the interoperability issue which I originally raised. I realize that "Windows" is a "dirty word" (though we can all reread the above, and see how clever they are, even if we want to pretend that they are technically inept), but... I would like to see it possible to interoperate with third party ELF libraries initially intended for use with Windows.

What this means is adopting decisions similar to those made by Windows with regard to minimum size assumptions for intrinsic types. Or even *exact* correspondence in size, for values which may be externalized via library, COM, DCOM, or some other marshalling facility that assumes the sizes of things marshalled in are going to be the same on the way out.

You *could* "embrace and extend" what Microsoft has put out there; however, lacking the ability to wield monopolistic power in the marketplace, it's highly unlikely that your screams of "My way is the right way!" will be heard over the steamroller engine.

I guess a good question to ask here is: what is the size of a single element in a "String" type in Java? In any case, I rather expect that most of .NET is going to be assuming 16 bit wchar's. Someone who knows, rather than just "expecting", needs to speak up here.

--

In any case: expect to multiply your real storage requirements by a factor of 2 for 16 bit Unicode, and a factor of 4 for 32 bit Unicode. Unless you happen to be an English speaker who never bothers setting the 8th bit on any of your text, UTF encoding is a raw deal. Further, it will tend to reduce the market portability of the software you write, no matter what; without a lot of extraordinary effort, expect that your code will only be locally salable.

For Europeans, 8-bit clean won't save you any more: you will end up taking two bytes to store any character in the range 0x80-0xff. So no matter what, you European programmers will be screwed by having a storage encoding different from the process encoding.

--

Look: I know that "Microsoft Invented it" is the kiss of death, but isn't it possible to admit, just this once, that they maybe had a good idea, and copy them?

-- Terry
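P.S.: Since I keep leaning on the "16 bit code units" point, here is a throwaway sketch of the two things you would have to pin down to interoperate: the width of the external code unit, and what happens to a character above U+FFFF once you pick 16 bits (it becomes a surrogate pair). This is purely illustrative; the function and the names are mine, not anybody's actual marshalling code.

    #include <stdio.h>
    #include <stdint.h>
    #include <wchar.h>

    /*
     * Purely illustrative, not anybody's real marshalling code.
     *
     * Point 1: wchar_t is whatever the platform says it is (16 bits under
     * Windows, commonly 32 bits elsewhere), so anything externalized via a
     * library, COM/DCOM, or some other marshalling facility needs an
     * explicit-width code unit, e.g. uint16_t, if both sides are to agree.
     *
     * Point 2: once you pick 16 bit units, a code point above U+FFFF has
     * to be carried as a surrogate pair (Unicode 3.0, section 3.7).
     */
    static void
    to_surrogate_pair(uint32_t cp, uint16_t *hi, uint16_t *lo)
    {
        cp -= 0x10000;                            /* 20 bits remain */
        *hi = (uint16_t)(0xD800 | (cp >> 10));    /* high surrogate */
        *lo = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* low surrogate */
    }

    int
    main(void)
    {
        uint16_t hi, lo;

        printf("sizeof(wchar_t) here: %u bytes\n",
            (unsigned)sizeof(wchar_t));
        printf("sizeof(uint16_t) on the wire: %u bytes\n",
            (unsigned)sizeof(uint16_t));

        to_surrogate_pair(0x10000, &hi, &lo);  /* first code point past 16 bits */
        printf("U+10000 -> 0x%04X 0x%04X\n", (unsigned)hi, (unsigned)lo);
        return (0);
    }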