OK, this has turned into a long essay, so unless questions are really
addressed to me explicitly, I will try to avoid writing anything else
on this subject.

Here's my Jeremiad on Unicode.  Take it for what it's worth.


"Johny Mattsson (EPA)" wrote:

| If we settle on wchar_t being 16bits, then we will still be forced to do
| UTF-7/8/16 to properly handle a random Unicode (or ISO/IEC 10646) string,
| since we must deal with that charming thing known as "surrogate pairs" (see
| section 3.7 of the Unicode standard v3.0). This again breaks the "one
| wchar_t == one character". When being forced to deal with Unicode, I much
| prefer working with 32bits, since that guarantees that I get a fixed length
| for each character. Admittedly, it is space inefficient to the Nth degree,
| but speedwise it is better.

ISO/IEC 10646-1 doesn't have any code points allocated above the
low 16 bits.  It's the same as the Unicode 1.1 standard.

Unicode 3.0 throws a whole lot of dead languages into the mix,
or it tries to allocate separate code points for non-existent
character sets, whose glyphs should be, according to the Unicode
Philosophy that resulted in the controversial CJK unification,
unified with existing glyphs within the character set.  Unicode,
after all, is a character set standard, not a font encoding
standard.

Unicode 3.x has not been ratified as an ISO/IEC standard, and it
may not ever be.  So Unicode 3.x incursions above 16 bits are not
really a valid argument until Unicode 3.x is standardized in some
way other than administrative fiat by the Unicode Consortium
having published a new version to sell more books and justify its
continued existence to the people funding it.
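
That said, for reference, the surrogate pair mechanics the quoted
text refers to (section 3.7 of Unicode 3.0) are nothing more than
bit arithmetic.  A minimal sketch in C; the function name is mine,
for illustration only:

#include <stdint.h>

/*
 * Split a code point above U+FFFF into a UTF-16 surrogate pair.
 * Returns 1 and fills hi/lo if a pair is needed; returns 0 if the
 * code point fits in a single 16-bit unit (or is out of range).
 */
static int
encode_surrogate_pair(uint32_t cp, uint16_t *hi, uint16_t *lo)
{
	if (cp < 0x10000 || cp > 0x10FFFF)
		return 0;
	cp -= 0x10000;			/* 20 bits remain */
	*hi = 0xD800 | (cp >> 10);	/* high (leading) surrogate */
	*lo = 0xDC00 | (cp & 0x3FF);	/* low (trailing) surrogate */
	return 1;
}

The moment any such code point is actually used, a 16-bit wchar_t
becomes a variable length encoding, which is the quoted complaint.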

--

Historically, I've really had a love/hate relationship with
Unicode.

When Unicode was originally designed, it intentionally excluded
fixed-cell rendering technologies: if the font was pre-rendered,
you could not render characters with ligatures intact.

Personally, I blame this on the fact that Taligent, the real
driving force behind the first Unicode standard, was an IBM
and Apple joint venture, and owed its pocketbook to rendering
technologies like Display PostScript, which were direct
competitors with X Windows... and X Windows uses fixed-cell
rendering technology, even when it's using TrueType fonts.

So when Unicode first came out, the "private use" areas were
not large enough, nor sufficiently near to, or interleaved
with, the ranges of ligatured languages like Tamil and
Devanagari, or even Arabic and Hebrew.

There was a fundamental assumption that the rendering technology
would be disjoint from the encoding technology, and that the
cost, due to the arrangement of the "private use" areas, was to
be borne in the rendering engine.  And rendering engines where
that was not possible (e.g. X Windows) would just have to paint
pixels and eat the overhead in the applications (and they did;
you can install "xtamil" from ports and see how it works).

The Japanese *hate* Unicode.  The primary reason for this hate
is, to be blunt, that Unicode is not a superset of JIS-208 or
JIS-208 + JIS-212; the secondary reason is that Japan is nearly
as protectionist as France, and the CJK unification used the
Chinese dictionary order.  There is a good reason for this,
however: Chinese dictionary order is capable of classifying
Japanese ideograms.  A simplification of this is that Chinese
dictionary classification is in "radical, stroke" order; thus
it is capable of classifying ideograms that "look like" they
are Chinese ideograms.  The Japanese classification systems
are not capable of doing this, and the Japanese have two widely
recognized classification systems for lexical ordering internal
to Japan, so it's not even possible to pick a "right order" if
you were to say "all the Japanese characters, *then* all the
Chinese characters".

In practice, this is a subject for academics who care about the
number of angels which can dance on the head of a pin.  But it
has a slightly deeper protectionist agenda, as well.  The
Japanese computer market, for a very long time, was not a
commoditized market.  Perhaps the largest market share went to
the NEC-PC98 (indeed, there's explicit support in FreeBSD for
this machine).  In such a market, it's possible to create
products which are non-commodity, and end up "owning" customers.
In addition, things like EUC encoding and XPG/4 are rarely
supported by non-Japanese software titles, which protects the
local software production market.  MITI, in fact, has as one of
its mandates, the protection of a market for locally produced
software.

Microsoft's introduction of Unicode meant that third party
software written solely to Microsoft interfaces, using
"oleString" and other wchar_t types natively, got immediate
support for Japanese.  Microsoft broke down the wall that had
been built in order to protect local markets.

So, getting back to the main line of discussion, with this
background in hand:

| As for interoperability with Windows, it is clearly stated that the wchar_t
| is intended for internal usage only, and the various encoding schemes should
| be used when storing strings outside of a process. In reality this means
| that just about every Unicode capable application reads and writes in UTF-8
| or 7. This means that interoperability should not become an issue. If it
| really was expected to have been an issue, I'm sure the C++ standard would
| have mandated a specific width for wchar_t, which as far as I am aware they
| didn't.

Microsoft's OLESS (OLE Structured Storage), which is the storage
format it uses for most Microsoft applications these days, has
the capability of natively storing and retrieving OLE types,
including "oleString".

Basically, this means that there is no conversion of the textual
data on its way in or out.

Your proposal, to take the phrase "internal use only" literally,
is flawed.

What it basically comes down to is the requirement for explicit
extra work to be done in order to support both l10n (localization)
and i18n (internationalization) in applications, rather than the
applications implicitly supporting them... as Microsoft
applications do.

The net effect of doing this is that we will end up with a lot
of code which, even if our hopes are realized and it is 8-bit
clean, is missing a significant amount of engineering work: the
work that would be necessary for it to support languages not
covered by the 8-bit ISO-8859-1 (Latin-1) character set -- or
whichever 8-bit character set has been selected as the local
default assumed for otherwise unattributed text files.

Therefore, however the problem is handled, it is a good idea to
make sure that default applications, written without the ability
to explicitly convert between internal (processing) and external
(storage) formats, still work for languages other than English.
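
To put a face on that "explicit extra work": the conversion
between the internal (processing) form and an external (storage)
form is, at a minimum, something like the following sketch.  It
leans on the standard wcstombs() interface, and it only produces
UTF-8 if the current locale says so:

#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>

int
main(void)
{
	/* The external encoding is whatever the locale provides. */
	setlocale(LC_CTYPE, "");

	const wchar_t *internal = L"na\xefve";	/* processing form */
	char external[64];			/* storage form */

	size_t n = wcstombs(external, internal, sizeof(external));
	if (n == (size_t)-1) {
		fprintf(stderr, "unconvertible character in string\n");
		return 1;
	}
	printf("%d wide characters became %d bytes\n",
	    (int)wcslen(internal), (int)n);
	return 0;
}

Applications written without any such conversion step are exactly
the ones the previous paragraph is worried about.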

| So, in the light of this, what would be the most appropriate choice? I
| haven't yet had a chance to explore what locales we support, but I would
| lean toward saying wchar_t == 32 bits, since this is future proof. If we
| later down the track are forced to go from 16 -> 32 due us supporting more
| of the asian locales, I foresee this causing _major_ breakage.

This ignores the interoperability issue which I originally
raised.  I realize that "Windows" is a "dirty word" (though we
can all reread the above, and see how clever they are, even if
we want to pretend that they are technically inept), but... I
would like to see it possible to interoperate with third party
ELF libraries initially intended for use with Windows.

What this means is adopting decisions similar to those made by
Windows with regard to minimum size assumptions for intrinsic
types.  Or even *exact* correspondence in size, for values which
may be externalized via a library, COM, DCOM, or some other
marshalling facility that's going to assume that the sizes of
things marshalled in are going to be the same on the way out.
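
One way to nail down that kind of size assumption, rather than
hoping wchar_t happens to match, is to use an explicitly 16-bit
type at the marshalling boundary and make the build fail if the
assumption is ever violated.  A sketch; the type name here is
made up for illustration:

#include <stdint.h>

/*
 * Hypothetical interop type: an explicitly 16-bit code unit for
 * values which have to match the Windows/COM notion of a wide
 * character, regardless of the width of the native wchar_t.
 */
typedef uint16_t olechar16_t;

/*
 * Compile-time size check (pre-C11 idiom): the array size goes
 * negative, and the build breaks, if the assumption is wrong.
 */
typedef char olechar16_must_be_2_bytes[(sizeof(olechar16_t) == 2) ? 1 : -1];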

You *could* "embrace and extend" what Microsoft has put out
there; however, lacking the ability to wield monopolistic power
in the marketplace, it's highly unlikely that your screams of
"My way is the right way!" will be heard over the steamroller
engine.

I guess a good question to ask here is: how big is a single
element in a "String" type in Java?  (It is 16 bits.)

In any case, I rather expect that most of .NET is going to be
assuming 16-bit wchars.  Someone who knows, rather than just
"expecting", needs to speak up here.

--

In any case: expect to multiply your real storage requirements
by a factor of 2 for 16-bit Unicode, and by a factor of 4 for
32-bit Unicode.

Unless you happen to be an English speaker who never bothers
setting the 8th bit on any of your text, UTF encoding is a raw
deal.  Further, it will have a tendency to reduce the market
portability of software you write, no matter what, and without
a lot of extraordinary effort, expect that your code will only
be locally salable.

For Europeans, 8-bit clean won't save you any more.  You will
end up having to take 2 bytes to store any character in the
range 0x80-0xff.  So no matter what, you European programmers
will be screwed by having a storage encoding different from
the process encoding.
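
The arithmetic behind that is easy to verify: every code point
from 0x80 through 0x7ff takes a two byte UTF-8 sequence
(110xxxxx 10xxxxxx).  A small sketch, hand-encoding one Latin-1
character:

#include <stdint.h>
#include <stdio.h>

/* Encode a code point in the range U+0080..U+07FF as two UTF-8 bytes. */
static void
utf8_two_byte(uint32_t cp, uint8_t out[2])
{
	out[0] = 0xC0 | (cp >> 6);	/* 110xxxxx: the upper bits */
	out[1] = 0x80 | (cp & 0x3F);	/* 10xxxxxx: the low six bits */
}

int
main(void)
{
	uint8_t b[2];

	utf8_two_byte(0xE9, b);		/* U+00E9, 'e' with acute accent */
	printf("U+00E9 -> %02X %02X: two bytes, versus one in ISO-8859-1\n",
	    b[0], b[1]);
	return 0;
}

The 2x and 4x multipliers above are the same observation for the
16-bit and 32-bit internal forms: every code unit costs two or
four bytes, whether the eighth bit is ever set or not.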

--

Look: I know that "Microsoft Invented it" is the kiss of death,
but isn't it possible to admit, just this once, that they maybe
had a good idea, and copy them?

-- Terry
