On 2017-09-05, at 06:36, Pew, Curtis G wrote:
>
> Unicode was originally supposed to be a fixed-width, 16-bit encoding.
> Fixed-width was actually a design criterion for the original developers. It
> was only after it became clear that there was no possible way to fit all the
> needed characters into 16 bits that the “astral planes”[1] were (reluctantly)
> added to Unicode and the various UTF encodings defined. In this light, UTF-16
> is the closest thing to the original version of Unicode. Also, if your text
> includes few or no Latin characters, UTF-16 may be as compact as, or even
> more compact than, UTF-8, and can probably be processed more easily.
>
Are you confusing UTF-16 and UCS-2?
https://en.wikipedia.org/wiki/UTF-16
UTF-16 (16-bit Unicode Transformation Format) is a character encoding
capable of encoding all 1,112,064 valid code points of Unicode. The
encoding is variable-length, as code points are encoded with one or two
16-bit code units. (also see Comparison of Unicode encodings for a
comparison of UTF-8, -16 & -32)
UTF-16 developed from an earlier fixed-width 16-bit encoding known as UCS-2
(for 2-byte Universal Character Set) once it became clear that 16 bits were
not sufficient for Unicode's user community.
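
To make the "one or two 16-bit code units" point concrete, here is a minimal Python sketch. The function names (utf16_code_units, to_surrogate_pair, encoded_sizes) and the three sample characters are my own choices for illustration, not anything from the Wikipedia article. It relies only on Python's built-in "utf-8" and "utf-16-be" codecs, and the surrogate arithmetic is the standard UTF-16 rule as I understand it.

```python
def utf16_code_units(text):
    """Return the 16-bit code units of text in UTF-16 (big-endian, no BOM)."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into a (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above the BMP need a surrogate pair")
    offset = code_point - 0x10000          # 20 significant bits remain
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


def encoded_sizes(text):
    """Return (UTF-8 byte count, UTF-16 byte count) for text."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-be"))


if __name__ == "__main__":
    # Illustrative samples: a Latin letter, a CJK ideograph, and a
    # character outside the BMP (U+1D11E, MUSICAL SYMBOL G CLEF).
    for sample in ("A", "\u65e5", "\U0001D11E"):
        units = " ".join(f"{u:04X}" for u in utf16_code_units(sample))
        utf8_len, utf16_len = encoded_sizes(sample)
        print(f"U+{ord(sample):04X}: UTF-16 units [{units}], "
              f"UTF-8 {utf8_len} bytes, UTF-16 {utf16_len} bytes")
```

A code point in the BMP takes one code unit, which is all UCS-2 could express. Anything above U+FFFF takes two, which is what makes UTF-16 variable-length. The byte counts also show where the compactness argument holds: the CJK sample is shorter in UTF-16 than in UTF-8, while the Latin sample is longer.
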
> Since Java was developed when Unicode was still supposed to be a 16-bit
> encoding, the early versions at least used what we would now call UTF-16. As I
> recall, there was a significant period of time after Unicode abandoned a
> fixed-width 16-bit representation before Java implementations really
> supported characters from the “astral planes”.
>
>
> [1] Unicode is still organized into 64K ranges called “planes”. The original
> 0x0000–0xFFFF range is called the “Basic Multilingual Plane” (BMP) and “astral
> planes” is a convenient nickname for the other ranges.
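
As a small illustration of the footnote: since each plane spans 64K code points, the plane number is just the code point shifted right by 16 bits. The sketch below is Python, and the names plane_of and is_astral are hypothetical helpers made up for this example.

```python
MAX_CODE_POINT = 0x10FFFF  # last code point Unicode defines


def plane_of(code_point):
    """Return the plane number (0..16) that contains code_point."""
    if not 0 <= code_point <= MAX_CODE_POINT:
        raise ValueError("not a Unicode code point")
    return code_point >> 16  # each plane covers 0x10000 (64K) code points


def is_astral(code_point):
    """True for code points outside the Basic Multilingual Plane (plane 0)."""
    return plane_of(code_point) > 0


if __name__ == "__main__":
    for cp in (0x0041, 0xFFFF, 0x10000, 0x1D11E, 0x10FFFF):
        label = "astral" if is_astral(cp) else "BMP"
        print(f"U+{cp:04X} is in plane {plane_of(cp)} ({label})")
```

Plane 0 is the BMP and planes 1 through 16 are the "astral" ones, which gives 17 planes in total.
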
-- gil
----------------------------------------------------------------------
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to [email protected] with the message: INFO IBM-MAIN