Joe Ciccone wrote:

> The most interesting thing from a programmer's point of view is the way
> the characters are handled. This is the reason why incompatibilities
> exist. A non-UTF-8 character, char, is 8 bits whereas a UTF-8 character,
> wchar, is 32 bits. It's hard to write code to properly support both types
> of locales. Also, wchar processing code is slightly slower than char
> processing code. Most programmers try to avoid it, including myself.

Well, ASCII is technically 7 bits, but most systems recognize Latin-1,
which is 8 bits.  IIRC, UTF-8 characters are actually 1, 2, 3, or 4 bytes
depending on the character.  The first 128 code points encode as single
bytes identical to ASCII.  The vast majority of characters in common use
fit in 16 bits and take at most three bytes in UTF-8.  There are
somewhere around 30-40K characters defined.
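For illustration, here is a minimal C99 sketch (the sample strings are my
own, not from the thread) that prints how many bytes each character
occupies when encoded in UTF-8, covering the whole 1- to 4-byte range:

  #include <stdio.h>
  #include <string.h>

  int main(void)
  {
      /* Each string holds exactly one character, encoded in UTF-8. */
      const char *samples[] = {
          "A",                 /* U+0041  ASCII letter,    1 byte  */
          "\xC3\xA9",          /* U+00E9  e with acute,    2 bytes */
          "\xE2\x82\xAC",      /* U+20AC  euro sign,       3 bytes */
          "\xF0\x90\x8D\x88"   /* U+10348 beyond 16 bits,  4 bytes */
      };

      for (size_t i = 0; i < sizeof(samples) / sizeof(samples[0]); i++)
          printf("sample %zu: %zu byte(s)\n", i, strlen(samples[i]));

      return 0;
  }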

Programmers do have to allow for 4-byte characters when manipulating UTF-8.
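Here is a hedged sketch of what that usually looks like in plain C (the
helper name utf8_seq_len is mine, not anything from glibc): the lead byte
of each sequence tells you how many bytes it occupies, so you step through
the string one sequence at a time instead of one byte at a time.

  #include <stdio.h>
  #include <string.h>

  /* Length in bytes of the UTF-8 sequence that starts with this lead
   * byte; returns 1 for continuation or invalid bytes so the caller
   * can simply skip them. */
  static int utf8_seq_len(unsigned char lead)
  {
      if (lead < 0x80)           return 1; /* 0xxxxxxx: ASCII        */
      if ((lead & 0xE0) == 0xC0) return 2; /* 110xxxxx: 2-byte form  */
      if ((lead & 0xF0) == 0xE0) return 3; /* 1110xxxx: 3-byte form  */
      if ((lead & 0xF8) == 0xF0) return 4; /* 11110xxx: 4-byte form  */
      return 1;
  }

  int main(void)
  {
      const char *s = "caf\xC3\xA9 \xE2\x82\xAC"; /* "cafe" + euro sign */
      int chars = 0;

      for (const char *p = s; *p; p += utf8_seq_len((unsigned char)*p))
          chars++;

      printf("%d characters in %zu bytes\n", chars, strlen(s));
      return 0;
  }

Real code would also validate the continuation bytes, but this shows why
byte counts and character counts diverge once you leave ASCII.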

  -- Bruce