On Thu, Nov 28, 2013 at 01:24:40PM -0500, Strake wrote:
> [..]
>
> > UTF-32 is an encoding that is identical to the unicode point as far as
> > I know. So what I am thinking is that one would either use the UTF-8
> > representation of the Unicode point as an index, or the unicode point
> > itself. Since using UTF-8 would not require any conversion (on UTF-8
> > locales) I think it would be preferable.
> 
> UTF-8 has variable width, so one must find the length of the sequence
> anyhow and shift it bytewise into an integer, so one may as well just
> use fgetwc or the like and work with codepoints.

You are right about the variable width.

According to the standard, UTF-8 has a maximum length of 4 bytes, which
would fit into an int on most platforms (C itself only guarantees 16
bits for int, so uint32_t would be the safer type), so shifting would
not be necessary, I think.

I am not too familiar with C, but wouldn't it theoretically be possible
to figure out the length of a UTF-8 sequence, cast only those bytes to
an int, and use the result as an index into a sparse array of
wchar_t/uint32_t values?
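
A minimal sketch of that idea in C, under a few assumptions: the names
utf8_len and utf8_key are made up for illustration, and the key is built
by copying bytes one at a time rather than by a pointer cast, since
dereferencing a cast pointer would run into alignment and aliasing rules
and would read 4 bytes even for shorter sequences. It only checks the
lead-byte range and the continuation-byte pattern, not overlong or
surrogate encodings.

```c
#include <stddef.h>
#include <stdint.h>

/* Length of a UTF-8 sequence judged from its lead byte alone.
 * Returns 0 if the byte cannot start a sequence. */
size_t utf8_len(unsigned char lead)
{
    if (lead < 0x80) return 1;                  /* 0xxxxxxx */
    if (lead >= 0xC2 && lead <= 0xDF) return 2; /* 110xxxxx */
    if (lead >= 0xE0 && lead <= 0xEF) return 3; /* 1110xxxx */
    if (lead >= 0xF0 && lead <= 0xF4) return 4; /* 11110xxx */
    return 0;                                   /* continuation or invalid */
}

/* Pack the raw bytes of one sequence into *key, first byte most
 * significant. Returns the number of bytes consumed, or 0 if the
 * input is truncated or malformed (*key is then left untouched). */
size_t utf8_key(const unsigned char *s, size_t avail, uint32_t *key)
{
    if (avail == 0) return 0;

    size_t n = utf8_len(s[0]);
    if (n == 0 || n > avail) return 0;

    uint32_t k = s[0];
    for (size_t i = 1; i < n; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;    /* not 10xxxxxx */
        k = (k << 8) | s[i];
    }
    *key = k;
    return n;
}
```

With this packing, the bytes C3 A9 become the key 0xC3A9, and the
highest code point, U+10FFFF (F4 8F BF BF), becomes 0xF48FBFBF.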

Obviously, having a sparse array that is backed by only a fraction of
the nominally requested memory would be crucial, because 4-byte UTF-8
sequences start with a lead byte of 0xF0 to 0xF4, so the resulting
index can come close to 2^32.
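
One way to get such a sparse array in plain C is a lazily allocated
256-way table that is walked one key byte at a time. This is only a
sketch of that technique, not a kernel-backed sparse mapping; the names
struct inner, struct leaf, sparse_put and sparse_get are hypothetical,
and the key is assumed to be the raw UTF-8 bytes packed into a uint32_t
with the first byte most significant.

```c
#include <stdint.h>
#include <stdlib.h>

/* Last level: 256 values plus a flag saying which slots are in use. */
struct leaf {
    uint32_t value[256];
    unsigned char used[256];
};

/* Upper levels: 256 child pointers, NULL until first needed. */
struct inner {
    void *child[256];
};

/* Store value under key, allocating tables on demand.
 * Returns 0 on success, -1 if memory allocation fails. */
int sparse_put(struct inner *root, uint32_t key, uint32_t value)
{
    struct inner *n = root;

    /* Descend through the two most significant bytes. */
    for (int shift = 24; shift >= 16; shift -= 8) {
        unsigned b = (key >> shift) & 0xFF;
        if (n->child[b] == NULL) {
            /* Assumes zeroed memory reads back as NULL pointers,
             * which holds on mainstream platforms. */
            n->child[b] = calloc(1, sizeof(struct inner));
            if (n->child[b] == NULL) return -1;
        }
        n = n->child[b];
    }

    /* The third byte selects the leaf, the fourth the slot in it. */
    unsigned b = (key >> 8) & 0xFF;
    if (n->child[b] == NULL) {
        n->child[b] = calloc(1, sizeof(struct leaf));
        if (n->child[b] == NULL) return -1;
    }
    struct leaf *lf = n->child[b];
    lf->value[key & 0xFF] = value;
    lf->used[key & 0xFF] = 1;
    return 0;
}

/* Look up key; returns a pointer to the stored value, or NULL. */
const uint32_t *sparse_get(const struct inner *root, uint32_t key)
{
    const struct inner *n = root;

    for (int shift = 24; shift >= 16; shift -= 8) {
        n = n->child[(key >> shift) & 0xFF];
        if (n == NULL) return NULL;
    }

    const struct leaf *lf = n->child[(key >> 8) & 0xFF];
    if (lf == NULL || !lf->used[key & 0xFF]) return NULL;
    return &lf->value[key & 0xFF];
}
```

Only the tables along paths that were actually stored get allocated, so
a key such as 0xF48FBFBF costs a few small tables rather than an array
of 2^32 entries. Freeing the tables is left out to keep the sketch short.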

