On Thu, Nov 28, 2013 at 01:24:40PM -0500, Strake wrote:
> [..]
>
> > UTF-32 is an encoding that is identical to the unicode point as far as
> > I know. So what I am thinking is that one would either use the UTF-8
> > representation of the Unicode point as an index, or the unicode point
> > itself. Since using UTF-8 would not require any conversion (on UTF-8
> > locales) I think it would be preferable.
>
> UTF-8 has variable width, so one must find the length of the sequence
> anyhow and shift it bytewise into an integer, so one may as well just
> use fgetwc or the like and work with codepoints.
You are right about the variable width. According to the standard, a
UTF-8 sequence is at most 4 bytes long, which would fit into an int on
most (all?) platforms, so shifting would not be necessary, I think. I am
not too familiar with C, but wouldn't it theoretically be possible to
figure out the length of a UTF-8 sequence, cast only that sequence to an
int, and use it as an index into a sparse array of wchar_t/uint32_t
values? Obviously, a sparse array that is backed by only a fraction of
the nominally requested memory would be crucial, because UTF-8 allows
4-byte sequences with almost all of the most significant bits set.
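To make the idea concrete, here is a rough sketch of what I have in mind. The names utf8_len and utf8_key are made up for illustration and are not from any existing library. I am assuming a uint32_t key rather than a plain int, since C only guarantees 16 bits for int. The sketch also packs the bytes with shifts after all, because as far as I understand a direct pointer cast would depend on alignment and byte order. It only checks the lead byte range and the continuation-byte pattern, not every rule of the standard.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: length in bytes of the UTF-8 sequence that
 * starts with lead byte b, or 0 if b cannot start a sequence. */
static size_t utf8_len(unsigned char b)
{
    if (b < 0x80) return 1;                /* 0xxxxxxx: ASCII */
    if (b >= 0xC2 && b <= 0xDF) return 2;  /* 110xxxxx, skipping overlong C0/C1 */
    if (b >= 0xE0 && b <= 0xEF) return 3;  /* 1110xxxx */
    if (b >= 0xF0 && b <= 0xF4) return 4;  /* 11110xxx, up to U+10FFFF */
    return 0;                              /* continuation or invalid byte */
}

/* Hypothetical helper: pack the raw bytes of one sequence into a 32-bit
 * key, first byte most significant. Returns the number of bytes consumed,
 * or 0 if the input is truncated or malformed. */
static size_t utf8_key(const unsigned char *s, size_t avail, uint32_t *key)
{
    size_t n = avail ? utf8_len(s[0]) : 0;
    uint32_t k = 0;

    if (n == 0 || n > avail)
        return 0;
    for (size_t i = 0; i < n; i++) {
        /* every byte after the first must look like 10xxxxxx */
        if (i > 0 && (s[i] & 0xC0) != 0x80)
            return 0;
        k = (k << 8) | s[i];
    }
    *key = k;
    return n;
}
```

With this packing, the bytes E2 82 AC (the euro sign) give the key 0xE282AC, and the largest possible key is 0xF48FBFBF. So the key space spans nearly the full 32-bit range while only about 1.1 million code points exist, which is why the array would have to be sparse.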