On 28/11/2013, Silvan Jegen <s.je...@gmail.com> wrote:
> On Thu, Nov 28, 2013 at 11:45:33AM -0500, Strake wrote:
>> > (either using UTF-8 or UTF-32 indices), right?
>>
>> I meant Unicodepoints; those are just Unicodecs.
>
> UTF-32 is an encoding that is identical to the unicode point as far as
> I know. So what I am thinking is that one would either use the UTF-8
> representation of the Unicode point as an index, or the unicode point
> itself. Since using UTF-8 would not require any conversion (on UTF-8
> locales) I think it would be preferrable.

UTF-8 is a variable-width encoding, so to use a character as an index one must find the length of its byte sequence anyhow and shift the bytes into an integer. That is already a conversion, so one may as well just use fgetwc or the like and work with codepoints.
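
To make that concrete, here is a rough sketch of the work needed to turn one UTF-8 sequence into an integer. The name utf8_decode is hypothetical (it is not a libc function), and this is only one way to write it, not a reference implementation:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not from any existing library: decode one UTF-8
 * sequence from s (n bytes available) into *cp.
 * Returns the number of bytes consumed, or 0 if the input is malformed
 * or truncated. */
size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *cp)
{
    /* Smallest codepoint that legitimately needs a sequence of each
     * length; anything below it is an overlong encoding. */
    static const uint32_t min[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    size_t len, i;
    uint32_t c;

    assert(s != NULL && cp != NULL);
    if (n == 0)
        return 0;

    /* Step 1: the lead byte gives the length of the sequence. */
    if (s[0] < 0x80)                { len = 1; c = s[0]; }
    else if ((s[0] & 0xE0) == 0xC0) { len = 2; c = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; c = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; c = s[0] & 0x07; }
    else
        return 0; /* stray continuation byte or invalid lead byte */

    if (len > n)
        return 0; /* truncated sequence */

    /* Step 2: shift the 6 payload bits of each continuation byte in. */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;
        c = (c << 6) | (s[i] & 0x3F);
    }

    /* Reject overlong forms, UTF-16 surrogates, and values past U+10FFFF. */
    if (c < min[len] || (c >= 0xD800 && c <= 0xDFFF) || c > 0x10FFFF)
        return 0;

    *cp = c;
    return len;
}
```

Both steps (length, then bytewise shift) are needed before the sequence can serve as an array index at all. With fgetwc the C library does the same job, provided setlocale(LC_CTYPE, "") has selected a UTF-8 locale; whether the resulting wchar_t value is the codepoint itself depends on the platform (it is where __STDC_ISO_10646__ is defined, as on glibc, as far as I know).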