Thanks for the comments!

On Tue, Nov 26, 2013 at 11:40 PM, Thorsten Glaser <t...@mirbsd.de> wrote:
> Strake dixit:
>>On 26/11/2013, Silvan Jegen <s.je...@gmail.com> wrote:
>>> If you you would rather not take this version, what approach would
>>> you take for the character set mapping when using UTF-8?
>>
>>On Linux, one can easily make a sparse array with 1-page granularity
>>with mmap, and so simply use a (wchar_t []) or (Rune []), but I'm not
>>sure how portable this is.
>
> Pretty portable, and 2²¹ * sizeof(wchar_t)/CHAR_BITS is at best 2²⁵
> or 32 MiB, so this would even work.

If I understand correctly, you would use mmap to allocate a sparse memory
area that we could then index into directly (using either UTF-8 or UTF-32
values as indices), right? Since mmap takes a file descriptor argument, I
assume I would need a "typed memory object" to pass to it, which can be
obtained with posix_typed_mem_open:

http://pubs.opengroup.org/onlinepubs/009695399/functions/posix_typed_mem_open.html

Those functions are POSIX, so I would assume they are reasonably portable.

> But common, for Unicode, is to use the planes.
>
> struct {
>         wchar_t foo[0x100];
> } *repl[0x1100];
>
> Do note that sizeof(wchar_t) may be 16, and that the OS’ own
> representation of wchar_t may not be Unicode, so the type would
> be semantically wrong.
>
> You might want to use uint32_t there.

Sadly, I do not follow. I recognize that the lengths of those arrays
multiplied together correspond to the maximum number of Unicode code
points (0x100 * 0x1100 = 1,114,112), but I am not sure how the mapping
(from the UTF-8 or UTF-32 encoding) should be done. Care to enlighten me?
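
To make the question concrete, here is my best guess at how such a
two-level table might be indexed. It is only a sketch and assumes the
input has already been decoded from UTF-8 to a UTF-32 code point: the
upper bits (cp >> 8) pick one of the 0x1100 blocks and the low 8 bits pick
the entry inside that block. The names block, repl_set and repl_get are
made up by me, uint32_t replaces wchar_t as suggested, blocks are
allocated lazily, and I (arbitrarily) treat a stored 0 as "no mapping".
Please correct me if this is not what you meant.

```c
#include <stdint.h>
#include <stdlib.h>

#define BLOCK_BITS 8
#define BLOCK_SIZE (1u << BLOCK_BITS)	/* 0x100 entries per block */
#define NBLOCKS    0x1100u		/* 0x1100 * 0x100 = 0x110000 code points */

struct block {
	uint32_t foo[BLOCK_SIZE];
};

/* One pointer per 256-code-point block; NULL means nothing mapped there. */
static struct block *repl[NBLOCKS];

/* Map code point `from` to `to`. Returns 0 on success, -1 on error. */
static int
repl_set(uint32_t from, uint32_t to)
{
	uint32_t hi = from >> BLOCK_BITS;
	uint32_t lo = from & (BLOCK_SIZE - 1);

	if (hi >= NBLOCKS)
		return -1;	/* beyond U+10FFFF */
	if (repl[hi] == NULL) {
		/* calloc zeroes the block, so every entry starts unmapped */
		repl[hi] = calloc(1, sizeof *repl[hi]);
		if (repl[hi] == NULL)
			return -1;
	}
	repl[hi]->foo[lo] = to;
	return 0;
}

/*
 * Look up `cp`. Returns the mapped code point, or `cp` itself when there
 * is no mapping. Limitation of this sketch: 0 marks an unmapped entry,
 * so nothing can be mapped to U+0000.
 */
static uint32_t
repl_get(uint32_t cp)
{
	uint32_t hi = cp >> BLOCK_BITS;
	uint32_t v;

	if (hi >= NBLOCKS || repl[hi] == NULL)
		return cp;
	v = repl[hi]->foo[cp & (BLOCK_SIZE - 1)];
	return v != 0 ? v : cp;
}
```

With this, after repl_set(0xE4, 0x61) a call to repl_get(0xE4) returns
0x61, while repl_get(0x1F600) returns 0x1F600 unchanged, and only a single
block of 0x100 entries has been allocated.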