On Thu, Nov 28, 2013 at 07:01:17PM +0000, Thorsten Glaser wrote:
> Silvan Jegen dixit:
>
> >If I understand correctly you would use mmap to allocate a sparse
> >memory area into which we could then directly index (either using
> >UTF-8 or UTF-32 indices), right? Since mmap needs a file descriptor
>
> I think that wouldn’t help much.

Intuitively I would say it should help quite a lot, because we usually
do not map more than a few characters using tr (well, at least I do not)
and thus have a very sparsely populated memory area. Implementing an
mmap and a non-mmap version of the code and comparing the memory usage
should not be too hard to do, however.

> >Sadly, I do not follow. I recognize that the lengths of those arrays
> >multiplied correspond to the maximum number of Unicode code points
> >(1,114,112) but I am not sure how the mapping (from UTF-8 or UTF-32
> >encoding) should be done. Care to enlighten me?
>
> Eh, &0xFF and >>8?

Bear with me for a moment, I am not used to bit twiddling :-)

So your suggestion is to convert the UTF-8 to the Unicode code point
(aka UTF-32) and use its value, shifted right by 8 bits (>>8), as an
index into an array of pointers to 256-member arrays of wchar_t's (or
uint32_t's). The least significant byte of the UTF-32 encoded code
point can then be extracted with a bitwise AND with 0xFF and used as an
index into the uint32_t/wchar_t array itself.

That sounds reasonable, but it requires that we convert UTF-8 to UTF-32,
which should not be strictly necessary when we only map one UTF-8 value
to another. I wonder whether there is an easy solution that would not
necessitate that conversion, but this may just be a premature
optimization...
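
To check my understanding, the two-level lookup described above could
look roughly like the sketch below. The names (tr_pages, tr_map_set,
tr_map_get and the TR_* constants) are made up for illustration and are
not taken from any existing code. It assumes the input has already been
decoded to code points, and that pages are allocated lazily and start
out as the identity mapping, which is my assumption and not something
stated in the thread.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define TR_NCODEPOINTS 0x110000u             /* 1,114,112 code points */
#define TR_PAGESZ      256u                  /* entries per page: cp & 0xFF */
#define TR_NPAGES      (TR_NCODEPOINTS >> 8) /* 4352 page pointers: cp >> 8 */

/* Top-level array; a NULL entry means "nothing mapped in this page". */
static uint32_t *tr_pages[TR_NPAGES];

/* Map code point `from` to `to`. Returns 0 on success, -1 on error. */
int
tr_map_set(uint32_t from, uint32_t to)
{
	uint32_t hi = from >> 8, lo = from & 0xFF, i;

	/* The two array lengths multiplied cover every code point. */
	assert(TR_NPAGES * TR_PAGESZ == TR_NCODEPOINTS);

	if (from >= TR_NCODEPOINTS || to >= TR_NCODEPOINTS)
		return -1;
	if (tr_pages[hi] == NULL) {
		tr_pages[hi] = malloc(TR_PAGESZ * sizeof(uint32_t));
		if (tr_pages[hi] == NULL)
			return -1;
		/* A fresh page maps every code point to itself. */
		for (i = 0; i < TR_PAGESZ; i++)
			tr_pages[hi][i] = (hi << 8) | i;
	}
	tr_pages[hi][lo] = to;
	return 0;
}

/* Look up a mapping; unmapped code points are returned unchanged. */
uint32_t
tr_map_get(uint32_t cp)
{
	if (cp >= TR_NCODEPOINTS || tr_pages[cp >> 8] == NULL)
		return cp;
	return tr_pages[cp >> 8][cp & 0xFF];
}
```

With 4-byte entries each populated page costs 1 KiB, and the top-level
array is 4352 pointers (about 34 KiB with 8-byte pointers), so mapping
only a handful of characters stays cheap without needing mmap.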