On Wed, Mar 16, 2011 at 5:39 AM, Mike Gran <spk...@yahoo.com> wrote: >> From:Mark H Weaver <m...@netris.org> >> >> Mike Gran <spk...@yahoo.com> writes: >> > We do, in a matter of speaking, have a single string representation: >> > UTF-32. The 'narrow' encoding is UTF-32 with the initial 3 bytes >> of >> > zero removed. >> >> Despite the similarity of these two representations, they are >> sufficiently different that they cannot be handled by the same machine >> code. That means you must either implement multiple inner loops, one >> for each combination of string parameter representations, or else you >> must dispatch on the string representation within the inner loop. On >> modern architectures, wrongly predicted conditional branches are very >> expensive. > > Keep in mind that the UTF-8 forward iterator operation has conditional > branches. Merely the act of advancing from one character to another > could take one of four paths, or more if you include the possibility > of invalid UTF-8 sequences.
No, technically you don't need any branching: /* first-byte lookup table encoded as an integer */ #define magic 3841982464uL /* read a utf8 char and advance the pointer */ int read_utf8_char(char **pp) { char *p = *pp; int tmp, len, res = (unsigned char)*p++; tmp = (res>>7); /* tmp is 0 if len is 1, 1 otherwise */ len = (res>>4)*2; len = tmp*(((magic&(3<<len))>>len)+1) + (1-tmp); res = ((res<<len)&255)>>len; res = (tmp*((res<<6)+(63&*p))) + (1-tmp)*res; p += tmp; tmp = (len-1)>>1; /* tmp is 1 if len is 3,4 0 otherwise */ res = (tmp*((res<<6)+(63&*p))) + (1-tmp)*res; p += tmp; tmp = len>>2; /* tmp is 1 if len is 4, 0 otherwise */ res = (tmp*((res<<6)+(63&*p))) + (1-tmp)*res; p += tmp; *pp = p; return res; } It turns out this isn't worth it on x86 architectures. Branch prediction is very fast. Either way, the overhead tends to be dwarfed by whatever it is you're doing _with_ the char. -- Alex