Bruno Haible <bruno <at> clisp.org> writes:
>
> http://www.daemonology.net/blog/2008-06-05-faster-utf8-strlen.html
>
> But before these techniques can be used in practice in packages such as
> coreutils, two problems would have to be solved satisfactorily:
>
> 1) "George Pollard makes the assumption that the input string is valid UTF-8".
>    This assumption cannot be upheld, as long as you use the same type
>    ('char *') for UTF-8 encoded strings and normal C strings, or when
>    you occasionally convert between one and the other.

Agreed.

>    For example: Assume NAME is really a valid UTF-8 string.
>    A program then does
>
>      static char buf[20];
>      snprintf (buf, sizeof buf, "%s", NAME);
>      utf8_strlen (buf);
>
>    Boing! You already have a buffer overrun:

Disagreed.  Reread Colin Percival's vectorized algorithm - he intentionally
checks for NUL before counting non-leading UTF-8 bytes.  Yes, if the char*
contents are not valid UTF-8, the final count will be garbage.  But snprintf
guarantees a NUL, and the vectorized counter guarantees stopping at NUL; so
the garbage is bounded: no greater than the number of bytes, and no less
than the number of valid characters.

> 2) We already have the problem that we want to keep good performance when
>    handling strings in the "C" locale or, more generally, in a unibyte locale.
>    So we get code duplication:
>      - code for unibyte locales,
>      - code for multibyte locales that uses mbrtowc().
>    If you want to optimize UTF-8 locales particularly, i.e. optimize away
>    the function calls inherent in mbrtowc(), then we get code triplication:
>      - code for unibyte locales,
>      - code for UTF-8 locales,
>      - code for multibyte locales other than UTF-8, that uses mbrtowc().
>    So, code size increases, and the testing requirements increase as well.

Unfortunately true.  But UTF-8 is such a common and special case that the
benefits may outweigh the cost of duplication, especially if we can factor it
well (you've already shown a factorization for writing one loop that can be
used for unibyte and multibyte by merely swapping which header you include
when compiling the loop).

-- 
Eric Blake

_______________________________________________
Bug-coreutils mailing list
Bug-coreutils@gnu.org
http://lists.gnu.org/mailman/listinfo/bug-coreutils
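The bounded-garbage argument in the reply to point 1 can be illustrated with
a minimal sketch.  This is a plain byte-at-a-time stand-in, not Colin
Percival's vectorized code; the names utf8_count and count_after_truncation
are hypothetical and chosen only for this example.  The counter stops at the
first NUL and counts every byte that is not a continuation byte (10xxxxxx),
so on a truncated string it can over-count but never reads past the
terminator.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical scalar counter: count bytes that are NOT UTF-8
   continuation bytes (bit pattern 10xxxxxx), stopping at the first NUL. */
size_t utf8_count(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    return n;
}

/* Copy NAME into a 20-byte buffer, as in the quoted example, and count.
   snprintf with a nonzero size always NUL-terminates, possibly cutting a
   multibyte character in half.  The number of bytes actually stored is
   returned through *bytes. */
size_t count_after_truncation(const char *name, size_t *bytes)
{
    char buf[20];
    snprintf(buf, sizeof buf, "%s", name);
    *bytes = strlen(buf);
    return utf8_count(buf);
}
```

For example, seven copies of U+20AC (three bytes each, 21 bytes in total) do
not fit: snprintf stores 19 bytes, i.e. six complete characters plus one
dangling lead byte.  The counter then reports 7, which lies between the 6
valid characters and the 19 bytes.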