On Tue, Dec 13, 2011 at 11:32:53AM +0100, Bruno Haible wrote:
> 2) The wcwidth change is a good idea, but unfortunately is not multithread-
> safe. Different threads can have different locales, therefore a global
> variable as a cache won't lead to correct results always.

Fortunately charset.alias is not re-read every time wcwidth is called. ;-)
Are there any real programs which use different locales in different threads?

> I'm attaching the benchmark program I'm experimenting with. So far, it seems
> that locale_charset() is really slow, whereas the is_cjk stuff is not a big
> speed problem.

is_cjk_encoding() is in second place after locale_charset(). locale_charset()
is slow because of its linear search through the locale aliases.
Unfortunately, I don't know how to optimize it in a thread-safe way without
heavy artillery like thread-local storage.

> > Besides, uc_width is used in wcwidth for cjk encodings as designed.
>
> -  if (STREQ (encoding, "UTF-8", 'U', 'T', 'F', '-', '8', 0, 0, 0 ,0))
> +  if (cached_is_utf8_encoding || cached_is_cjk_encoding)
>      {
>        /* We assume that in a UTF-8 locale, a wide character is the same as a
>           Unicode character.  */
> -      return uc_width (wc, encoding);
> +      return uc_width (wc, cached_is_cjk_encoding);
>      }
>
> This won't work portably: The comment says that only in UTF-8 locales we know
> that a wchar_t represents a Unicode character. In locales with encodings
> such as EUC-JP or GB18030 you cannot assume anything about how libc has
> defined the wchar_t values.

This means it is possible to avoid calling is_cjk_encoding() at all: uc_width
uses the encoding only for the CJK check, and wcwidth calls uc_width only in
the UTF-8 case (which is not a CJK encoding).

--
Alexander.
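
P.S. To make the last point concrete, here is a minimal sketch. All names in
it (demo_is_cjk_encoding, demo_width, demo_uc_width, demo_wcwidth_utf8,
cjk_checks) are hypothetical stand-ins, and the width rule is a toy, not
gnulib's real table. It only illustrates that, if wcwidth reaches uc_width
solely in the UTF-8 case, the CJK flag is a constant 0 and the encoding-name
comparison can be skipped.

```c
#include <string.h>

/* Hypothetical stand-ins for illustration; not gnulib's functions or data. */

/* Counts how often the encoding name had to be compared. */
unsigned long cjk_checks = 0;

/* Stand-in for is_cjk_encoding(): string comparisons on every call. */
int
demo_is_cjk_encoding (const char *encoding)
{
  cjk_checks++;
  return strcmp (encoding, "EUC-JP") == 0
         || strcmp (encoding, "EUC-KR") == 0
         || strcmp (encoding, "GB2312") == 0
         || strcmp (encoding, "BIG5") == 0;
}

/* Toy width rule (ignores control and combining characters):
   the CJK flag matters only for East Asian Ambiguous characters.  */
int
demo_width (unsigned int uc, int cjk)
{
  if (uc >= 0x4E00 && uc <= 0x9FFF)  /* CJK unified ideographs: always wide */
    return 2;
  if (uc == 0x00A1)                  /* one East Asian Ambiguous character */
    return cjk ? 2 : 1;
  return 1;
}

/* Shape of the existing call: the encoding is inspected on each call. */
int
demo_uc_width (unsigned int uc, const char *encoding)
{
  return demo_width (uc, demo_is_cjk_encoding (encoding));
}

/* Shape of the shortcut for the UTF-8 branch of wcwidth:
   UTF-8 is not a CJK encoding, so the flag is a constant 0.  */
int
demo_wcwidth_utf8 (unsigned int wc)
{
  return demo_width (wc, 0);
}
```

For a UTF-8 locale both paths return the same width, but only
demo_uc_width pays for the name comparison on every call.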