Re: uc_width and wcwidth optimization

Alexander V. Lukyanov Wed, 14 Dec 2011 02:14:47 -0800

On Tue, Dec 13, 2011 at 11:32:53AM +0100, Bruno Haible wrote:
>   2) The wcwidth change is a good idea, but unfortunately is not multithread-
>      safe. Different threads can have different locales, therefore a global
>      variable as a cache won't lead to correct results always.


Fortunately charset.alias is not re-read every time wcwidth is called. ;-)

Are there any real programs which use different locales in threads?

> I'm attaching the benchmark program I'm experimenting with. So far, it seems
> that locale_charset() is really slow, whereas the is_cjk stuff is not a big
> speed problem.

is_cjk_encoding() is in second place, after locale_charset().

locale_charset() is slow because it does a linear search through the
locale aliases.

Unfortunately, I don't know how to optimize it to be thread-safe without
heavy artillery like thread-local storage.
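
For illustration, here is a minimal sketch of what a thread-local cache
might look like. It assumes C11 _Thread_local and POSIX nl_langinfo(); the
names is_utf8_locale_cached() and charset_cache_invalidate() are
hypothetical and do not exist in gnulib. The sketch also assumes the caller
invalidates the cache whenever the thread's locale changes, which is
exactly the part that is hard to guarantee in a library.

```c
#include <assert.h>
#include <langinfo.h>
#include <string.h>

/* Hypothetical per-thread cache; each thread gets its own copy. */
static _Thread_local int cache_valid = 0;
static _Thread_local int cache_is_utf8 = 0;

/* Hypothetical hook: must be called after this thread changes its locale,
   otherwise the cached answer goes stale. */
static void charset_cache_invalidate(void)
{
  cache_valid = 0;
}

/* Returns nonzero if the current locale's codeset is named "UTF-8".
   The lookup is done once per thread until the cache is invalidated. */
static int is_utf8_locale_cached(void)
{
  if (!cache_valid)
    {
      const char *codeset = nl_langinfo(CODESET);
      assert(codeset != NULL);
      cache_is_utf8 = (strcmp(codeset, "UTF-8") == 0);
      cache_valid = 1;
    }
  return cache_is_utf8;
}
```

Without a reliable way to detect locale changes, such a cache only moves
the problem from "wrong across threads" to "stale within a thread".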

> > Besides, uc_width is used in wcwidth for cjk encodings as designed.
> 
> -  if (STREQ (encoding, "UTF-8", 'U', 'T', 'F', '-', '8', 0, 0, 0 ,0))
> +  if (cached_is_utf8_encoding || cached_is_cjk_encoding)
>      {
>        /* We assume that in a UTF-8 locale, a wide character is the same as a
>           Unicode character.  */
> -      return uc_width (wc, encoding);
> +      return uc_width (wc, cached_is_cjk_encoding);
>      }
> 
> This won't work portably: The comment says that only in UTF-8 locales we know
> that a wchar_t represents a Unicode character. In locales with encodings
> such as EUC-JP or GB18030 you cannot assume anything about how libc has
> defined the wchar_t values.

That means it is possible to avoid calling is_cjk_encoding() at all:
uc_width uses the encoding only for the CJK check, and wcwidth calls
uc_width only in the UTF-8 case (which is not a CJK encoding).
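
A rough sketch of that simplification, using a toy model rather than the
real gnulib code: toy_uc_width() is a hypothetical stand-in for uc_width
that takes a precomputed is_cjk flag, and sketch_wcwidth() is a
hypothetical wrapper in which fallback_width models whatever the system
wcwidth would return. The width rules in the toy function are simplified
assumptions, there only to make the CJK flag observable.

```c
#include <assert.h>
#include <string.h>

/* Toy stand-in for uc_width(): NOT gnulib's implementation.
   Only enough rules to show the effect of the is_cjk flag. */
static int toy_uc_width(unsigned int uc, int is_cjk)
{
  /* Hangul Jamo initial consonants are wide in any encoding. */
  if (uc >= 0x1100 && uc <= 0x115F)
    return 2;
  /* Assumed rule: in legacy CJK encodings most non-ASCII characters
     are rendered double-width. */
  if (is_cjk && uc >= 0x00A1 && uc < 0xFF61)
    return 2;
  return 1;
}

/* Hypothetical wrapper.  The Unicode width table is consulted only when
   the encoding is UTF-8, so the CJK flag is a constant 0 on that path
   and no CJK check is needed. */
static int sketch_wcwidth(unsigned int wc, const char *encoding,
                          int fallback_width)
{
  assert(encoding != NULL);
  if (strcmp(encoding, "UTF-8") == 0)
    return toy_uc_width(wc, 0);
  /* Other encodings: wchar_t values are not known to be Unicode,
     so defer to the system's answer. */
  return fallback_width;
}
```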

-- 
   Alexander.
