Re: mbswidth "failure" on Solaris

Kiyoshi KANAZAWA Sun, 05 May 2019 09:01:28 -0700

Hello,

I confirmed that
bison test 127. Tabulations and multibyte characters (for Maxwell's equations)
passed with the patch for m4/wcwidth.m4.


Regards,

--- Kiyoshi




----- Original Message -----
> From: Bruno Haible <br...@clisp.org>
> To: bug-gnulib@gnu.org
> Cc: Akim Demaille <akim.demai...@gmail.com>; Kiyoshi KANAZAWA 
> <yoi_no_myou...@yahoo.co.jp>
> Date: 2019/5/5, Sun 20:35
> Subject: Re: mbswidth "failure" on Solaris
> 
> Hi,
> 
>>  >     15 | e: {∇⃗×𝐸⃗ = -∂𝐵⃗/∂t}
>>  > -      |    ^~~~~~~~~~~~~~
>>  > +      |    ^~~~~~~~~~~~~~~~~
> 
> Indeed, mbswidth seems to have returned 3 more columns.
> 
>>  The error (three more columns than expected) seems to indicate something
>>  related to the combining arrow.
> 
> No. The issue comes from the math symbols. The following test program shows
> it:
> 
> #include <config.h>
> #include <stdio.h>
> #include <locale.h>
> #include <wchar.h>
> #include "mbswidth.h"
> int main ()
> {
>   setlocale (LC_ALL, "en_US.UTF-8");
>   printf ("%d\n", (int) mbswidth ("{∇⃗×𝐸⃗ = -∂𝐵⃗/∂t}", 0)); // 14 vs. 17
>   printf ("%d\n", wcwidth (0x2207)); // 1 vs. 2
>   printf ("%d\n", wcwidth (0x20D7)); // 0
>   printf ("%d\n", wcwidth (0x00D7)); // 1
>   printf ("%d\n", wcwidth (0x1D438)); // 1
>   printf ("%d\n", wcwidth (0x2202)); // 1 vs. 2
>   printf ("%d\n", wcwidth (0x1D435)); // 1
> }
> 
> The following patch should fix it.
> 
> The patch changes the behaviour of wcwidth(0x2202) for UTF-8 locales.
> It would be possible to limit the change to the non-East-Asian UTF-8
> locales (by using the function uc_locale_language() and testing
> whether its result is not one of "zh", "ja", "ko"), but glibc does not
> do this (it uses the same width across all UTF-8 locales), therefore
> I'm not doing it here either.
> 
> 
> 2019-05-05  Bruno Haible  <br...@clisp.org>
> 
>     wcwidth: Ensure width 1, not 2, for ambiguous characters.
>     Reported by Kiyoshi KANAZAWA <yoi_no_myou...@yahoo.co.jp>
>     via Akim Demaille <akim.demai...@gmail.com>.
>     * m4/wcwidth.m4 (gl_FUNC_WCWIDTH): Check the width of U+2202. Use an
>     en_US.UTF-8 locale, since that is more likely to be present than an
>     fr_FR.UTF-8 locale.
>     * tests/test-wcwidth.c (main): Check the width of U+2202.
>     * doc/posix-functions/wcwidth.texi: Mention the issue.
> 
> diff --git a/m4/wcwidth.m4 b/m4/wcwidth.m4
> index 3952fd2..e9b5bf4 100644
> --- a/m4/wcwidth.m4
> +++ b/m4/wcwidth.m4
> @@ -1,4 +1,4 @@
> -# wcwidth.m4 serial 28
> +# wcwidth.m4 serial 29
> dnl Copyright (C) 2006-2019 Free Software Foundation, Inc.
> dnl This file is free software; the Free Software Foundation
> dnl gives unlimited permission to copy and/or distribute it,
> @@ -54,6 +54,8 @@ AC_DEFUN([gl_FUNC_WCWIDTH],
>      dnl On OSF/1 5.1, wcwidth(0x200B) (ZERO WIDTH SPACE) returns 1.
>      dnl On OpenBSD 5.8, wcwidth(0xFF1A) (FULLWIDTH COLON) returns 0.
>      dnl This leads to bugs in 'ls' (coreutils).
> +    dnl On Solaris 11.4, wcwidth(0x2202) (PARTIAL DIFFERENTIAL) returns 2,
> +    dnl even in Western locales.
>      AC_CACHE_CHECK([whether wcwidth works reasonably in UTF-8 locales],
>        [gl_cv_func_wcwidth_works],
>        [
> @@ -80,7 +82,7 @@ int wcwidth (int);
> int main ()
> {
>    int result = 0;
> -  if (setlocale (LC_ALL, "fr_FR.UTF-8") != NULL)
> +  if (setlocale (LC_ALL, "en_US.UTF-8") != NULL)
>      {
>        if (wcwidth (0x0301) > 0)
>          result |= 1;
> @@ -90,6 +92,8 @@ int main ()
>          result |= 4;
>        if (wcwidth (0xFF1A) == 0)
>          result |= 8;
> +      if (wcwidth (0x2202) > 1)
> +        result |= 16;
>      }
>    return result;
> }]])],
> diff --git a/tests/test-wcwidth.c b/tests/test-wcwidth.c
> index eb7bdd2..8e9cea3 100644
> --- a/tests/test-wcwidth.c
> +++ b/tests/test-wcwidth.c
> @@ -72,6 +72,22 @@ main ()
>        ASSERT (wcwidth (0x200B) == 0);
>        ASSERT (wcwidth (0xFEFF) <= 0);
> 
> +      /* Test width of some math symbols.
> +         U+2202 is marked as having ambiguous width (A) in EastAsianWidth.txt
> +         (see <https://www.unicode.org/Public/12.0.0/ucd/EastAsianWidth.txt>).
> +         The Unicode Standard Annex 11
> +         <https://www.unicode.org/reports/tr11/tr11-36.html>
> +         says
> +           "Ambiguous characters behave like wide or narrow characters
> +            depending on the context (language tag, script identification,
> +            associated font, source of data, or explicit markup; all can
> +            provide the context). If the context cannot be established
> +            reliably, they should be treated as narrow characters by default."
> +         For wcwidth(), the only available context information is the locale.
> +         "fr_FR.UTF-8" is a Western locale, not an East Asian locale, therefore
> +         U+2202 should be treated like a narrow character.  */
> +      ASSERT (wcwidth (0x2202) == 1);
> +
>        /* Test width of some CJK characters.  */
>        ASSERT (wcwidth (0x3000) == 2);
>        ASSERT (wcwidth (0xB250) == 2);
> diff --git a/doc/posix-functions/wcwidth.texi b/doc/posix-functions/wcwidth.texi
> index 741be8e..ecdf758 100644
> --- a/doc/posix-functions/wcwidth.texi
> +++ b/doc/posix-functions/wcwidth.texi
> @@ -18,6 +18,10 @@ glibc 2.8.
> This function handles combining characters in UTF-8 locales incorrectly on some
> platforms:
> Mac OS X 10.3, OpenBSD 5.8.
> +@item
> +This function returns 2 for characters with ambiguous east asian width, even in
> +Western locales, on some platforms:
> +Solaris 11.4.
> @end itemize
> 
> Portability problems not fixed by Gnulib:
>
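
For reference, the alternative mentioned in the quoted message (limiting the
change to non-East-Asian UTF-8 locales) could look roughly like the sketch
below. This is not part of the patch, and the patch deliberately does not do
this. The helper names is_east_asian_language() and ambiguous_char_width() are
hypothetical; the language code is passed in as a plain string, on the
assumption that a caller would obtain it from something like
uc_locale_language().

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper (not in the patch): returns 1 if LANG is one of the
   language codes named in the quoted message ("zh", "ja", "ko"), else 0.  */
static int
is_east_asian_language (const char *lang)
{
  assert (lang != NULL);
  return strcmp (lang, "zh") == 0
         || strcmp (lang, "ja") == 0
         || strcmp (lang, "ko") == 0;
}

/* Hypothetical helper (not in the patch): column width to assume for a
   character with ambiguous East Asian width, such as U+2202.
   Wide (2) for East Asian languages, narrow (1) for everything else.  */
static int
ambiguous_char_width (const char *lang)
{
  return is_east_asian_language (lang) ? 2 : 1;
}
```

With this sketch, ambiguous_char_width ("en") gives 1 and
ambiguous_char_width ("ja") gives 2. The patch instead returns 1 for U+2202 in
every UTF-8 locale, matching the glibc behaviour described above.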
