Re: From wchar_t to char32_t

Bruno Haible Thu, 06 Jul 2023 11:35:43 -0700

Paul Eggert wrote:
> I still see a couple of problems with it. First, it mishandles the case 
> where mbrtoc32 returns 0, which ISO C allows.


I thought that we could assume that no locale encoding maps a multibyte
sequence other than "\0" to (char32_t) 0. But OK, if you don't want to
assume that, it's easy to not assume it.

> Second and more interestingly, its "fwrite (tp0, 1, bytes, out);" could 
> output a byte string that represents multiple characters where the first 
> character fits in the output column width but the remaining characters 
> do not, and this would exceed the output column width.

This is a hypothetical scenario, because in most cases, the several
Unicode characters that come out of a multibyte sequence consist of a
base character (of width > 0) and one or more non-spacing marks
(of width == 0). But OK, let's make the minimum possible number of
assumptions...

> I suppose one could fix the second problem by not outputting such a byte 
> string, just as the code already suppresses the output of a byte string 
> representing a single character that occupies multiple columns 
> straddling the column border. That is, count all the columns of all the 
> characters that the byte string represents, before deciding whether to 
> output the byte string.

Indeed, this is the solution that makes no assumptions. Find a patch that
does it.

I had expected that the replacement mbrtowc -> mbrtoc32 would be purely
mechanical; I'm surprised that it requires application-specific considerations
here.

Also find attached a unit test. Just to verify that my change and future changes
make no gross mistake.

Bruno

From 3459a2b54a52152020d9208a9fc2a9efaf34897d Mon Sep 17 00:00:00 2001
From: Bruno Haible <br...@clisp.org>
Date: Thu, 6 Jul 2023 20:20:44 +0200
Subject: [PATCH 1/2] tests: Add a side-by-side output test

* tests/side-by-side: New file.
* tests/Makefile.am (TESTS): Add it.
---
 tests/Makefile.am  |  1 +
 tests/side-by-side | 46 ++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 47 insertions(+)
 create mode 100755 tests/side-by-side

diff --git a/tests/Makefile.am b/tests/Makefile.am
index 75d8f06..457c61e 100644
--- a/tests/Makefile.am
+++ b/tests/Makefile.am
@@ -22,6 +22,7 @@ TESTS = \
   new-file \
   no-dereference \
   no-newline-at-eof \
+  side-by-side \
   stdin \
   strcoll-0-names \
   filename-quoting \
diff --git a/tests/side-by-side b/tests/side-by-side
new file mode 100755
index 0000000..2c718d3
--- /dev/null
+++ b/tests/side-by-side
@@ -0,0 +1,46 @@
+#!/bin/sh
+# Test side-by-side output with non-ASCII characters.
+
+. "${srcdir=.}/init.sh"; path_prepend_ ../src
+
+fail=0
+
+LC_ALL=en_US.UTF-8
+export LC_ALL
+test "`locale charmap 2>/dev/null`" = UTF-8 \
+  || skip_ "Locale $LC_ALL does not exist"
+
+cat >in1 <<\EOF || framework_failure_
+a
+ab
+abc
+[一天早上，當格雷戈爾·薩姆沙從不安的夢中醒來時，
+ 他發現自己在床上變成了一隻可怕的害蟲。
+]
+EOF
+
+cat >in2 <<\EOF || framework_failure_
+ab
+abcd
+<一天早上，當格雷戈爾·薩姆沙從不安的夢中醒來時，
+ 他發現自己在床上變成了一隻可怕的害蟲。
+>
+EOF
+
+cat >exp <<'EOF' || framework_failure_
+a                    <
+ab                      ab
+abc                  |  abcd
+[一天早上，當格雷戈  |  <一天早上，當格雷戈
+ 他發現自己在床上變      他發現自己在床上變
+]                    |  >
+EOF
+
+# We use option --expand-tabs, because the use of tabs hides horizontal
+# alignment bugs in some situations.
+
+returns_ 1 diff --expand-tabs -y -W 44 in1 in2 >out 2>err || fail=1
+compare exp out || fail=1
+compare /dev/null err || fail=1
+
+Exit $fail
-- 
2.34.1

>From 5d1878f25281f4b424af3f4d10fa8bdf9a66f3e3 Mon Sep 17 00:00:00 2001
From: Bruno Haible <br...@clisp.org>
Date: Thu, 6 Jul 2023 20:21:14 +0200
Subject: [PATCH 2/2] diff: Improve handling of mbrtoc32 result

* src/side.c (print_half_line): When mbrtoc32 has left the mbstate not
in the initial state, continue calling mbrtoc32.
---
 src/side.c | 90 ++++++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 70 insertions(+), 20 deletions(-)

diff --git a/src/side.c b/src/side.c
index bf0e74b..840d199 100644
--- a/src/side.c
+++ b/src/side.c
@@ -141,34 +141,84 @@ print_half_line (char const *const *line, intmax_t indent, intmax_t out_bound)
           break;
 
         default:
+          /* Invariant: mbstate is in the initial state here.  */
           {
-            char32_t wc;
-            size_t bytes = mbrtoc32 (&wc, tp0, text_limit - tp0, &mbstate);
+            char const *tp1 = tp0;
+            /* Sum of the widths of the 32-bit wide characters between
+               TP0 and TP1.  */
+            int width_sum = 0;
 
-	    if (bytes <= MB_LEN_MAX)
+            for (;;)
               {
-                int width = c32width (wc);
-		if (0 < width && ckd_add (&in_position, in_position, width))
-		  return out_position;
-                if (in_position <= out_bound)
+                /* Invariant: The multibyte sequences between TP0 and TP1
+                   have already been parsed, but not yet been output to OUT
+                   (because whether we can output it or not depends on the
+                   total width, and we can only determine the total width
+                   once we have arrived at a point where MBSTATE is in the
+                   initial state).  */
+                char32_t wc;
+                size_t bytes = mbrtoc32 (&wc, tp1, text_limit - tp1, &mbstate);
+
+                if (bytes < (size_t) -2)
                   {
-                    out_position = in_position;
-                    fwrite (tp0, 1, bytes, out);
+                    /* mbrtoc32 has produced a (or another) 32-bit wide
+                       character.  Add its width to WIDTH_SUM.  */
+                    int width = c32width (wc);
+                    if (width > 0)
+                      if (ckd_add (&width_sum, width_sum, width))
+                        return out_position;
+                    /* Advance TP1.  */
+                    if (bytes != (size_t) -3)
+                      tp1 += (bytes == 0 ? 1 : bytes);
                   }
-                text_pointer = tp0 + bytes;
 
-		/* Resume scanning for single-byte characters, as
-		   shift states are not supported.  */
-                break;
+                if (bytes >= (size_t) -2 || mbsinit (&mbstate))
+                  {
+                    /* Now see whether we have room for WIDTH_SUM columns,
+                       and if so, output the multibyte sequences between
+                       TP0 and TP1.  */
+                    if (tp0 < tp1)
+                      {
+                        if (ckd_add (&in_position, in_position, width_sum))
+                          return out_position;
+                        if (in_position <= out_bound)
+                          {
+                            out_position = in_position;
+                            fwrite (tp0, 1, tp1 - tp0, out);
+                          }
+                      }
+                    tp0 = tp1;
+
+                    if (bytes >= (size_t) -2)
+                      {
+                        /* An encoding error (bytes == (size_t) -1), as
+                           (size_t) -2 cannot happen as the buffer ends
+                           in '\n'.  */
+                        if (tp0 < text_limit)
+                          {
+                            /* Consume one byte.  Assume it has
+                               print width 1.  */
+                            if (ckd_add (&in_position, in_position, 1))
+                              return out_position;
+                            if (in_position <= out_bound)
+                              {
+                                out_position = in_position;
+                                putc (*tp0, out);
+                              }
+                            tp0++;
+                          }
+                        memset (&mbstate, '\0', sizeof mbstate);
+                      }
+
+                    /* We are done with this multibyte sequence.
+                       Return to the simpler processing in the outer loop.  */
+                    break;
+                  }
               }
           }
-
-	  /* An encoding error (bytes == (size_t) -1),
-	     as (size_t) -2 cannot happen as the buffer ends in '\n',
-	     and (size_t) -3 cannot happen on any known platform.
-	     Reset, and assume the error has print width 1.  */
-	  memset (&mbstate, 0, sizeof mbstate);
-          FALLTHROUGH;
+          /* Invariant: mbstate is in the initial state here again.  */
+          text_pointer = tp0;
+          break;
 
         /* Print width 1.  */
         case ' ': case '!': case '"': case '#': case '$': case '%':
-- 
2.34.1

Re: From wchar_t to char32_t

Reply via email to