Paul Eggert wrote: > I still see a couple of problems with it. First, it mishandles the case > where mbrtoc32 returns 0, which ISO C allows.
I thought that we could assume that no locale encoding maps a multibyte sequence other than "\0" to (char32_t) 0. But OK, if you don't want to assume that, it's easy to not assume it. > Second and more interestingly, its "fwrite (tp0, 1, bytes, out);" could > output a byte string that represents multiple characters where the first > character fits in the output column width but the remaining characters > do not, and this would exceed the output column width. This is a hypothetical scenario, because in most cases, the several Unicode characters that come out of a multibyte sequence consist of a base character (of width > 0) and one or more non-spacing marks (of width == 0). But OK, let's make the minimum possible number of assumptions... > I suppose one could fix the second problem by not outputting such a byte > string, just as the code already suppresses the output of a byte string > representing a single character that occupies multiple columns > straddling the column border. That is, count all the columns of all the > characters that the byte string represents, before deciding whether to > output the byte string. Indeed, this is the solution that makes no assumptions. Find a patch that does it. I had expected that the replacement mbrtowc -> mbrtoc32 would be purely mechanical; I'm surprised that it requires application-specific considerations here. Also find attached a unit test. Just to verify that my change and future changes make no gross mistake. Bruno
From 3459a2b54a52152020d9208a9fc2a9efaf34897d Mon Sep 17 00:00:00 2001 From: Bruno Haible <br...@clisp.org> Date: Thu, 6 Jul 2023 20:20:44 +0200 Subject: [PATCH 1/2] tests: Add a side-by-side output test * tests/side-by-side: New file. * tests/Makefile.am (TESTS): Add it. --- tests/Makefile.am | 1 + tests/side-by-side | 46 ++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 47 insertions(+) create mode 100755 tests/side-by-side diff --git a/tests/Makefile.am b/tests/Makefile.am index 75d8f06..457c61e 100644 --- a/tests/Makefile.am +++ b/tests/Makefile.am @@ -22,6 +22,7 @@ TESTS = \ new-file \ no-dereference \ no-newline-at-eof \ + side-by-side \ stdin \ strcoll-0-names \ filename-quoting \ diff --git a/tests/side-by-side b/tests/side-by-side new file mode 100755 index 0000000..2c718d3 --- /dev/null +++ b/tests/side-by-side @@ -0,0 +1,46 @@ +#!/bin/sh +# Test side-by-side output with non-ASCII characters. + +. "${srcdir=.}/init.sh"; path_prepend_ ../src + +fail=0 + +LC_ALL=en_US.UTF-8 +export LC_ALL +test "`locale charmap 2>/dev/null`" = UTF-8 \ + || skip_ "Locale $LC_ALL does not exist" + +cat >in1 <<\EOF || framework_failure_ +a +ab +abc +[一天早上,當格雷戈爾·薩姆沙從不安的夢中醒來時, + 他發現自己在床上變成了一隻可怕的害蟲。 +] +EOF + +cat >in2 <<\EOF || framework_failure_ +ab +abcd +<一天早上,當格雷戈爾·薩姆沙從不安的夢中醒來時, + 他發現自己在床上變成了一隻可怕的害蟲。 +> +EOF + +cat >exp <<'EOF' || framework_failure_ +a < +ab ab +abc | abcd +[一天早上,當格雷戈 | <一天早上,當格雷戈 + 他發現自己在床上變 他發現自己在床上變 +] | > +EOF + +# We use option --expand-tabs, because the use of tabs hides horizontal +# alignment bugs in some situations. + +returns_ 1 diff --expand-tabs -y -W 44 in1 in2 >out 2>err || fail=1 +compare exp out || fail=1 +compare /dev/null err || fail=1 + +Exit $fail -- 2.34.1
>From 5d1878f25281f4b424af3f4d10fa8bdf9a66f3e3 Mon Sep 17 00:00:00 2001 From: Bruno Haible <br...@clisp.org> Date: Thu, 6 Jul 2023 20:21:14 +0200 Subject: [PATCH 2/2] diff: Improve handling of mbrtoc32 result * src/side.c (print_half_line): When mbrtoc32 has left the mbstate not in the initial state, continue calling mbrtoc32. --- src/side.c | 90 ++++++++++++++++++++++++++++++++++++++++++------------ 1 file changed, 70 insertions(+), 20 deletions(-) diff --git a/src/side.c b/src/side.c index bf0e74b..840d199 100644 --- a/src/side.c +++ b/src/side.c @@ -141,34 +141,84 @@ print_half_line (char const *const *line, intmax_t indent, intmax_t out_bound) break; default: + /* Invariant: mbstate is in the initial state here. */ { - char32_t wc; - size_t bytes = mbrtoc32 (&wc, tp0, text_limit - tp0, &mbstate); + char const *tp1 = tp0; + /* Sum of the widths of the 32-bit wide characters between + TP0 and TP1. */ + int width_sum = 0; - if (bytes <= MB_LEN_MAX) + for (;;) { - int width = c32width (wc); - if (0 < width && ckd_add (&in_position, in_position, width)) - return out_position; - if (in_position <= out_bound) + /* Invariant: The multibyte sequences between TP0 and TP1 + have already been parsed, but not yet been output to OUT + (because whether we can output it or not depends on the + total width, and we can only determine the total width + once we have arrived at a point where MBSTATE is in the + initial state). */ + char32_t wc; + size_t bytes = mbrtoc32 (&wc, tp1, text_limit - tp1, &mbstate); + + if (bytes < (size_t) -2) { - out_position = in_position; - fwrite (tp0, 1, bytes, out); + /* mbrtoc32 has produced a (or another) 32-bit wide + character. Add its width to WIDTH_SUM. */ + int width = c32width (wc); + if (width > 0) + if (ckd_add (&width_sum, width_sum, width)) + return out_position; + /* Advance TP1. */ + if (bytes != (size_t) -3) + tp1 += (bytes == 0 ? 1 : bytes); } - text_pointer = tp0 + bytes; - /* Resume scanning for single-byte characters, as - shift states are not supported. */ - break; + if (bytes >= (size_t) -2 || mbsinit (&mbstate)) + { + /* Now see whether we have room for WIDTH_SUM columns, + and if so, output the multibyte sequences between + TP0 and TP1. */ + if (tp0 < tp1) + { + if (ckd_add (&in_position, in_position, width_sum)) + return out_position; + if (in_position <= out_bound) + { + out_position = in_position; + fwrite (tp0, 1, tp1 - tp0, out); + } + } + tp0 = tp1; + + if (bytes >= (size_t) -2) + { + /* An encoding error (bytes == (size_t) -1), as + (size_t) -2 cannot happen as the buffer ends + in '\n'. */ + if (tp0 < text_limit) + { + /* Consume one byte. Assume it has + print width 1. */ + if (ckd_add (&in_position, in_position, 1)) + return out_position; + if (in_position <= out_bound) + { + out_position = in_position; + putc (*tp0, out); + } + tp0++; + } + memset (&mbstate, '\0', sizeof mbstate); + } + + /* We are done with this multibyte sequence. + Return to the simpler processing in the outer loop. */ + break; + } } } - - /* An encoding error (bytes == (size_t) -1), - as (size_t) -2 cannot happen as the buffer ends in '\n', - and (size_t) -3 cannot happen on any known platform. - Reset, and assume the error has print width 1. */ - memset (&mbstate, 0, sizeof mbstate); - FALLTHROUGH; + /* Invariant: mbstate is in the initial state here again. */ + text_pointer = tp0; + break; /* Print width 1. */ case ' ': case '!': case '"': case '#': case '$': case '%': -- 2.34.1