On 23/02/2022 10:58, JD wrote:
Hi!

I have fmt from coreutils 8.32.1 installed via MacPorts.

If I run the following command: `echo х х х х х х х х х х х х х х х х х х х х х 
х х х х х | gfmt -sw 10` (which is just echoing 26 Cyrillic 'х' ('kha') 
letters), I get the following results:

https://i.imgur.com/yRx7uuz.png (iTerm2)
https://i.imgur.com/7oQ0UPz.png (iTerm2 if passed via `more`)
https://i.imgur.com/UlLrEMy.png (Alacritty)

And if I delete just two 'х' letters, like this: `echo х х х х х х х х х х х х 
х х х х х х х х х х х х | gfmt -sw 10`, everything shows just fine: 
https://i.imgur.com/DwuWxyx.png

Would be grateful for any advice :)

The issue here is that (on macOS 10.15.7 at least),
isspace(0x85) returns true for UTF-8 locales
(but not for "C" or "iso8859-1" locales).
BTW iscntrl() returns true for 0x85 in all non-C locales
on both Linux and macOS.

Now gnulib says wrt isspace() that:

"This function's behaviour depends on the locale, but does not support
the multibyte characters that occur in strings in locales with
@code{MB_CUR_MAX > 1} (this includes all the common UTF-8 locales)."

I think isspace(0x85) returning true on macOS is a bug,
but we should probably avoid isspace() in fmt altogether
given its inconsistency with multibyte locales.
The attached uses c_isspace() instead.

cheers,
Pádraig
From 166b6783bc1a6e0ce206114c1d593c2528e3cfa1 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?P=C3=A1draig=20Brady?= <p...@draigbrady.com>
Date: Wed, 23 Feb 2022 17:50:46 +0000
Subject: [PATCH] fmt: fix invalid multi-byte splitting on macOS

On macOS, isspace(0x85) returns true,
which results in splitting within multi-byte characters.

* src/fmt.c (get_line): s/isspace/c_isspace/.
* tests/fmt/non-space.sh: Add a new test.
* tests/local.mk: Reference new test.
* NEWS: Mention the fix.
Addresses https://bugs.gnu.org/54124
---
 NEWS                   |  4 ++++
 src/fmt.c              |  3 ++-
 tests/fmt/non-space.sh | 49 ++++++++++++++++++++++++++++++++++++++++++
 tests/local.mk         |  3 ++-
 4 files changed, 57 insertions(+), 2 deletions(-)
 create mode 100755 tests/fmt/non-space.sh

diff --git a/NEWS b/NEWS
index ef65b4ab8..35d9a50dd 100644
--- a/NEWS
+++ b/NEWS
@@ -21,6 +21,10 @@ GNU coreutils NEWS                                    -*- outline -*-
   and B is in some other file system.
   [bug introduced in coreutils-9.0]
 
+  On macOS, fmt no longer corrupts multi-byte characters
+  by misdetecting their component bytes as spaces.
+  [This bug was present in "the beginning".]
+
   'id xyz' now uses the name 'xyz' to determine groups, instead of xyz's uid.
   [bug introduced in coreutils-8.22]
 
diff --git a/src/fmt.c b/src/fmt.c
index 1eb7019b0..05bafabd6 100644
--- a/src/fmt.c
+++ b/src/fmt.c
@@ -26,6 +26,7 @@
    it to be a type get syntax errors for the variable declaration below.  */
 #define word unused_word_type
 
+#include "c-ctype.h"
 #include "system.h"
 #include "error.h"
 #include "die.h"
@@ -702,7 +703,7 @@ get_line (FILE *f, int c)
           *wptr++ = c;
           c = getc (f);
         }
-      while (c != EOF && !isspace (c));
+      while (c != EOF && !c_isspace (c));
       in_column += word_limit->length = wptr - word_limit->text;
       check_punctuation (word_limit);
 
diff --git a/tests/fmt/non-space.sh b/tests/fmt/non-space.sh
new file mode 100755
index 000000000..b59838983
--- /dev/null
+++ b/tests/fmt/non-space.sh
@@ -0,0 +1,49 @@
+#!/bin/sh
+# Test fmt space handling
+
+# Copyright (C) 2022 Free Software Foundation, Inc.
+
+# This program is free software: you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation, either version 3 of the License, or
+# (at your option) any later version.
+
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <https://www.gnu.org/licenses/>.
+
+. "${srcdir=.}/tests/init.sh"; path_prepend_ ./src
+print_ver_ fmt printf
+
+# Before coreutils 9.1 macOS treated bytes like 0x85
+# as space characters in multi-byte locales (including UTF-8)
+
+check_non_space() {
+  char="$1"
+  test "$(env printf "=$char=" | fmt -s -w1 | wc -l)" = 1 || fail=1
+}
+
+export LC_ALL=en_US.iso8859-1  # only lowercase form works on macOS 10.15.7
+if test "$(locale charmap 2>/dev/null | sed 's/iso/ISO-/')" = ISO-8859-1; then
+  check_non_space '\xA0'
+fi
+
+export LC_ALL=en_US.UTF-8
+if test "$(locale charmap 2>/dev/null)" = UTF-8; then
+  check_non_space '\u00A0'  # No break space
+  check_non_space '\u2007'  # TODO: should probably split on figure space
+  check_non_space '\u202F'  # Narrow no break space
+  check_non_space '\u2060'  # zero-width no break space
+  check_non_space '\u0445'  # Cyrillic kha has 0x85, which macOS isspace()=true
+fi
+
+export LC_ALL=ru_RU.KOI8-R
+if test "$(locale charmap 2>/dev/null)" = KOI8-R; then
+  check_non_space '\x9A'
+fi
+
+Exit $fail
diff --git a/tests/local.mk b/tests/local.mk
index f1376fb71..f97ddcb98 100644
--- a/tests/local.mk
+++ b/tests/local.mk
@@ -237,8 +237,9 @@ all_tests =					\
   tests/chgrp/posix-H.sh			\
   tests/chgrp/recurse.sh			\
   tests/fmt/base.pl				\
-  tests/fmt/long-line.sh			\
   tests/fmt/goal-option.sh			\
+  tests/fmt/long-line.sh			\
+  tests/fmt/non-space.sh			\
   tests/misc/echo.sh				\
   tests/misc/env.sh				\
   tests/misc/env-signal-handler.sh		\
-- 
2.26.2
