Patch: ksh: fix input handling for 4 byte UTF-8 sequences

Ingo Schwarze Wed, 02 Jun 2021 12:00:56 -0700

Hi,

feeling hesitant to commit into ksh without at least one proper OK,
i'm resending this patch here, sorry if i missed private feedback.


What the existing code does:
It tries to make sure that multi-byte UTF-8 characters get passed on by
the shell without fragmentation, rather than one byte at a time.  I heard
people say that some software, for example tmux(1), may sometimes get
confused when receiving a UTF-8 character in a piecemeal manner.

Which problem needs fixing:
Of the four-byte UTF-8 sequences, the existing code only identifies
those starting with the byte 0xf4.  Those starting with 0xf0 to 0xf3
still get chopped up, resulting in individual bytes being passed on.
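
To illustrate the gap, here is a minimal sketch of the start-byte
classification before and after the patch.  The function names
u8len_old() and u8len_new() are hypothetical and not part of ksh;
only the conditions are copied from the diff below.

```c
/*
 * Hypothetical helpers, not part of ksh: they only reproduce the
 * start-byte tests from x_e_getu8() before and after the patch.
 */

/* Expected sequence length according to the old code. */
int
u8len_old(unsigned char c)
{
	if (c == 0xf4)
		return 4;
	if ((c & 0xf0) == 0xe0)
		return 3;
	if ((c & 0xe0) == 0xc0 && c > 0xc1)
		return 2;
	return 1;
}

/* Expected sequence length according to the patched code. */
int
u8len_new(unsigned char c)
{
	if ((c & 0xf8) == 0xf0 && c < 0xf5)	/* 0xf0 to 0xf4 */
		return 4;
	if ((c & 0xf0) == 0xe0)
		return 3;
	if ((c & 0xe0) == 0xc0 && c > 0xc1)
		return 2;
	return 1;
}
```

With the old test, a start byte of 0xf0 falls through to length 1,
so the three continuation bytes that follow are handled separately.
With the new test, all of 0xf0 to 0xf4 yield length 4.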


I'm also adding a few comments as suggested by jca@.  Parsing of UTF-8
is less trivial than one might think, witnessed once again by the fact
that i got this code wrong in the first place.

I also changed "cc & 0x20" to "cc > 0x9f" and "cc & 0x30" to "cc > 0x8f"
for uniformity and readability - UTF-8 parsing is bad enough without
needless micro-optimization, right?
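
The two spellings only agree because isu8cont() is tested first, so
cc is already known to be a continuation byte in the range 0x80 to
0xbf.  A small brute-force check of that claim, as a sketch; the
function name count_mismatches() is made up for this illustration:

```c
/*
 * Hypothetical check, not part of ksh: count the continuation
 * bytes (0x80 to 0xbf) for which the old bit test and the new
 * comparison disagree.
 */
int
count_mismatches(void)
{
	int cc, bad = 0;

	for (cc = 0x80; cc <= 0xbf; cc++) {
		/* 0xed case: surrogates */
		if (((cc & 0x20) != 0) != (cc > 0x9f))
			bad++;
		/* 0xf4 case: beyond Unicode */
		if (((cc & 0x30) != 0) != (cc > 0x8f))
			bad++;
	}
	return bad;
}
```

Outside that range the forms differ (for example, 0x20 has the 0x20
bit set but is not greater than 0x9f), which is why the comparison
is arguably the easier form to read.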


Note that even with the patch below, moving backward and forward
over a blowfish icon on the command line still does not work because
the character is width 2 and the ksh code intentionally does not
use wcwidth(3).  But maybe it improves something in tmux?  Not sure.

Either way, unless it causes regressions, this (or a further improved
version) should go in because what is there is clearly wrong.

OK?
  Ingo


Index: emacs.c
===================================================================
RCS file: /cvs/src/bin/ksh/emacs.c,v
retrieving revision 1.87
diff -u -p -r1.87 emacs.c
--- emacs.c     8 May 2020 14:30:42 -0000       1.87
+++ emacs.c     13 May 2021 18:16:13 -0000
@@ -1851,11 +1851,17 @@ x_e_getu8(char *buf, int off)
                return -1;
        buf[off++] = c;
 
-       if (c == 0xf4)
+       /*
+        * In the following, comments refer to violations of
+        * the inequality tests at the ends of the lines.
+        * See the utf8(7) manual page for details.
+        */
+
+       if ((c & 0xf8) == 0xf0 && c < 0xf5)  /* beyond Unicode */
                len = 4;
        else if ((c & 0xf0) == 0xe0)
                len = 3;
-       else if ((c & 0xe0) == 0xc0 && c > 0xc1)
+       else if ((c & 0xe0) == 0xc0 && c > 0xc1)  /* use single byte */
                len = 2;
        else
                len = 1;
@@ -1865,9 +1871,10 @@ x_e_getu8(char *buf, int off)
                if (cc == -1)
                        break;
                if (isu8cont(cc) == 0 ||
-                   (c == 0xe0 && len == 3 && cc < 0xa0) ||
-                   (c == 0xed && len == 3 && cc & 0x20) ||
-                   (c == 0xf4 && len == 4 && cc & 0x30)) {
+                   (c == 0xe0 && len == 3 && cc < 0xa0) ||  /* use 2 bytes */
+                   (c == 0xed && len == 3 && cc > 0x9f) ||  /* surrogates  */
+                   (c == 0xf0 && len == 4 && cc < 0x90) ||  /* use 3 bytes */
+                   (c == 0xf4 && len == 4 && cc > 0x8f)) {  /* beyond Uni. */
                        x_e_ungetc(cc);
                        break;
                }
