Hi,

feeling hesitant to commit to ksh without at least one proper OK, i'm resending this patch here; sorry if i missed private feedback.

What the existing code does:
It tries to make sure that multi-byte UTF-8 characters get passed on by the shell without fragmentation, not one byte at a time. I heard people say that some software, for example tmux(1), may sometimes get confused when receiving a UTF-8 character in a piecemeal manner.

Which problem needs fixing:
Of the four-byte UTF-8 sequences, the existing code only identifies those starting with the byte 0xf4. The other four-byte UTF-8 sequences, those starting with 0xf0 to 0xf3, still get chopped up, resulting in individual bytes being passed on.

I'm also adding a few comments as suggested by jca@. Parsing of UTF-8 is less trivial than one might think, witnessed once again by the fact that i got this code wrong in the first place.

I also changed "cc & 0x20" to "cc > 0x9f" and "cc & 0x30" to "cc > 0x8f" for uniformity and readability (for continuation bytes, 0x80 to 0xbf, the old and new tests are equivalent) - UTF-8 parsing is bad enough without needless micro-optimization, right?

Note that even with the patch below, moving backward and forward over a blowfish icon on the command line still does not work because the character is width 2 and the ksh code intentionally does not use wcwidth(3). But maybe it improves something in tmux? Not sure.

Either way, unless it causes regressions, this (or a further improved version) should go in because what is there is clearly wrong.

OK?
  Ingo


Index: emacs.c
===================================================================
RCS file: /cvs/src/bin/ksh/emacs.c,v
retrieving revision 1.87
diff -u -p -r1.87 emacs.c
--- emacs.c	8 May 2020 14:30:42 -0000	1.87
+++ emacs.c	13 May 2021 18:16:13 -0000
@@ -1851,11 +1851,17 @@ x_e_getu8(char *buf, int off)
 		return -1;
 	buf[off++] = c;
 
-	if (c == 0xf4)
+	/*
+	 * In the following, comments refer to violations of
+	 * the inequality tests at the ends of the lines.
+	 * See the utf8(7) manual page for details.
+	 */
+
+	if ((c & 0xf8) == 0xf0 && c < 0xf5)  /* beyond Unicode */
 		len = 4;
 	else if ((c & 0xf0) == 0xe0)
 		len = 3;
-	else if ((c & 0xe0) == 0xc0 && c > 0xc1)
+	else if ((c & 0xe0) == 0xc0 && c > 0xc1)  /* use single byte */
 		len = 2;
 	else
 		len = 1;
@@ -1865,9 +1871,10 @@ x_e_getu8(char *buf, int off)
 		if (cc == -1)
 			break;
 		if (isu8cont(cc) == 0 ||
-		    (c == 0xe0 && len == 3 && cc < 0xa0) ||
-		    (c == 0xed && len == 3 && cc & 0x20) ||
-		    (c == 0xf4 && len == 4 && cc & 0x30)) {
+		    (c == 0xe0 && len == 3 && cc < 0xa0) ||  /* use 2 bytes */
+		    (c == 0xed && len == 3 && cc > 0x9f) ||  /* surrogates */
+		    (c == 0xf0 && len == 4 && cc < 0x90) ||  /* use 3 bytes */
+		    (c == 0xf4 && len == 4 && cc > 0x8f)) {  /* beyond Uni. */
 			x_e_ungetc(cc);
 			break;
 		}
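
For anyone who wants to poke at the byte ranges outside of ksh, here is a minimal standalone sketch of the same tests. It is not part of the patch: the function names u8_len() and u8_second_ok() are made up for illustration, and the sketch assumes the range restrictions are meant for the byte directly following the start byte, which is how i read utf8(7).

```c
/*
 * Hypothetical helper, not from ksh: expected length of a UTF-8
 * sequence given its first byte.  Bytes that cannot start a valid
 * multi-byte sequence (ASCII, continuation bytes, 0xc0, 0xc1, and
 * 0xf5 to 0xff) yield 1, so they get passed on as single bytes.
 */
static int
u8_len(unsigned char c)
{
	if ((c & 0xf8) == 0xf0 && c < 0xf5)
		return 4;
	if ((c & 0xf0) == 0xe0)
		return 3;
	if ((c & 0xe0) == 0xc0 && c > 0xc1)
		return 2;
	return 1;
}

/*
 * Hypothetical helper, not from ksh: may cc directly follow
 * the start byte c?  Returns 1 if so, 0 otherwise.
 */
static int
u8_second_ok(unsigned char c, unsigned char cc)
{
	if ((cc & 0xc0) != 0x80)
		return 0;	/* not a continuation byte */
	if (c == 0xe0 && cc < 0xa0)
		return 0;	/* overlong, fits into 2 bytes */
	if (c == 0xed && cc > 0x9f)
		return 0;	/* UTF-16 surrogates */
	if (c == 0xf0 && cc < 0x90)
		return 0;	/* overlong, fits into 3 bytes */
	if (c == 0xf4 && cc > 0x8f)
		return 0;	/* beyond U+10FFFF */
	return 1;
}
```

For example, the blowfish U+1F421 is encoded as f0 9f 90 a1. Here u8_len(0xf0) is 4 and u8_second_ok(0xf0, 0x9f) is 1, whereas a "c == 0xf4" test alone treats the start byte 0xf0 as a single byte.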