Re: Patch: ksh: fix input handling for 4 byte UTF-8 sequences

Jeremie Courreges-Anglas Thu, 03 Jun 2021 14:17:48 -0700

On Wed, Jun 02 2021, Ingo Schwarze <schwa...@usta.de> wrote:
> Hi,
>
> feeling hesitant to commit into ksh without at least one proper OK,
> i'm resending this patch here, sorry if i missed private feedback.


I found two mails in my drafts folder, sorry, you didn't miss any
feedback from me.

> What the existing code does:
> It tries to make sure that multi-byte UTF-8 characters get passed on by
> the shell without fragmentation, not one byte at time.  I heard people
> say that some software, for example tmux(1), may sometimes get confused
> when receiving a UTF-8 character in a piecemeal manner.
>
> Which problem needs fixing:
> Of the four-byte UTF-8 sequences, only a subset is identified by the
> existing code.  The other four-byte UTF-8 sequences still get chopped
> up resulting in individual bytes being passed on.
>
>
> I'm also adding a few comments as suggested by jca@.  Parsing of UTF-8
> is less trivial than one might think, witnessed once again by the fact
> that i got this code wrong in the first place.

Thanks for this, though I was mostly interested into a pointer to the
relevant standard document that was used.  Now that the checks match the
Unicode standard, I'm less confused.  You're now pointing readers to the
utf8(7) manpage.  Discussion for another diff: should this manpage list
valid sequences instead of invalid ones?

> I also changed "cc & 0x20" to "cc > 0x9f" and "cc & 0x30" to "cc > 0x8f"
> for uniformity and readabilty - UTF-8-parsing is bad enough without
> needless micro-optimization, right?

I don't see a reason to optimize this.  IMO the checks should look just
like in your proposal, which matches the table in section "Unicode
Encoding Forms" in UnicodeStandard-13.0.pdf.

> Note that even with the patch below, moving backward and forward
> over a blowfish icon on the command line still does not work because
> the character is width 2 and the ksh code intentionally does not
> use wcwidth(3).  But maybe it improves something in tmux?  Not sure.
>
> Either way, unless it causes regressions, this (or a further improved
> version) should go in because what is there is clearly wrong.
>
> OK?

ok jca@

>   Ingo
>
> Index: emacs.c
> ===================================================================
> RCS file: /cvs/src/bin/ksh/emacs.c,v
> retrieving revision 1.87
> diff -u -p -r1.87 emacs.c
> --- emacs.c   8 May 2020 14:30:42 -0000       1.87
> +++ emacs.c   13 May 2021 18:16:13 -0000
> @@ -1851,11 +1851,17 @@ x_e_getu8(char *buf, int off)
>               return -1;
>       buf[off++] = c;
>  
> -     if (c == 0xf4)
> +     /*
> +      * In the following, comments refer to violations of
> +      * the inequality tests at the ends of the lines.
> +      * See the utf8(7) manual page for details.
> +      */
> +
> +     if ((c & 0xf8) == 0xf0 && c < 0xf5)  /* beyond Unicode */
>               len = 4;
>       else if ((c & 0xf0) == 0xe0)
>               len = 3;
> -     else if ((c & 0xe0) == 0xc0 && c > 0xc1)
> +     else if ((c & 0xe0) == 0xc0 && c > 0xc1)  /* use single byte */
>               len = 2;
>       else
>               len = 1;
> @@ -1865,9 +1871,10 @@ x_e_getu8(char *buf, int off)
>               if (cc == -1)
>                       break;
>               if (isu8cont(cc) == 0 ||
> -                 (c == 0xe0 && len == 3 && cc < 0xa0) ||
> -                 (c == 0xed && len == 3 && cc & 0x20) ||
> -                 (c == 0xf4 && len == 4 && cc & 0x30)) {
> +                 (c == 0xe0 && len == 3 && cc < 0xa0) ||  /* use 2 bytes */
> +                 (c == 0xed && len == 3 && cc > 0x9f) ||  /* surrogates  */
> +                 (c == 0xf0 && len == 4 && cc < 0x90) ||  /* use 3 bytes */
> +                 (c == 0xf4 && len == 4 && cc > 0x8f)) {  /* beyond Uni. */
>                       x_e_ungetc(cc);
>                       break;
>               }
>

-- 
jca | PGP : 0x1524E7EE / 5135 92C1 AD36 5293 2BDF  DDCC 0DFA 74AE 1524 E7EE

Re: Patch: ksh: fix input handling for 4 byte UTF-8 sequences

Reply via email to