Hi Walter,

On Sat, Apr 05, 2025 at 01:33:10PM +0200, Walter Alejandro Iglesias wrote:

> I discovered another issue (in the unpatched ksh).

Since i hate posting UTF-8 to mailing lists, let me paraphrase your
finding such that i do not need to quote your (excessively verbose)
example.

  Put ksh(1) into VI command line editing mode.
  Type a line containing any UTF-8 character (just a two-byte line
    consisting of a single character in enough).
  Put anything into the yank buffer (even a single ASCII character is enough).
  Put the cursor on any UTF-8 character (technically, that means the
    code implementating the shell has the cursor sitting on the first
    byte of that UTF-8 character, but that is an implementation detail
    not visible to the user).
  Issue the paste (p) command.

The result is that the content of the yank buffer is inserted into the
middle of the UTF-8 character.  Fur example, if you use the UTF-8
character 0xc3 0xa9 (e-accent aigu) and the paste buffer 0x61 (letter a),
then you get 0xc3 0x61 0xa9, which is not a valid UTF-8 string.

Consequently, i do confirm that you found a bug.


Walter Alejandro Iglesias wrote on Fri, Apr 11, 2025 at 12:18:22PM +0200:

> Please excuse me if I'm a bit chaotic.

On the one hand, often enough bad patches trigger good ones,
so do not panic.  On the other hand, understand trying to fix stuff
and sending bad patches as a valuable learning experience.

> I don't do this every day;

I don't do so any longer either; it has been a few years that i looked
at this code.

> I have trouble understanding the code at first.

It is indeed mostly devoid of comments.  On the other hand, much of it
is pretty straightforward.  So while a few comments here and there
might help to clarify some of the less obvious aspects, i don't
think we need comments all over the place.

> This seems to work:
>  
> Index: vi.c
> ===================================================================
> RCS file: /cvs/src/bin/ksh/vi.c,v
> diff -u -p -r1.60 vi.c
> --- vi.c      12 Mar 2021 02:10:25 -0000      1.60
> +++ vi.c      11 Apr 2025 09:59:40 -0000
> @@ -834,8 +834,12 @@ vi_cmd(int argcnt, const char *cmd)
>  
>               case 'p':

  /* What follows is the implementation of the paste (p) command. */
     (Clear enough without a comment.)

>                       modified = 1; hnum = hlast;
> -                     if (es->linelen != 0)
> +                     if (es->linelen != 0) {
>                               es->cursor++;

  /* This advances the cursor to the next byte.
     Needed because the subsequent putbuf() insertion inserts left
     of the cursor, but we want to insert to the right of it.
     You correctly recognized that the problem is that advancing
     by a single byte is not enough if we are sitting on the first
     byte of a UTF-8 character. */

> +                             while (!isascii(es->cbuf[es->cursor]) && 
> +                                 es->cursor < es->linelen)
> +                                     es->cursor++;

That, however, i totally wrong.  I conclude that from mere code inspection
and i'm not even going to test it.  Imagine we are sitting at the
beginning of a long string of UTF-8 characters.  This will hurry
past the whole long string because none of the bytes in this long
string is ASCII.

So instead of inserting after the current character and before the next
character, you would now skip potentially many UTF-8 characters and
insert before the next ASCII character, perhaps far to the right.


So what is needed instead?  This is a faily standard task:

  If we sit on a UTF-8 start byte, advance past the whole UTF-8
  character.

So what we need is

  while (es->cursor < es->linelen && isu8cont(es->cbuf[es->cursor]))
        es->cursor++;

But that can be expressed in a more compact way, see below.

A few lines down, the es->cursor-- that you do not change is also
wrong.  After the insertion, the curser sits after the insertion,
but we want it on the last letter of the insertion.

> +                     }
>                       while (putbuf(ybuf, yanklen, 0) == 0 && --argcnt > 0)
>                               ;
>                       if (es->cursor != 0)
> @@ -1195,8 +1199,7 @@ domove(int argcnt, const char *cmd, int 
>                       return -1;
>               ncursor = (*cmd == 'e' ? endword : Endword)(argcnt);
>               if (!sub)
> -                     while (isu8cont((unsigned char)es->cbuf[--ncursor]))
> -                             continue;
> +                     --ncursor;
>               break;

This change breaks the "move to end of word" (e) command.
It seems unrelated because your example only uses "ye", that is,
"e" as a subcommand, but your code change only changes the case
where "e" is stand-alone, i.e. *not* a subcommand.  Consequently,
i do not understand what you are trying to achieve with this change
in domove().

The following patch is not yet complete and only survived minimal,
but not yet sufficient testing.  For example, the 'P' command
right below obviously suffers from the same problem.

Even if i would get OKs for the following patch right now, i would
prefer waiting until after release.  We have clearly lived with this
bug fur years, so it is not release-critical, and the risk of breaking
the shell so close to release does not feel acceptable.

As a final remark, this patch illustrates what i said a few days ago.
The places in the code affected by these kinds of issues are not
easy to find.  There may be no more conspicious indications in the
code than an innocuous-looking ++ or -- operator.

Yours,
  Ingo


Index: vi.c
===================================================================
RCS file: /cvs/src/bin/ksh/vi.c,v
diff -u -p -r1.60 vi.c
--- vi.c        12 Mar 2021 02:10:25 -0000      1.60
+++ vi.c        11 Apr 2025 17:59:01 -0000
@@ -834,12 +834,16 @@ vi_cmd(int argcnt, const char *cmd)
 
                case 'p':
                        modified = 1; hnum = hlast;
-                       if (es->linelen != 0)
-                               es->cursor++;
+                       /* Insert after the current character. */
+                       while (es->cursor < es->linelen)
+                               if (!isu8cont(es->cbuf[++es->cursor]))
+                                       break;
                        while (putbuf(ybuf, yanklen, 0) == 0 && --argcnt > 0)
                                ;
-                       if (es->cursor != 0)
-                               es->cursor--;
+                       /* Back up to the last character inserted. */
+                       while (es->cursor != 0)
+                               if (!isu8cont(es->cbuf[--es->cursor]))
+                                       break;
                        if (argcnt != 0)
                                return -1;
                        break;

Reply via email to