readline display bug with UTF-8 and char insertion

Egmont Koblinger Mon, 21 Jan 2013 12:17:43 -0800

Hi,

On a fully UTF-8 environment, in certain easily reproducible cases, the
command line becomes messed up (the cursor ends up in the wrong column, and
from then on it's tons of cumulative mistakes).  I'm using bash-4.2.42, but
the bug also affects other readline-based applications.


To reproduce:  Save the attached file as your .bash_history.  (Notice the
UTF-8 accented characters in the middle line.  Also notice the trailing
space in each line.)  Use either the Up arrow, or enter a common prefix of
the lines and press a hotkey bound to history-search-backward.  You'll see
the command line getting messed up.

Please read on for my findings on what goes on, and my proposed fix.

The bug occurs only when the readline's insert_some_chars() method (the one
which inserts columns pushing some characters to the right) is involved.
It also needs UTF-8 characters to be present.

My first finding was that display.c line 1684, namely:
  twidth = _rl_col_width (nfd+lendiff, 0, temp-col_lendiff, 1);
is clearly broken.  _rl_col_width() converts logical byte offsets into
columns, and it is passed a column where a byte offset should be present,
this is the typical case of "code smell", semantic incorrectness.  Changing
the argument from "temp-col_lendiff" to "temp-lendiff" fixes the case where
you go back from the last history entry to the previous; however it breaks
going back from that one to the first :)

My second finding, trying to fix the newly introduced bug, was that the
concept of the insert_some_chars() is broken in UTF-8 environment.  Here's
why.

We know lendiff (the growth of the command line in bytes) and col_lendiff
(the growth in columns).  However, the code tries to print the new text in
two runs: first the segment that corresponds to the newly created columns
(this happens within insert_some_chars()), and then the ones that overwrite
the old characters (in update_line()).  In order to do that correctly we
would need to know how many bytes would produce the given number of
columns, (kinda the opposite of _rl_col_width()), but we have no method for
that, so no wonder the code does it wrong.  It's not even possible to have
such a method, e.g. when simple English letter is be replaced by a CJK
glyph then coldiff==1, so you shift the remaining characters to the right
by 1, and you'd need to print the left half and the right half of that CJK
character in separate steps.  Also note that another problem with the
source line quoted above is that nfd+lendiff might point to the middle of a
UTF-8 character, and computing width from there doesn't make sense.

Hence the fix would be to abandon the concept of printing new characters in
two runs, separately for the new columns and separately for overwriting the
old ones.  The correct behavior can only be implemented if, after
allocating the proper amount of columns, all the differing characters are
printed in a single run.  This is what my proposed patch does.
insert_some_chars() is made simpler: it only inserts new colums, but
doesn't print any text, that responsibility is given back to its caller
update_line() which prints the whole stuff at the location where it used to
print the first part only.  Printing the second run is then simply removed.

Please see the attached proof-of-concept patch.  I've stress-tested it
against my actual entries in my real .bash_history, with many mp3 filenames
containing accented letters (and trailing space due to completion), and it
seems to work correctly.

Note that my patch does *not* address the non-UTF-8 branch, I believe I
don't have the knowledge and understanding of the source to test all
possible scenarios.  The three "else" branches starting at line 1645 also
have to be modified to print the whole difference, not just the part
filling the new columns.  I'm kindly asking you to do this piece of work, I
guess it should be very easy for you based on my findings, and I'm sure
you'll want to make cosmetic changes to my code anyway.

Thanks very much,

egmont

.bash_history
Description: Binary data

bash-4.2.42-cursor-position-proof-of-concept.patch
Description: Binary data

readline display bug with UTF-8 and char insertion

Reply via email to