Hi, On a fully UTF-8 environment, in certain easily reproducible cases, the command line becomes messed up (the cursor ends up in the wrong column, and from then on it's tons of cumulative mistakes). I'm using bash-4.2.42, but the bug also affects other readline-based applications.
To reproduce: Save the attached file as your .bash_history. (Notice the UTF-8 accented characters in the middle line. Also notice the trailing space in each line.) Use either the Up arrow, or enter a common prefix of the lines and press a hotkey bound to history-search-backward. You'll see the command line getting messed up. Please read on for my findings on what goes on, and my proposed fix. The bug occurs only when the readline's insert_some_chars() method (the one which inserts columns pushing some characters to the right) is involved. It also needs UTF-8 characters to be present. My first finding was that display.c line 1684, namely: twidth = _rl_col_width (nfd+lendiff, 0, temp-col_lendiff, 1); is clearly broken. _rl_col_width() converts logical byte offsets into columns, and it is passed a column where a byte offset should be present, this is the typical case of "code smell", semantic incorrectness. Changing the argument from "temp-col_lendiff" to "temp-lendiff" fixes the case where you go back from the last history entry to the previous; however it breaks going back from that one to the first :) My second finding, trying to fix the newly introduced bug, was that the concept of the insert_some_chars() is broken in UTF-8 environment. Here's why. We know lendiff (the growth of the command line in bytes) and col_lendiff (the growth in columns). However, the code tries to print the new text in two runs: first the segment that corresponds to the newly created columns (this happens within insert_some_chars()), and then the ones that overwrite the old characters (in update_line()). In order to do that correctly we would need to know how many bytes would produce the given number of columns, (kinda the opposite of _rl_col_width()), but we have no method for that, so no wonder the code does it wrong. It's not even possible to have such a method, e.g. when simple English letter is be replaced by a CJK glyph then coldiff==1, so you shift the remaining characters to the right by 1, and you'd need to print the left half and the right half of that CJK character in separate steps. Also note that another problem with the source line quoted above is that nfd+lendiff might point to the middle of a UTF-8 character, and computing width from there doesn't make sense. Hence the fix would be to abandon the concept of printing new characters in two runs, separately for the new columns and separately for overwriting the old ones. The correct behavior can only be implemented if, after allocating the proper amount of columns, all the differing characters are printed in a single run. This is what my proposed patch does. insert_some_chars() is made simpler: it only inserts new colums, but doesn't print any text, that responsibility is given back to its caller update_line() which prints the whole stuff at the location where it used to print the first part only. Printing the second run is then simply removed. Please see the attached proof-of-concept patch. I've stress-tested it against my actual entries in my real .bash_history, with many mp3 filenames containing accented letters (and trailing space due to completion), and it seems to work correctly. Note that my patch does *not* address the non-UTF-8 branch, I believe I don't have the knowledge and understanding of the source to test all possible scenarios. The three "else" branches starting at line 1645 also have to be modified to print the whole difference, not just the part filling the new columns. I'm kindly asking you to do this piece of work, I guess it should be very easy for you based on my findings, and I'm sure you'll want to make cosmetic changes to my code anyway. Thanks very much, egmont
.bash_history
Description: Binary data
bash-4.2.42-cursor-position-proof-of-concept.patch
Description: Binary data