On Fri, 23 Feb 2024, Chet Ramey wrote:
> On 2/19/24 9:26 PM, Avid Seeker wrote:
> > When pressing backspace on Arabic ligatures (including characters with
> > diacritics), they are removed as if they are one character.
>
> As you might guess, readline doesn't know much about Arabic, per se. In a
> UTF-8 locale, for example, it knows base characters and combining
> characters.
>
> The idea is simple: when moving backwards, move one multibyte character at
> a time, ignoring combining characters, until you get to a character for
> which wcwidth(x) > 0, and move point there. The algorithm for moving
> forward is similar.
>
> How should this be modified to support Arabic in a portable way?

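For reference, the backward-movement rule quoted above can be sketched
roughly as follows. This is only an illustration, not readline's actual
code: the names move_back, prev_char_start, decode_at and demo_width are
made up here, the buffer is assumed to hold valid UTF-8, and demo_width is
a toy stand-in for wcwidth() that knows just two ranges of combining marks
so that the example does not depend on the current locale.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for wcwidth(): 0 for two small ranges of combining
   marks, 1 for everything else. Real code would call wcwidth() instead. */
static int demo_width(uint32_t cp)
{
    if (cp >= 0x0300 && cp <= 0x036F) return 0;  /* combining diacritical marks */
    if (cp >= 0x064B && cp <= 0x065F) return 0;  /* Arabic vowel marks and similar */
    return 1;
}

/* Step back from byte offset `point` (> 0) to the first byte of the
   previous UTF-8 sequence by skipping continuation bytes (10xxxxxx). */
static size_t prev_char_start(const char *buf, size_t point)
{
    do {
        point--;
    } while (point > 0 && ((unsigned char)buf[point] & 0xC0) == 0x80);
    return point;
}

/* Decode the UTF-8 sequence starting at buf[i]; valid input is assumed. */
static uint32_t decode_at(const char *buf, size_t i)
{
    unsigned char c = (unsigned char)buf[i];
    if (c < 0x80) return c;
    int extra = (c >= 0xF0) ? 3 : (c >= 0xE0) ? 2 : 1;
    uint32_t cp = c & (0x3F >> extra);  /* payload bits of the lead byte */
    for (int k = 1; k <= extra; k++)
        cp = (cp << 6) | ((unsigned char)buf[i + k] & 0x3F);
    return cp;
}

/* Move point back one character at a time, passing over zero-width
   characters, and stop on the first character whose width is > 0. */
static size_t move_back(const char *buf, size_t point)
{
    assert(buf != NULL);
    while (point > 0) {
        point = prev_char_start(buf, point);
        if (demo_width(decode_at(buf, point)) > 0)
            break;
    }
    return point;
}
```

With this rule, a base letter followed by a combining mark (for example
"e" plus U+0301, or Arabic ba plus fatha) is treated as one unit, which is
the behaviour the original report describes.
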
Unicode has categories for "modifiers" (especially "modifier letters") and
for "combining characters". Note that each symbol can be in multiple
categories.

Modifiers change how another character is displayed. They may or may not be
considered to have their own separate semantic meaning. In the simple cases
they simply over-print an additional mark, but more complex adjustments are
possible. They don't normally change the overall size of the modified
character, so wcwidth(ch) will report zero.

What matters is that "combining characters" do not have stand-alone
semantic meaning; they should be erased along with the principal character.
Accents in European languages (and Thai) tend to be in this category.

To a first approximation, backspace should skip over combining characters
(the latter) but not over modifiers (the former). However, if you've just
removed a zero-width element, it would be advisable to either re-render the
whole line, or backspace over the last full glyph, erase it, and re-render
it with all its (remaining) modifiers.

https://stackoverflow.com/questions/54450823/what-is-the-difference-between-combining-characters-and-modifier-letters

On systems that need to dynamically load a shared library (libunicode.so?)
to support this, I suggest delaying doing so until it's needed -- after
setlocale(LC_ALL, "something.UTF-8") returns success, or some equivalent
test. (I hope there's a check that can be done against the already-loaded
locale, rather than inspecting the locale name as a string.)

-Martin
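
On the last point, one candidate for a check against the already-loaded
locale is POSIX nl_langinfo(CODESET), which reports the character encoding
of the current LC_CTYPE locale. A minimal sketch follows; the helper names
codeset_is_utf8 and current_locale_is_utf8 are made up for illustration,
and the spellings accepted for the codeset name are an assumption, since
systems differ in how they spell it.

```c
#include <assert.h>
#include <langinfo.h>
#include <locale.h>
#include <strings.h>

/* Returns 1 if `codeset` names UTF-8. Accepting both "UTF-8" and "UTF8",
   in any letter case, is an assumption about how platforms spell it. */
static int codeset_is_utf8(const char *codeset)
{
    assert(codeset != NULL);
    return strcasecmp(codeset, "UTF-8") == 0 ||
           strcasecmp(codeset, "UTF8") == 0;
}

/* Asks the C library which encoding the current locale uses. Call this
   only after setlocale(LC_ALL, "") or similar; before that the program is
   still in the "C" locale. */
static int current_locale_is_utf8(void)
{
    return codeset_is_utf8(nl_langinfo(CODESET));
}
```

A caller could then defer loading any extra Unicode support until
current_locale_is_utf8() returns nonzero.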