On 2/25/24 5:37 AM, Martin D Kealey wrote:
Unicode has categories for "modifiers" (especially "modifier letters") and for "combining characters". Note that each symbol can be in multiple categories. Modifiers change how another character is displayed. They may or may not be considered to have their own separate semantic meaning. In the simple cases they simply over-print an additional mark, but more complex adjustments are possible. They don't normally change the overall size of the modified character, so wcwidth(ch) will report zero.
Not quite, as I understand it in the sense Unicode uses them. Unicode modifier characters are base characters in their own right, are not combining characters, and do not change the graphical representation of the base character they modify. Combining marks are more or less as you describe: they may or may not have a graphical representation; they may or may not, but usually do, change the graphical representation of the base character they follow; they are usually, but not always, zero-width. These are the characters you can determine are combining characters by testing whether or not they are zero-width.
What matters is that "combining characters" do not have stand-alone semantic meaning; they should be erased along with the principal character. Accents in European languages (and Thai) tend to be in this category.
This is how readline behaves, and the behavior the OP was reporting as a bug.
To a first approximation, backspace should skip over the latter but not the former. However, if you've just removed a zero-width element, it would be advisable to either re-render the whole line, or backspace over the last full glyph, erase it, and re-render it with all its (remaining) modifiers.
If the difference between the old line and the new line is a zero-width character, readline redisplays the base character, but the readline character operations always skip over zero-width combining characters and operate on the base character+combining characters.
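As a rough illustration of the "skip over zero-width combining characters and operate on the base character+combining characters" idea, here is a minimal sketch in C. It is not readline's code. The names prev_cluster_start, delete_prev_cluster, width_fn and demo_width are hypothetical, and demo_width is only a stand-in that treats U+0300..U+036F as zero-width; a real caller would presumably pass wcwidth() after setlocale() has selected a suitable locale.

```c
#include <stddef.h>
#include <wchar.h>

/* Width callback; wcwidth() has this signature. */
typedef int (*width_fn)(wchar_t);

/* Hypothetical stand-in for wcwidth(), for illustration only: treats
   U+0300..U+036F (Combining Diacritical Marks) as zero-width and
   everything else as width 1. */
static int demo_width(wchar_t wc)
{
    return (wc >= 0x0300 && wc <= 0x036F) ? 0 : 1;
}

/* Return the index at which the cluster ending just before `point`
   begins: step back over trailing zero-width characters, then over
   one base character. */
static size_t prev_cluster_start(const wchar_t *buf, size_t point,
                                 width_fn width)
{
    size_t i = point;

    while (i > 0 && width(buf[i - 1]) == 0)
        i--;                    /* skip combining marks */
    if (i > 0)
        i--;                    /* include the base character */
    return i;
}

/* Delete the whole cluster before `point` from buf[0..len) and
   return the new length. */
static size_t delete_prev_cluster(wchar_t *buf, size_t len, size_t point,
                                  width_fn width)
{
    size_t start = prev_cluster_start(buf, point, width);

    wmemmove(buf + start, buf + point, len - point);
    return len - (point - start);
}
```

With a buffer holding 'a', 'e', U+0301, 'b' and the cursor after the accent (index 3), prev_cluster_start returns 1, so a backward delete removes both the 'e' and its accent. Because combining marks are only usually zero-width, a width test like this is a heuristic, not a definition.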
On systems that need to dynamically load a shared library (libunicode.so?) to support this, I suggest delaying doing so until it's needed -- after setlocale("something.UTF-8") returns success, or some equivalent test. (I hope there's a check that can be done against the already-loaded locale, rather than inspecting the locale name as a string.)
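One possible check against the already-loaded locale is POSIX nl_langinfo(CODESET), which reports the character encoding of the current LC_CTYPE locale rather than its name. The sketch below assumes a POSIX system; the helper names codeset_is_utf8 and current_locale_is_utf8 are hypothetical, and accepting both the "UTF-8" and "utf8" spellings is an assumption about what different platforms return.

```c
#include <langinfo.h>
#include <string.h>

/* Nonzero if the codeset name denotes UTF-8.  Two spellings are
   accepted on the assumption that platforms differ in what
   nl_langinfo(CODESET) returns. */
static int codeset_is_utf8(const char *codeset)
{
    return codeset != NULL &&
           (strcmp(codeset, "UTF-8") == 0 || strcmp(codeset, "utf8") == 0);
}

/* Ask about the LC_CTYPE locale already in effect.  The caller is
   expected to have called setlocale(LC_CTYPE, "") or similar first;
   otherwise this describes the default "C" locale. */
static int current_locale_is_utf8(void)
{
    return codeset_is_utf8(nl_langinfo(CODESET));
}
```

A program could call current_locale_is_utf8() once after setlocale() and only then load any extra Unicode support. This only establishes the encoding; it says nothing about which character properties the locale defines.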
The original report used C.UTF-8 as the locale; there's not a lot you can tell from that about the semantic properties of characters.

-- 
``The lyf so short, the craft so long to lerne.'' - Chaucer
``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, UTech, CWRU    c...@case.edu    http://tiswww.cwru.edu/~chet/