Re: Bug: Ligatures are removed as one character

Chet Ramey Mon, 26 Feb 2024 06:11:57 -0800

On 2/25/24 5:37 AM, Martin D Kealey wrote:

Unicode has categories for "modifiers" (especially "modifier letters") and
for "combining characters". Note that each symbol can be in multiple
categories.


Modifiers change how another character is displayed. They may or may not be
considered to have their own separate semantic meaning. In the simple cases
they simply over-print an additional mark, but more complex adjustments are
possible. They don't normally change the overall size of the modified
character, so wcwidth(ch) will report zero.


Not quite, at least in the sense Unicode uses those terms, as I understand it.

Unicode modifier characters are base characters in their own right, are
not combining characters, and do not change the graphical representation
of the base character they modify.

Combining marks are more or less as you describe: they may or may not have
a graphical representation; they may or may not, but usually do, change
the graphical representation of the base character they follow; they are
usually, but not always, zero-width. These are the characters you can
identify as combining characters by testing whether they are zero-width.
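
The distinction between modifier letters and combining marks can be
checked against the Unicode character database. The following is a minimal
sketch, not anything readline does: it uses Python's unicodedata module,
and the helper names describe() and is_nonspacing_mark() are made up for
illustration. It treats the Mn and Me general categories as a rough
stand-in for "zero-width", which is an assumption; the real test discussed
in this thread is wcwidth() under the current locale.

```python
import unicodedata

def describe(ch):
    """Return (general category, canonical combining class) for one code point."""
    return unicodedata.category(ch), unicodedata.combining(ch)

def is_nonspacing_mark(ch):
    # Mn (nonspacing) and Me (enclosing) marks are the ones that are normally
    # zero-width. Mc (spacing combining) marks usually take up a column, so
    # they are deliberately excluded here.
    return unicodedata.category(ch) in ("Mn", "Me")

samples = [
    "\u02b0",  # MODIFIER LETTER SMALL H: a modifier letter, category Lm
    "\u0301",  # COMBINING ACUTE ACCENT: nonspacing mark, category Mn
    "\u0e48",  # THAI CHARACTER MAI EK: nonspacing tone mark, category Mn
    "\u0903",  # DEVANAGARI SIGN VISARGA: spacing combining mark, category Mc
]
for ch in samples:
    print(unicodedata.name(ch), describe(ch), is_nonspacing_mark(ch))
```

The modifier letter comes back as a letter (Lm) with combining class 0,
that is, a base character in its own right, while the Mc entry shows a
combining mark that would not be caught by a zero-width test.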

What matters is that "combining characters" do not have stand-alone semantic
meaning; they should be erased along with the principal character. Accents
in European languages (and Thai) tend to be in this category.


This is how readline behaves, and the behavior the OP was reporting as
a bug.

To a first approximation, backspace should skip over the latter but not the
former. However if you've just removed a zero-width element, it would be
advisable to either re-render the whole line, or backspace over the last
full glyph, erase it, and re-render it with all its (remaining) modifiers.


If the difference between the old line and the new line is a zero-width
character, readline redisplays the base character, but the readline
character operations always skip over zero-width combining characters
and operate on the base character plus its combining characters as a unit.
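
As an illustration of that behavior, here is a minimal sketch of a
backward-delete that removes the base character together with any
zero-width marks following it. This is not readline's code; the function
names are hypothetical, and the Mn/Me category check is an assumed
stand-in for a wcwidth(ch) == 0 test.

```python
import unicodedata

def is_zero_width_mark(ch):
    # Assumed stand-in for wcwidth(ch) == 0: nonspacing and enclosing marks.
    return unicodedata.category(ch) in ("Mn", "Me")

def backward_delete(text):
    """Remove the last base character and any zero-width marks attached to it."""
    i = len(text)
    # Skip backwards over trailing zero-width marks...
    while i > 0 and is_zero_width_mark(text[i - 1]):
        i -= 1
    # ...then step over the base character itself, if there is one.
    if i > 0:
        i -= 1
    return text[:i]
```

With this sketch, deleting backwards from "e" followed by U+0301
(COMBINING ACUTE ACCENT) removes both code points at once, while a
modifier letter such as U+02B0 is removed on its own because it is a base
character.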

On systems that need to dynamically load a shared library (libunicode.so?)
to support this, I suggest delaying doing so until it's needed -- after
setlocale("something.UTF-8") returns success, or some equivalent test. (I
hope there's a check that can be done against the already-loaded locale,
rather than inspecting the locale name as a string.)
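
One possible form of such a check, sketched here under assumptions: on
POSIX systems nl_langinfo(CODESET) reports the character encoding of the
locale that is actually loaded, so no parsing of the locale name is
needed. The sketch below uses Python's locale module (nl_langinfo is only
available on Unix-like platforms); the function names are hypothetical,
and nothing here is claimed to be what bash or readline does.

```python
import codecs
import locale

def codeset_is_utf8(codeset):
    """True if a codeset name such as 'UTF-8' or 'utf8' denotes UTF-8."""
    try:
        # codecs.lookup() normalizes aliases to a canonical codec name.
        return codecs.lookup(codeset).name == "utf-8"
    except LookupError:
        return False

def active_locale_is_utf8():
    """Ask the loaded LC_CTYPE locale for its codeset instead of parsing its name."""
    try:
        locale.setlocale(locale.LC_CTYPE, "")  # adopt the environment's locale
    except locale.Error:
        pass  # environment names an unsupported locale; keep the active one
    return codeset_is_utf8(locale.nl_langinfo(locale.CODESET))
```

Note that this only answers "is the encoding UTF-8?"; it says nothing
about the semantic properties of individual characters.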


The original report used C.UTF-8 as the locale; there's not a lot you can
tell from that about the semantic properties of characters.

--
``The lyf so short, the craft so long to lerne.'' - Chaucer
                 ``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, UTech, CWRU    c...@case.edu    http://tiswww.cwru.edu/~chet/
