On Tue, 22 Oct 2019 11:04:01 +0200 Daniel Bünzli via Unicode <unicode@unicode.org> wrote:
> On 22 October 2019 at 09:37:22, Richard Wordingham via Unicode
> (unicode@unicode.org) wrote:
>
> > When it comes to the second sentence of the text of Slide 7
> > 'Grapheme Clusters', my overwhelming reaction is one of extreme
> > anger. Slide 8 does nothing to lessen the offence. The problem is
> > that it gives the impression that in general it is acceptable for
> > backspace to delete the whole grapheme cluster.
>
> Let's turn extreme anger into knowledge.
>
> I'm not very knowledgable in ligature heavy scripts (I suspect that's
> what you refer to) and what you describe is the first thing I went
> with for a readline editor data structure.

Not necessarily ligature-heavy, but heavy in combining characters.
Examples at the light end include IPA and pointed Hebrew. The Thai
script is another fairly well-known one but Siamese itself doesn't use
more than two marks on a consonant. (The vowel marks before and after
don't count - they work like letters.)

> Would maybe care to expand when exactly you think it's not acceptable
> and what kind of tools or standard I can find the Unicode toolbox to
> implement an acceptable behaviour for backspace on general Unicode
> text.

The compromise that has generally been reached is that 'delete' deletes
a grapheme cluster and 'backspace' deletes a scalar value. (There are
good editors like Emacs that delete only a single character.) The
rationale for this is that backspace undoes the effect of a keystroke.
For a perfect match, the keyboard would need to handle the backspace -
and everyone editing the text would have to use compatible keyboards!
That's not a very plausible scenario for a Wikipedia article.

Now, deleting the last character is not very Unicode compliant; there
is a family of keyboard designs in development that by default deletes
the last character in NFC form if it is precomposed and otherwise the
last character in NFD forms. UTS#35 Issue 36 Part 7 Section 5.21 allows
for more elaborate behaviours.
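
To make the distinction concrete, here is a rough Python sketch of the
three behaviours mentioned above. Everything in it is an assumption
made for illustration: the function names are hypothetical, the
"cluster" is approximated as one character plus any following
combining marks (this is not the UAX #29 algorithm), and the NFC/NFD
function is only one possible reading of the one-sentence description
of those keyboard designs, not their actual implementation.

```python
import unicodedata


def backspace_scalar(text: str) -> str:
    """Backspace: drop only the last scalar value (code point)."""
    return text[:-1]


def is_mark(ch: str) -> bool:
    """True for general categories Mn, Mc and Me (combining marks)."""
    return unicodedata.category(ch).startswith("M")


def delete_cluster(text: str, pos: int) -> str:
    """Delete: drop the approximate cluster that starts at index pos.

    ASSUMPTION: a cluster is one character plus any marks after it.
    This is a simplification, not the UAX #29 segmentation rules.
    """
    if pos >= len(text):
        return text
    end = pos + 1
    while end < len(text) and is_mark(text[end]):
        end += 1
    return text[:pos] + text[end:]


def backspace_nfc_nfd(text: str) -> str:
    """One reading of the NFC/NFD backspace default (an assumption)."""
    nfc = unicodedata.normalize("NFC", text)
    if not nfc:
        return nfc
    last = nfc[-1]
    # "Precomposed" is taken here to mean: the character changes under NFD.
    if unicodedata.normalize("NFD", last) != last:
        return nfc[:-1]
    # Otherwise remove the last character of the NFD form.
    return unicodedata.normalize("NFD", text)[:-1]
```

With pointed Hebrew such as bet + dagesh + sheva
("\u05d1\u05bc\u05b0"), backspace_scalar removes only the sheva,
whereas delete_cluster at index 0 removes the consonant together with
both points.
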
I would contend that deleting the last character is the best simple
approximation. However, it's not impossible for a dead key
implementation to decide that dead acute plus 'e' should be emitted as
two characters, even though it's more usual for it to be emitted as a
single character.

Now, there are cases where one may be unlikely to type a single
character. I can imagine a variation sequence or being implemented as
a 'ligature', i.e. a single stroke (or IME selection action) yielding
the entry of a base character plus variation selector. Emoji may be
another, though I must say I would probably enter a regional indicator
pair as two characters, and expect to be able to delete just the last
if I made an error, contra Davis 2019.

While stacker + consonant might be expected to be a unit, the original
designs envisaged them being a sequence. Additionally, I would expect
an edit to change the subscripted consonant rather than remove it. In
this case, delete last character and delete grapheme cluster agree for
the language-independent rules.

Richard.