2017-08-17 22:37 GMT+02:00 Richard Wordingham via Unicode <unicode@unicode.org>:
> Thus, at the level of undisputable text, in Indic scripts there appears
> to be no provision for the ordering of multiple left matras that are
> to be stored in logical order (i.e. backing order) after the onset
> consonants. (Thus, this is not a problem for the Thai script.)
> Fortunately, there is no good evidence that the occurrence of multiple
> distinct left matras is anything but a typing error, though I can easily
> see how it might be used as a lexicographical convention on the fuzzy
> edge of plain text.
>
> In a similar vein, in Malayalam, we get repeats of the 2-part vowel
> U+0D4B MALAYALAM VOWEL SIGN OO (see Cibu Johny's report at
> https://lists.freedesktop.org/archives/harfbuzz/2013-February/002945.html
> ),
> but I'm not sure what the legitimate encodings of the example word
> കോോോ (typed here as <U+0D15, U+0D4B, U+0D4B, U+0D4B>) are.
>

Even if there were typing errors, the input method should either signal them visually to the user (using canonical reordering), or let the user cancel this reordering (e.g. with CTRL+Z to undo it); the input method could then maintain the typed order by automatically inserting combining joiners, even though the user did not enter them directly. The text editor should also remove these joiners transparently, without requiring the user to perform complex selections or to press BACKSPACE several times, as I don't see any use for joiners at the end of graphemes, or for multiple joiners in a sequence. The user can then even click in the middle of an uncommon sequence of matras to insert a missing consonant if needed: here too, the joiner that is encoded but hidden there would be dropped automatically. If specific sequences need joiners for a useful distinction between some pairs of letters or diacritics, the input editor could offer a way to enter the sequence directly, or to change the encoding of that pair to include or omit the joiner in the middle.
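
As a rough illustration of that automatic cleanup, here is a minimal Python sketch. It assumes that the "combining joiner" is U+034F COMBINING GRAPHEME JOINER, and it approximates "end of a grapheme" as "not directly followed by a combining mark". The function name tidy_joiners is made up for this example; a real editor would more likely work on proper grapheme cluster boundaries.

```python
import unicodedata

CGJ = "\u034f"  # COMBINING GRAPHEME JOINER (assumed to be the joiner meant here)

def tidy_joiners(text):
    """Drop joiners that serve no purpose (hypothetical editor helper).

    A joiner is kept only when a combining mark other than another
    joiner follows it directly, so runs of joiners collapse to one and
    joiners at the end of a cluster or of the text disappear.
    """
    out = []
    for i, ch in enumerate(text):
        if ch == CGJ:
            nxt = text[i + 1] if i + 1 < len(text) else ""
            followed_by_mark = (
                nxt != ""
                and nxt != CGJ
                and unicodedata.category(nxt).startswith("M")
            )
            if not followed_by_mark:
                continue  # useless joiner: skip it silently
        out.append(ch)
    return "".join(out)
```

With this sketch, a joiner typed between two U+0D4B signs is kept, while a joiner left dangling after the last matra (or doubled by mistake) is removed without the user having to select it.
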
Having to retype the matra completely (using BACKSPACE, which transparently deletes the joiners, or using normal text selection over full clusters) should be the exception. If such special sequences requiring joiners are frequent, there should be a way to enter them directly for the target language: the input editor could propose them with a point-and-click/touch palette, some function or control keys, or a contextual menu when the user selects a candidate occurrence where alternate encodings are possible and known (possibly registered by users themselves in their own input preferences, or in a personal lexical file of alternate words that deviate from the most common orthographic rules). Which UI widget or function key the input editor uses is left to the system or application UI. But the system should not decide alone that a sequence is invalid for some orthographic system, when Unicode provides valid ways to deviate from any orthographic system and to bypass the common canonical equivalences by adding some transparent controls. Even for Latin, one can freely enter SHY controls at any place within words, even where they are not at correct syllable boundaries: this affects the rendering if there are line breaks, but it is done on purpose, and it is still easy to correct if it was done by mistake (a spell checker could also help locate these uncommon errors in existing texts, but would not correct them automatically without instruction from the user, and a user can also choose to ignore these signals and store the text as is). Whether text with uncommon sequences is easy to render correctly is not the problem: the editor will just attempt a best-effort representation, and if this approximate representation turns up too often, fonts and renderers will be updated later to support and render the "uncommon" sequence correctly (without even needing to change the Unicode standard itself).
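
To show what "bypassing canonical equivalences with a transparent control" looks like in practice, here is a small Python sketch using the standard unicodedata module. It uses a Latin example and assumes the control in question is U+034F COMBINING GRAPHEME JOINER, whose combining class is 0; the variable names are only illustrative.

```python
import unicodedata

CGJ = "\u034f"  # COMBINING GRAPHEME JOINER, canonical combining class 0

# "a" + COMBINING ACUTE ACCENT (class 230) + COMBINING DOT BELOW (class 220),
# i.e. the two marks typed in non-canonical order.
typed = "a\u0301\u0323"

# Same sequence, with a joiner inserted between the two marks.
protected = "a\u0301" + CGJ + "\u0323"

# Normalization reorders the unprotected marks (dot below moves first)...
reordered = unicodedata.normalize("NFD", typed)

# ...but leaves the protected sequence in the order it was typed.
kept = unicodedata.normalize("NFD", protected)

print([f"U+{ord(c):04X}" for c in reordered])
print([f"U+{ord(c):04X}" for c in kept])
```

The joiner stays in the stored text but has no visible glyph of its own, which is why the editor can insert or drop it without the user noticing.
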
But inputting such text will not be blocked. The case of confusable two-part vowels in Indic scripts, however, causes a problem of interpretation: it is not reasonable to expect users to choose one sequence over the other when both render the same under the typographic rules currently implemented in renderers, yet collate differently. This may be a problem for plain-text searches that look for distinctions, or for sorting, but it can be fixed by defining collation strengths or search flags that apply or skip some collation equivalences (by enabling or disabling some tailorings), which can in turn help set up a spell checker to signal or ignore some suggested corrections.
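
A minimal Python sketch of such search flags follows. It is not a real collation implementation (the Python standard library has none): it only builds a comparison key. The function names, the ignore_controls flag and the tailorings parameter are all hypothetical, and the table of confusable sequences is deliberately left to the caller rather than guessed here.

```python
import unicodedata

CGJ = "\u034f"  # COMBINING GRAPHEME JOINER
SHY = "\u00ad"  # SOFT HYPHEN

def search_key(text, ignore_controls=True, tailorings=None):
    """Return a comparison key for plain-text search.

    ignore_controls: hypothetical flag; when set, CGJ and SHY are dropped.
    tailorings: hypothetical caller-supplied dict mapping a variant
    sequence to the sequence it should be treated as equal to.
    """
    if ignore_controls:
        text = text.replace(CGJ, "").replace(SHY, "")
    # Canonically equivalent spellings always get the same key.
    key = unicodedata.normalize("NFD", text)
    for variant, preferred in (tailorings or {}).items():
        key = key.replace(
            unicodedata.normalize("NFD", variant),
            unicodedata.normalize("NFD", preferred),
        )
    return key

def contains(haystack, needle, **flags):
    """Substring search under the given (hypothetical) search flags."""
    return search_key(needle, **flags) in search_key(haystack, **flags)
```

For example, contains("co\u00adoperate", "cooperate") matches with the default flags and fails with ignore_controls=False, so the same stored text can be searched either loosely or with all distinctions kept. A spell checker could reuse the same flags to decide which differences to report.
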