On Mon, Jun 8, 2015 at 12:58 AM, <random...@fastmail.us> wrote:
> On Sun, Jun 7, 2015, at 07:42, Steven D'Aprano wrote:
>> The question of graphemes (what "ordinary people" consider letters and
>> characters, e.g. "ch" is two letters to an English speaker but one letter
>> to a Czech speaker) should be left to libraries.
>
> Do Czech speakers expect to be able to select and delete it as a single
> unit and never have the cursor in the middle of it? If not, then this is
> not really fundamentally the same thing as what we have with combining
> characters or certain sequences of Indic letters.

Not sure about Indic letters, but with combining characters, you
*should* select and delete a single unit containing a base character
and all its combining characters, and you should never have the cursor
in the middle of it. (Not everything gets this right; SciTE, though
otherwise a decent text editor, does allow the cursor to go inside
combining characters.) But I suspect that the Czech "ch", like the
Dutch "ij" and the German "oe" (when it's not ö), should be treated as
two separate characters.

Digression: English has seventy phonograms, which are what words are
really built out of. Digraphs like "th" and "sh" represent single
sounds despite being spelled with multiple letters - but nobody ever
expects them to be treated as single character units just because
other languages spell them "þ" or "ş". The alphabet of English
includes "q", which is not a phonogram on its own ("qu" is), and
doesn't include all the digraphs, and any character-based
representation of English should correspondingly work with letters,
not phonograms.

I don't know Czech well enough to say whether "ch" is more like a
single letter or a phonogram, but even if it basically functions as a
letter, I suspect that treating it as two characters will be no
surprise to most people.

ChrisA
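
To illustrate the combining-character point, here is a minimal sketch
using only the standard library's unicodedata module. The function
names clusters() and delete_last_unit() are made up for this example.
It only attaches marks with a non-zero canonical combining class to the
preceding base character; that is an assumption that covers simple
accents, not a full grapheme-cluster implementation (Unicode's text
segmentation rules handle more cases, including some Indic sequences,
which this does not attempt).

```python
import unicodedata


def clusters(text):
    """Group each base character with the combining marks that follow it."""
    groups = []
    for ch in text:
        # unicodedata.combining() returns a non-zero combining class
        # for combining marks and 0 for everything else.
        if groups and unicodedata.combining(ch):
            groups[-1] += ch
        else:
            groups.append(ch)
    return groups


def delete_last_unit(text):
    """Drop the last base character together with its combining marks."""
    return "".join(clusters(text)[:-1])


word = "cafe\u0301"            # 'e' followed by U+0301 COMBINING ACUTE ACCENT
print(len(word))               # code points: 5
print(len(clusters(word)))     # selectable units: 4
print(clusters("ch"))          # a digraph stays two units: ['c', 'h']
```

With this grouping, a "backspace" built on delete_last_unit() removes
the "e" and its accent together, while "ch" remains two independently
selectable characters, which matches the distinction drawn above.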
-- https://mail.python.org/mailman/listinfo/python-list