On Mon, 11 Dec 2017 21:45:23 +0000 Cibu Johny (സിബു) <c...@google.com> wrote:
> I am assuming the purpose of the grapheme cluster definition is to be
> used for line spacing, vertical writing or cursor movement. Without
> defining the purpose, it is hard for me to say if a ruleset is valid
> or not.

That is a very fair point. Take the example of Thai, an Indic script
which isn't affected by the proposal. There, the spacing vowel signs,
whether before or after, may undergo greater separation when text is
stretched to fill a space. I've seen great separation on hoardings.
The spacing vowel signs are given gc=Lo.

Vertical writing examples are fairly rare, but I've seen 'Yamaha'
written vertically in three horizontal stretches - ยา มา หา. Also,
'video' may be written vertically in three horizontal stretches, as
V D O or as วิ ดี โอ. I'm not absolutely sure I've seen the latter in
Thai script, but Glenn Slayden reports it at
http://www.thai-language.com/phpbb/viewtopic.php?f=11&t=2568&start=0.
The striking thing is that four of these syllables have spacing
vowels, which would be written on their own in writing stretched
horizontally, but associate with the consonant in vertical writing.

I haven't checked on the software-free behaviour of U+0E33 THAI
CHARACTER SARA AM, which is historically a combination of a mark above
and a mark to the right. The Royal Institute Dictionary of 1999
resolves it into NIKHAHIT and SARA AA for what is a very slight
horizontal spacing (e.g. the entry for กระบาล), but I have seen the
NIKHAHIT component still attached to the SARA AA component. However,
I don't know how much control the RID had over the typesetting of the
dictionary.

I think making the proposed change and still saying that cursor motion
should follow the extended grapheme cluster boundaries is contrary to
the Equality Act 2010. It would be knowingly making text editing
harder for the users of most Indic scripts. Those writing a Tai
language in the Tai Tham script would be hit hardest, even if one
mapped compound vowels to simple key stroke sequences.

> Assuming that purpose driven definition, we probably need
> language specific definitions - a pan-indic algorithm may not work.

There is the intermediate level of script-specific definitions. We
already have them - following spacing marks are generally excluded
from the grapheme clusters in the Burmic scripts.

> For instance, the proposed ruleset, may not hold good for Tamil. For
> example, see the title in the following image: துக்ளக் broken as
> [ta-u, ka-virama, lla, ka-virama]. However, as per the proposed
> algorithm it would be: [ta-u, ka-virama-lla, ka-virama]
>
> [image: image.png]
> http://www.chennaispider.com/attachments/Resources/3486-7144-Thuglak-Tamil-Magazine-Chennai.jpg

Thank you for the example. I think the rule for the Tamil script
should be that pulli attaches a following consonant to its grapheme
cluster only in the case of the sequences க்ஷ and ஶ்ரீ, but as I typed
the latter, I was surprised to see the sequence ஶ்ர adopt a conjunct
shape, so I don't know whether I'm seeing variation or a font error.

> Malayalam could be a similar story. In case of Malayalam, it can be
> font specific because of the existence of traditional and reformed
> writing styles. A conjunct might be a ligature in traditional; and it
> might get displayed with explicit virama in the reformed style. For
> example see the poster with word ഉസ്താദ് broken as [u, sa-virama,
> ta-aa, da-virama] - as it is written in the reformed style. As per
> the proposed algorithm, it would be [u, sa-virama-ta-aa, da-virama].
> These breaks would be used by the traditional style of writing.

It seems that the words of UAX#29 have been forgotten - "So tailorings
for aksaras may need to be script-, language-, font-, or
context-specific to be useful". The big problem is that virama leaves
too much up to the font.

Richard.
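
For anyone who wants to reproduce the two sets of breaks quoted above
for துக்ளக் and ഉസ്താദ്, here is a minimal Python sketch. It is not an
implementation of UAX#29 or of the proposal text. It assumes a cluster
is simply a base character plus any following marks (gc=Mn or Mc), and
models the proposed behaviour as "a letter (gc=Lo) directly after a
virama stays in the cluster". The function name clusters, the VIRAMAS
set and the join_after_virama flag are made up for this illustration.

```python
import unicodedata

# Assumption: only the Tamil pulli and the Malayalam virama are handled.
VIRAMAS = {"\u0BCD", "\u0D4D"}

def clusters(text, join_after_virama=False):
    """Split text into rough clusters (a simplification, not UAX#29).

    Default: base character plus following marks (gc=Mn or gc=Mc).
    join_after_virama=True: a letter (gc=Lo) directly after a virama
    is also kept in the current cluster.
    """
    out = []
    for ch in text:
        cat = unicodedata.category(ch)
        is_mark = cat in ("Mn", "Mc")
        joins = (join_after_virama and cat == "Lo"
                 and bool(out) and out[-1][-1] in VIRAMAS)
        if out and (is_mark or joins):
            out[-1] += ch      # extend the current cluster
        else:
            out.append(ch)     # start a new cluster
    return out

tamil = "\u0BA4\u0BC1\u0B95\u0BCD\u0BB3\u0B95\u0BCD"          # துக்ளக்
malayalam = "\u0D09\u0D38\u0D4D\u0D24\u0D3E\u0D26\u0D4D"      # ഉസ്താദ്

for word in (tamil, malayalam):
    print(" | ".join(clusters(word)))
    print(" | ".join(clusters(word, join_after_virama=True)))
```

With the flag off, the Tamil word comes out in four pieces and the
Malayalam word in four; with it on, each comes out in three, as in the
quoted descriptions. The sketch deliberately ignores the script-,
font- and style-specific tailoring discussed above, which is the hard
part.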