On Saturday, October 21, 2017 at 11:51:57 AM UTC+5:30, Chris Angelico wrote:
> On Sat, Oct 21, 2017 at 3:25 PM, Stefan Ram wrote:
> > Rustom Mody writes:
> >>Is there a recommended library for manipulating grapheme clusters?
> >
> > The Python Library has a module "unicodedata", with functions like:
> >
> > |unicodedata.normalize( form, unistr )
> > |
> > |Returns the normal form »form« for the Unicode string »unistr«.
> > |Valid values for »form« are »NFC«, »NFKC«, »NFD«, and »NFKD«.
> >
> > . I don't know whether the transformation you are looking for
> > is one of those.
>
> No, that's at a lower level than grapheme clusters.
>
> Rustom, have you looked on PyPI? There are a couple of hits, including
> one simply called "grapheme".

There is this one-line solution using the regex module (or a two-character
solution: \X!). Not perfect, but a good start:

>>> from regex import findall
>>> veda="""ॐ पूर्णमदः पूर्णमिदं पूर्णात्पुर्णमुदच्यते
... पूर्णस्य पूर्णमादाय पूर्णमेवावशिष्यते ॥
... ॐ शान्तिः शान्तिः शान्तिः ॥"""
>>> findall(r'\X', veda)
['ॐ', ' ', 'पू', 'र्', 'ण', 'म', 'दः', ' ', 'पू', 'र्', 'ण', 'मि', 'दं', ' ', 'पू', 'र्', 'णा', 'त्', 'पु', 'र्', 'ण', 'मु', 'द', 'च्', 'य', 'ते', '\n', 'पू', 'र्', 'ण', 'स्', 'य', ' ', 'पू', 'र्', 'ण', 'मा', 'दा', 'य', ' ', 'पू', 'र्', 'ण', 'मे', 'वा', 'व', 'शि', 'ष्', 'य', 'ते', ' ', '॥', '\n', 'ॐ', ' ', 'शा', 'न्', 'तिः', ' ', 'शा', 'न्', 'तिः', ' ', 'शा', 'न्', 'तिः', ' ', '॥']
>>>

Compare with plain iteration over code points:

>>> [x for x in veda]
['ॐ', ' ', 'प', 'ू', 'र', '्', 'ण', 'म', 'द', 'ः', ' ', 'प', 'ू', 'र', '्', 'ण', 'म', 'ि', 'द', 'ं', ' ', 'प', 'ू', 'र', '्', 'ण', 'ा', 'त', '्', 'प', 'ु', 'र', '्', 'ण', 'म', 'ु', 'द', 'च', '्', 'य', 'त', 'े', '\n', 'प', 'ू', 'र', '्', 'ण', 'स', '्', 'य', ' ', 'प', 'ू', 'र', '्', 'ण', 'म', 'ा', 'द', 'ा', 'य', ' ', 'प', 'ू', 'र', '्', 'ण', 'म', 'े', 'व', 'ा', 'व', 'श', 'ि', 'ष', '्', 'य', 'त', 'े', ' ', '॥', '\n', 'ॐ', ' ', 'श', 'ा', 'न', '्', 'त', 'ि', 'ः', ' ', 'श', 'ा', 'न', '्', 'त', 'ि', 'ः', ' ', 'श', 'ा', 'न', '्', 'त', 'ि', 'ः', ' ', '॥']

What is not working are the vowel-less consonant joins, i.e.
... 'र्', 'ण' ...
[elements 3 and 4 of the findall result, counting from 0] should be a single 'र्ण'.

But it's good enough for me for now, I think.

PS Stefan, I don't see your responses unless someone quotes them. Thanks anyway for the inputs.
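
On the vowel-less consonant joins noted above: one possible workaround is sketched below, using only the standard library. It is a rough approximation, not the Unicode (UAX #29) segmentation that \X implements. The function name aksharas is made up for this sketch, and it assumes Devanagari text in which every combining mark attaches to the preceding unit and a letter that follows a virama (U+094D) continues the same conjunct. ZWJ/ZWNJ and other scripts are not handled.

```python
import unicodedata

VIRAMA = "\u094d"  # DEVANAGARI SIGN VIRAMA (halant)

def aksharas(text):
    """Split text into approximate Devanagari orthographic syllables.

    Hypothetical helper for illustration only; not a UAX #29 implementation.
    """
    units = []
    for ch in text:
        category = unicodedata.category(ch)
        # Vowel signs, anusvara, visarga and the virama itself are marks.
        is_mark = category in ("Mn", "Mc", "Me")
        # A letter right after a virama continues the conjunct.
        joins_conjunct = (
            bool(units) and units[-1].endswith(VIRAMA) and category == "Lo"
        )
        if units and (is_mark or joins_conjunct):
            units[-1] += ch
        else:
            units.append(ch)
    return units

print(aksharas("पूर्णमदः"))  # ['पू', 'र्ण', 'म', 'दः']
```

With this, 'र्' and 'ण' come out as the single unit 'र्ण'. Whether a word-final virama followed directly by another letter should merge this way is a judgment call that this sketch does not try to settle.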