Terry J. Reedy <tjre...@udel.edu> added the comment:

Python makes it easy to transform a sequence with a generator as long as no look-ahead is needed. utf16.UTF16.__iter__ is a typical example. Whenever a surrogate is found, grab the matching one.
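To illustrate the no-look-ahead case, here is a minimal sketch of that kind of generator. It is not the actual utf16.UTF16.__iter__ code; the function name code_points and the choice of integer code units as input are assumptions made for the example. When a lead surrogate turns up, the generator pulls the next unit from the same iterator and combines the pair.

```python
def code_points(units):
    """Yield code points from an iterable of UTF-16 code units (ints).

    Hypothetical helper for illustration; not the utf16.UTF16 implementation.
    """
    it = iter(units)
    for unit in it:
        if 0xD800 <= unit <= 0xDBFF:
            # Lead surrogate: grab the matching trail surrogate right away.
            trail = next(it, None)
            if trail is None or not 0xDC00 <= trail <= 0xDFFF:
                raise ValueError("unpaired lead surrogate")
            yield 0x10000 + ((unit - 0xD800) << 10) + (trail - 0xDC00)
        elif 0xDC00 <= unit <= 0xDFFF:
            # A trail surrogate with no preceding lead is malformed input.
            raise ValueError("unpaired trail surrogate")
        else:
            yield unit
```

For example, list(code_points([0x61, 0xD83D, 0xDE00])) gives [0x61, 0x1F600]. Nothing has to be remembered between iterations, which is why this case is simple.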
However, grapheme clustering does require look-ahead, which is a bit trickier. Assume s is a sanitized sequence of code points with unicode database entries. Ignoring line endings, the following should work (I tested it with a toy definition of mark()):

    def graphemes(s):
        sit = iter(s)
        try:
            graph = [next(sit)]
        except StopIteration:
            graph = []
        for cp in sit:
            if mark(cp):
                graph.append(cp)
            else:
                yield combine(graph)
                graph = [cp]
        yield combine(graph)

I tested this with several inputs, using

    def mark(cp):
        return cp == '.'

    def combine(l):
        return ''.join(l)

Python's object orientation makes formatting easy for the user. Assume someone does the hard work of writing (once ;-) a GCString class with a .__format__ method that interprets the format mini-language for graphemes, using a generalized version of your 'simply horrible' code. This might be done by adapting str.__format__ to use the grapheme iterator above. Then users should be able to write

>>> '{:6.6}'.format(GCString("a̠ˈne̞ɣ̞ð̞o̞t̪a̠"))
"a̠ˈne̞ɣ̞ð̞"

(Note: Thunderbird properly displays characters with the marks beneath even though FireFox does not do so above or in its display of your message.)

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue12729>
_______________________________________