[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

Terry J. Reedy Sat, 27 Aug 2011 13:28:49 -0700

Terry J. Reedy <tjre...@udel.edu> added the comment:

Python makes it easy to transform a sequence with a generator as long as no 
look-ahead is needed. utf16.UTF16.__iter__ is a typical example. Whenever a 
surrogate is found, grab the matching one.
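
A minimal sketch of that kind of no-look-ahead generator, for illustration only. It is not the utf16.UTF16.__iter__ code from this thread; the name code_points and the choice to take UTF-16 code units as plain integers are assumptions made for the example.

```python
def code_points(units):
    """Yield code points from an iterable of UTF-16 code units (ints)."""
    it = iter(units)
    for unit in it:
        if 0xD800 <= unit <= 0xDBFF:            # high (lead) surrogate
            low = next(it, None)                # grab the matching one
            if low is None or not 0xDC00 <= low <= 0xDFFF:
                raise ValueError("unpaired high surrogate")
            yield 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00)
        elif 0xDC00 <= unit <= 0xDFFF:          # low surrogate with no lead
            raise ValueError("unpaired low surrogate")
        else:
            yield unit                          # BMP code point, passed through
```

The generator only ever consumes the unit immediately following a high surrogate, so it needs no buffering beyond the current item.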


However, grapheme clustering does require look-ahead, which is a bit trickier. 
Assume s is a sanitized sequence of code points with Unicode database entries. 
Ignoring line endings, the following should work (I tested it with a toy 
definition of mark()):

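def graphemes(s):
  sit = iter(s)
  try: graph = [next(sit)]
  except StopIteration: return  # empty input: nothing to yield

  for cp in sit:
    if mark(cp):
      graph.append(cp)  # combining mark: attach to the current cluster
    else:
      yield combine(graph)  # a new base character closes the cluster
      graph = [cp]

  yield combine(graph)  # flush the last cluster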

I tested this with several inputs, using
def mark(cp): return cp == '.'
def combine(l): return ''.join(l)
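
As a slightly less toy variant, mark() could be approximated with unicodedata.combining(), which returns a character's canonical combining class (0 for non-combining characters). The sketch below restates the generator so that it runs on its own. Treating a nonzero combining class as "is a mark" is an assumption for illustration: it is not the full Unicode grapheme cluster rule and misses marks whose combining class is 0.

```python
import unicodedata

def mark(cp):
    # Assumption: nonzero canonical combining class means "combining mark".
    return unicodedata.combining(cp) != 0

def combine(l):
    return ''.join(l)

def graphemes(s):
    graph = []
    for cp in s:
        if graph and mark(cp):
            graph.append(cp)          # attach the mark to the current cluster
        else:
            if graph:
                yield combine(graph)  # a new base character closes the cluster
            graph = [cp]
    if graph:
        yield combine(graph)          # flush the last cluster, if any

# 'a' + COMBINING MINUS SIGN BELOW, then 'n', then 'e' + COMBINING DOWN TACK BELOW
clusters = list(graphemes("a\u0320ne\u031e"))
```

Here clusters holds three items, although the input string has five code points.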

Python's object orientation makes formatting easy for the user. Assume someone 
does the hard work of writing (once ;-) a GCString class with a .__format__ 
method that interprets the format mini-language for graphemes, using a 
generalized version of your 'simply horrible' code. This might be done by 
adapting str.__format__ to use the grapheme iterator above. Then users should 
be able to write

>>> '{:6.6}'.format(GCString("a̠ˈne̞ɣ̞ð̞o̞t̪a̠"))
"a̠ˈne̞ɣ̞ð̞"
(Note: Thunderbird properly displays characters with the marks beneath even 
though Firefox does not do so above or in its display of your message.)
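
A rough sketch of what such a class might look like, under stated assumptions. No GCString exists in the standard library, so everything here is hypothetical. The sketch handles only the 'width.precision' subset of the format mini-language (no fill, alignment, or type codes) and reuses the nonzero-combining-class approximation of a mark rather than full grapheme cluster rules.

```python
import unicodedata

class GCString:
    """Hypothetical string wrapper that counts width in graphemes."""

    def __init__(self, text):
        self.text = text

    def graphemes(self):
        graph = []
        for cp in self.text:
            if graph and unicodedata.combining(cp):
                graph.append(cp)          # mark joins the current cluster
            else:
                if graph:
                    yield ''.join(graph)
                graph = [cp]
        if graph:
            yield ''.join(graph)

    def __format__(self, spec):
        # Assumption: spec is '', 'W', '.P' or 'W.P' with decimal integers.
        width, _, precision = spec.partition('.')
        clusters = list(self.graphemes())
        if precision:
            clusters = clusters[:int(precision)]   # truncate by grapheme
        pad = max(0, int(width) - len(clusters)) if width else 0
        return ''.join(clusters) + ' ' * pad       # left-aligned, like str

# Same sequence as the example above, written with escapes:
# a̠ ˈ n e̞ ɣ̞ ð̞ o̞ t̪ a̠ (9 clusters, 15 code points)
word = GCString("a\u0320\u02c8ne\u031e\u0263\u031e\u00f0\u031eo\u031et\u032aa\u0320")
first_six = '{:6.6}'.format(word)
```

With this sketch, first_six contains the first six clusters (ten code points), where plain str formatting would cut after six code points.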

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue12729>
_______________________________________