New submission from Tom Christiansen <tchr...@perl.com>:

Without proper grapheme support in the regular expression library, it is 
impossible to correctly process Unicode.  At the very least, one needs the \X 
escape supported, which matches an extended grapheme cluster per UTS#18.  This 
escape is supported by many regex libraries, including Perl's own and of course 
PCRE (and thence PHP), the standard ICU library, and Matthew Barnett's 
replacement regex library for Python.

How do you process a string by graphemes if you cannot split on \X?  How can 
you avoid splitting a grapheme into silly pieces if you cannot match one?  How 
do I match the letter O no matter what diacritics have been applied to it 
otherwise?  A match of (?=O)\X against an NFD string is by far the simplest and 
best way.
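
As a rough illustration of that last question, here is a minimal sketch using 
only Python's standard library, which has no \X.  It stands in for (?=O)\X by 
normalizing to NFD and collecting each O together with whatever combining 
marks follow it.  The helper name find_o_clusters is hypothetical, and 
treating "marks after a base" as a cluster is an assumption made for 
illustration, not the full extended grapheme cluster definition.

```python
import unicodedata

def find_o_clusters(text):
    # Hypothetical helper: approximates matching (?=O)\X on NFD text.
    # Decompose first so that precomposed letters expose their base "O".
    nfd = unicodedata.normalize("NFD", text)
    clusters = []
    i = 0
    while i < len(nfd):
        if nfd[i] == "O":
            j = i + 1
            # Absorb following marks (general categories Mn, Mc, Me).
            while j < len(nfd) and unicodedata.category(nfd[j]).startswith("M"):
                j += 1
            clusters.append(nfd[i:j])
            i = j
        else:
            i += 1
    return clusters
```

For example, find_o_clusters("\u00d6K") finds one cluster, the decomposed 
O followed by U+0308, while a plain search for "O" in the unnormalized string 
would find nothing.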

This is necessary for a wide variety of reasons.  Adding \pM and \PM goes a 
little way, but not far enough, because that is not how grapheme clusters are 
defined.  You need a proper \X.
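
To make the gap concrete, here is a small sketch, again standard library only, 
of what a \PM\pM* style split gives you.  The function name mark_clusters is 
hypothetical, and the loop only imitates that pattern (a leading mark becomes 
its own piece here, which the regex would not match at all).

```python
import unicodedata

def mark_clusters(text):
    # Hypothetical helper: roughly what splitting on \PM\pM* would yield.
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            # A combining mark attaches to the preceding piece.
            clusters[-1] += ch
        else:
            # Anything that is not a mark starts a new piece.
            clusters.append(ch)
    return clusters

# Base letter plus combining diaeresis: kept together, as desired.
print(mark_clusters("O\u0308"))

# CR LF and a Hangul L + V jamo pair: each is a single extended grapheme
# cluster under the UAX #29 rules, yet neither contains a mark, so this
# approximation splits each of them into two pieces.
print(mark_clusters("\r\n"))
print(mark_clusters("\u1100\u1161"))
```

The first case works, but the other two come back as two pieces each, which 
is exactly where a mark-based pattern and a real \X part ways.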

----------
components: Regular Expressions
messages: 141924
nosy: tchrist
priority: normal
severity: normal
status: open
title: Request for grapheme support in Python re lib
type: feature request
versions: Python 3.2

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue12733>
_______________________________________
_______________________________________________
Python-bugs-list mailing list