MRAB wrote:
Terry Reedy wrote:
I notice from the manual "All identifiers are converted into the
normal form NFC while parsing; comparison of identifiers is based on
NFC." If NFC uses accented letters, then the issue is finessed away
for European words simply because Unicode includes combined
characters for European scripts but not for South Asian scripts.
Does that mean that the re module will need to convert both the pattern
and the text to be searched into NFC form first?
The quote says that Python3 internally converts all identifiers in
source code to NFC before compiling the code, so it can properly compare
them. If this was purely an internal matter, this would not need to be
said. I interpret the quote as a warning that a programmer who wants to
compare a 3.0 string to an identifier represented as a string is
responsible for making sure that *his* string is also in NFC. For instance:
ident = 3
...
if 'ident' in globals(): ...
The second ident must be NFC even if the programmer prefers and
habitually writes another form because, like it or not, the first one
will be turned into NFC before insertion into the code object and later
into globals().
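
A minimal sketch of the point above, assuming a decomposed (NFD) spelling of an identifier in the source text. The name "café" and the use of exec() with a private namespace dict are illustrative choices only, not something from the manual. Current CPython documents NFKC rather than NFC for identifiers; for this particular name the two give the same result.

```python
import unicodedata

# Two logically equal spellings of the same name (illustrative example).
nfc_name = unicodedata.normalize("NFC", "caf\u00e9")  # precomposed e-acute
nfd_name = unicodedata.normalize("NFD", nfc_name)     # 'e' + U+0301 combining acute

# Compile source text that spells the identifier in the decomposed form.
namespace = {}
exec(nfd_name + " = 3", namespace)

# The parser normalized the identifier, so only the composed spelling is a key.
found_as_nfc = nfc_name in namespace
found_as_nfd = nfd_name in namespace
print(found_as_nfc, found_as_nfd)
```

A string that is to be looked up in globals() can be passed through unicodedata.normalize() first if its form is not known.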
So my thought is that re should take the strings as given, but that the
re doc should warn about logically equal forms not matching. (Perhaps
it does already; I have not read it in years.) If a text uses a
different normalization form, which some surely will, the programmer is
responsible for using the same in the re pattern.
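
To illustrate the mismatch, here is a small sketch, assuming re compares codepoints as given. The helper search_normalized() is hypothetical, not part of the re module; it simply normalizes both arguments to one form before searching.

```python
import re
import unicodedata

pattern_nfc = "caf\u00e9"                             # precomposed e-acute
text_nfd = unicodedata.normalize("NFD", "un caf\u00e9")  # 'e' + U+0301

# Taken as given, the logically equal forms do not match.
raw_match = re.search(pattern_nfc, text_nfd)

def search_normalized(pattern, text, form="NFC"):
    """Hypothetical helper: normalize pattern and text to one form, then search."""
    return re.search(unicodedata.normalize(form, pattern),
                     unicodedata.normalize(form, text))

fixed_match = search_normalized(pattern_nfc, text_nfd)
print(raw_match, fixed_match)
```

Normalizing a pattern is only safe when it does not change the pattern's meaning; a character class that lists a base letter and a combining mark separately, for example, would be altered by composition.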
And I'm still not clear
whether \w, when used on a string consisting of Lo followed by Mc,
should match Lo and then Mc (one codepoint at a time) or together (one
character at a time, where a character consists of some base character
codepoint possibly followed by modifier codepoints).
Programs that transform text to glyphs may have to read bundles of
codepoints before starting to output, but my guess is that re should do
the simplest thing and match codepoint by codepoint, assuming that is
what it currently does. I gather that would just mean expanding the
current definition of word char. But I would look at TR18 and see what
Martin says.
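
A rough sketch of what "expanding the definition of word char" could look like at the codepoint level. The choice of general categories (letters, marks, numbers, connector punctuation) is an assumption loosely modeled on TR18, and is_word_codepoint() and word_runs() are hypothetical helpers, not the re module's implementation.

```python
import unicodedata

def is_word_codepoint(ch):
    """Assumed definition: letters (L*), marks (M*), numbers (N*), or Pc."""
    category = unicodedata.category(ch)
    return category[0] in ("L", "M", "N") or category == "Pc"

def word_runs(text):
    """Collect maximal runs of word codepoints, one codepoint at a time."""
    runs, current = [], []
    for ch in text:
        if is_word_codepoint(ch):
            current.append(ch)
        elif current:
            runs.append("".join(current))
            current = []
    if current:
        runs.append("".join(current))
    return runs

ka = "\u0915"       # DEVANAGARI LETTER KA, category Lo
sign_aa = "\u093e"  # DEVANAGARI VOWEL SIGN AA, category Mc
print(unicodedata.category(ka), unicodedata.category(sign_aa))
print(word_runs(ka + sign_aa + " " + ka + "_1"))
```

Matching stays codepoint by codepoint here: Lo and Mc are each accepted individually, and a run stays unbroken only because both count as word codepoints.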
I ask because I'm working on the re module at the moment.
Great. I *think* that the change should be fairly simple.
Terry Jan Reedy
--
http://mail.python.org/mailman/listinfo/python-list