Chris Angelico <ros...@gmail.com>:

> To be quite honest, I wouldn't care about that possibility. If I could
> design regex semantics purely from an idealistic POV, I would say that
> [xyzã], regardless of its encoding, will match any of the four
> characters "x", "y", "z", "ã".
>
> Earlier I posted a suggestion that a folding function be used when
> searching (for instance, it can case fold, NFKC normalize, etc).
> Unfortunately, this makes positional matching extremely tricky; if
> normalization changes the number of code points in the string, you
> have some fiddly work to do to try to find back the match location in
> the original (pre-folding) string. That technique works well for
> simple lookups (eg "find me all documents whose titles contain <this
> string>"), but a regex does more than that. As such, I am in favour of
> the regex engine defining a "character" as a base with all subsequent
> combining, so a single dot will match the entire combined character,
> and square bracketed expressions have the same meaning whether you're
> NFC or NFD normalized, or not normalized. However, that's the ideal
> situation, and I'm not sure (a) whether it's even practical to do
> that, and (b) how bad it would be in terms of backward compatibility.
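For reference, the status quo Chris is objecting to is easy to reproduce
with today's re module. A minimal sketch (standard library only; nothing
beyond re and unicodedata is assumed):

    import re
    import unicodedata

    # Character class containing the precomposed U+00E3 ("ã").
    pattern = re.compile("[xyz\u00e3]")

    nfc = unicodedata.normalize("NFC", "ma\u0303o")   # "mão" as 3 code points
    nfd = unicodedata.normalize("NFD", "ma\u0303o")   # "mão" as 4 code points

    print(pattern.search(nfc))   # <re.Match ...> -- U+00E3 is in the class
    print(pattern.search(nfd))   # None -- neither "a" nor U+0303 alone matches

Whether the class should also match the decomposed form is exactly the
question.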
Here's a proposal:

 * introduce a built-in (predefined) class Text
 * conceptually, a Text object is a sequence of "real" characters
 * you can access each "real" character by its position in O(1)
 * a "real" character is defined to be an integer computed as follows
   (in pseudo-Python; a runnable sketch is appended at the end of this
   message):

       string = the NFC normal form of the real character as a string
       rc = 0
       shift = 0
       for codepoint in string:
           rc |= ord(codepoint) << shift
           shift += 6
       return rc

 * t[n] evaluates to an integer (the nth "real" character)
 * the Text constructor takes a string or an integer
 * str(Text) evaluates to the NFC string form of the Text object
 * Text.encode(...) works like str(Text).encode(...)
 * regular expressions work with Text objects
 * file system functions work with Text objects

Instead of introducing Text, all of this could also be done within the
str class itself:

 * conceptually, a str object is a sequence of integers representing
   Unicode code points *or* "real" characters
 * ord(s) returns the code point or the rc integer from the algorithm
   above
 * chr(n) takes a valid code point or an rc value as defined above
 * s.canonical() returns a string that has merged all multi-code-point
   characters into single "real" characters

Each approach has its upsides and downsides.


Marko
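For concreteness, a rough, self-contained sketch of the "real character"
packing described above. One caveat: the pseudo-code shifts by 6 bits per
code point, but a code point can need up to 21 bits, so this sketch packs
21 bits per element to keep the integer unambiguous and reversible; treat
that width, and the helper names, as my assumptions rather than part of
the proposal.

    import unicodedata

    SHIFT = 21  # bits per packed code point; code points go up to 0x10FFFF

    def real_char(cluster):
        """Pack one NFC-normalized base+combining sequence into an integer."""
        rc = 0
        shift = 0
        for codepoint in unicodedata.normalize("NFC", cluster):
            rc |= ord(codepoint) << shift
            shift += SHIFT
        return rc

    def real_char_str(rc):
        """Inverse of real_char(): unpack the integer back into a string."""
        chars = []
        while rc:
            chars.append(chr(rc & ((1 << SHIFT) - 1)))
            rc >>= SHIFT
        return "".join(chars)

    print(hex(real_char("a\u0303")))    # 0xe3 -- composes to a single U+00E3
    print(hex(real_char("q\u0303")))    # 0x60600071 -- no precomposed form exists
    print(real_char_str(real_char("q\u0303")) == "q\u0303")   # True

The reverse function is roughly what str(Text) would need; as long as
cluster boundaries are indexed when the Text object is built, t[n] stays
O(1).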