On Wed, Jul 19, 2017 at 12:09 AM, Random832 <random...@fastmail.com> wrote:
> On Fri, Jul 14, 2017, at 08:33, Chris Angelico wrote:
>> What do you mean about regular expressions? You can use REs with
>> normalized strings. And if you have any valid definition of "real
>> character", you can use it equally on an NFC-normalized or
>> NFD-normalized string than any other. They're just strings, you know.
>
> I don't understand how normalization is supposed to help with this. It's
> not like there aren't valid combinations that do not have a
> corresponding single NFC codepoint (to say nothing of the situation with
> e.g. Indic languages).
>
> In principle probably a viable solution for regex would be to add
> character classes for base and combining characters, and then
> "[[:base:]][[:combining:]]*" can be used as a building block if
> necessary.

Once you NFC- or NFD-normalize both strings, canonically equivalent
strings will generally have identical codepoints. (There are some
exceptions, and for certain types of matching, you might want to use
NFKC/NFKD instead.) You should then be able to use normal regular
expressions to match correctly.

I don't know of any situations where you want to match "any base
character" or "any combining character"; what you're more likely to
want is "match the letter á", and you don't care whether it's
represented as U+0061 U+0301 or as U+00E1. That's where Unicode
normalization comes in.

ChrisA
--
https://mail.python.org/mailman/listinfo/python-list
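
To make the "match the letter á" case concrete, here is a minimal sketch using only the standard library's unicodedata and re modules. The helper name normalized_search is made up for illustration and is not an existing API; the sketch also assumes NFC is the form you want, which may not suit every kind of matching.

```python
import re
import unicodedata

def normalized_search(pattern, text, form="NFC"):
    """Hypothetical helper: bring pattern and text to the same
    normalization form, then run an ordinary regex search."""
    pattern = unicodedata.normalize(form, pattern)
    text = unicodedata.normalize(form, text)
    return re.search(pattern, text)

composed = "\u00e1"      # á as one codepoint, U+00E1
decomposed = "a\u0301"   # a followed by U+0301 COMBINING ACUTE ACCENT

# Without normalization the two spellings are different strings,
# so a plain search for one does not find the other.
plain = re.search(composed, decomposed)

# After normalizing both sides, either spelling finds the other.
match_1 = normalized_search(composed, "voil" + decomposed)
match_2 = normalized_search(decomposed, "voil" + composed)
```

Here plain is None, while match_1 and match_2 are both match objects. One caveat about the sketch: normalizing a pattern to NFD can change what a character class means (for example "[é]" would turn into a class holding "e" and the combining accent as separate members), so NFC is probably the safer default for patterns.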