On Wed, Jul 19, 2017 at 4:56 AM, Marko Rauhamaa <ma...@pacujo.net> wrote:
> Chris Angelico <ros...@gmail.com>:
>> What I *think* you're asking for is for square brackets in a regex to
>> count combining characters with their preceding base character.
>
> Yes. My example tries to match a single character against a single
> character.
>
>> That would make a lot of sense, and would actually be a reasonable
>> feature to request. (Probably as an option, in case there's a backward
>> compatibility issue.)
>
> There's the flag re.IGNORECASE. In the same vein, it might be useful to
> have re.IGNOREDIACRITICS, which would match
>
>    re.match("^[abc]$", "ä", re.IGNOREDIACRITICS)
>
> regardless of normalization.

That's a different feature, and can be achieved with a different
normalization:

    import unicodedata

    def fold(s):
        """Fold a string for 'search compatibility'.

        Returns a modified version of s with no diacriticals.
        """
        s = s.casefold()
        # Decompose so each diacritic becomes a separate combining character.
        s = unicodedata.normalize("NFKD", s)
        # Drop the combining marks in the range U+0300..U+033F.
        s = ''.join(c for c in s if c < '\u0300' or c > '\u033f')
        return unicodedata.normalize("NFKC", s)

This is something that you might use when searching, as people will
expect to be able to type "cafe" to find "café". It is deliberately
lossy.

But having the re module group code points into logical characters
according to 'base + combining' is a different feature. It may be
worth adding. I don't think your re.IGNOREDIACRITICS is something that
belongs in the stdlib, as different search contexts require different
folding (Google, for instance, will find "ı" when you search for "i" -
but then, Google also finds "python" when you search for "phyton").

ChrisA
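For anyone who wants to try the two ideas side by side, here is a minimal sketch. It repeats fold() from the message so that it runs on its own. fold_match() and clusters() are hypothetical helpers invented for this illustration; neither is part of the re module. clusters() is only a rough approximation of the 'base + combining' grouping, built on unicodedata.combining(), and is not full Unicode grapheme segmentation.

```python
import re
import unicodedata


def fold(s):
    """Lossy fold for searching: casefold, decompose, drop marks, recompose."""
    s = s.casefold()
    s = unicodedata.normalize("NFKD", s)
    # Drops combining marks in U+0300..U+033F only, as in the message above.
    s = "".join(c for c in s if c < "\u0300" or c > "\u033f")
    return unicodedata.normalize("NFKC", s)


def fold_match(pattern, text):
    """Hypothetical helper: apply re.match to the folded text."""
    return re.match(pattern, fold(text))


def clusters(s):
    """Hypothetical helper: attach combining marks to the preceding base."""
    out = []
    for c in s:
        if out and unicodedata.combining(c):
            out[-1] += c      # non-zero combining class: extend the cluster
        else:
            out.append(c)     # start a new cluster with this base character
    return out


composed = "\u00e4"      # 'ä' as a single code point
decomposed = "a\u0308"   # 'a' followed by COMBINING DIAERESIS

# Unfolded, neither spelling matches the one-character class.
assert re.match("^[abc]$", composed) is None
assert re.match("^[abc]$", decomposed) is None

# Folded, both spellings reduce to plain 'a' and match.
assert fold_match("^[abc]$", composed) is not None
assert fold_match("^[abc]$", decomposed) is not None

# The search use case: typing "cafe" finds "café".
assert fold("cafe") in fold("Café au lait")

# Grouping keeps the diacritic but treats base + mark as one unit.
assert clusters(decomposed + "b") == ["a\u0308", "b"]
```

The two helpers answer different questions: fold_match() throws the diacritic away before matching, while clusters() keeps it and only changes what counts as one character.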