(Mail resent with the proper subject. Hi list,
(Please Cc: me when replying, as I'm not subscribed to this list.) I'm working with Unicode strings to handle accented characters but I'm experiencing a few problem. The first one is with regular expression. If I want to match a word composed of characters only. One can easily use '[a-zA-Z]+' when working in ascii, but unfortunately there is no equivalent when working with unicode strings: the latter doesn't match accented characters. The only mean the re package provides is '\w' along with the re.UNICODE flag, but unfortunately it also matches digits and underscore. It appears there is no suitable solution for this currently. Am I right? Secondly, I need to translate accented characters to their unaccented form. I've written this function (sorry if the code isn't as efficient as possible, I'm not a long-time Python programmer, feel free to correct me, I' be glad to learn anything): % def unaccent(s): % """ % """ % % if not isinstance(s, types.UnicodeType): % return s % singleletter_re = re.compile(r'(?:^|\s)([A-Z])(?:$|\s)') % result = '' % for l in s: % desc = unicodedata.name(l) % m = singleletter_re.search(desc) % if m is None: % result += str(l) % continue % result += m.group(1).lower() % return result % But I don't feel confortable with it. It strongly depend on the UCD file format and names that don't contain a single letter cannot obvisouly all be converted to ascii. How would you implement this function? Thank you for your help. Regards, -- Jeremie Le Hen < jlehen at clesys dot fr > ----- End forwarded message ----- -- Jeremie Le Hen < jlehen at clesys dot fr > -- http://mail.python.org/mailman/listinfo/python-list