Bugs item #1611131, was opened at 2006-12-07 23:44
Message generated for change (Comment added) made by akaihola
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1611131&group_id=5470

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.

Category: Regular Expressions
Group: Python 2.5
>Status: Deleted
>Resolution: Invalid
Priority: 5
Private: No
Submitted By: akaihola (akaihola)
Assigned to: Gustavo Niemeyer (niemeyer)
Summary: \b in unicode regex gives strange results

Initial Comment:
The problem:

This doesn't give a match:
>>> re.match(r'ä\b', 'ä ', re.UNICODE)

This works ok and gives a match:
>>> re.match(r'.\b', 'ä ', re.UNICODE)

Both of these work as well:
>>> re.match(r'a\b', 'a ', re.UNICODE)
>>> re.match(r'.\b', 'a ', re.UNICODE)

Docs say \b is defined as an empty string between \w and \W.
These do match accordingly:
>>> re.match(r'\w', 'ä', re.UNICODE)
>>> re.match(r'\w', 'a', re.UNICODE)
>>> re.match(r'\W', ' ', re.UNICODE)

So something strange happens in my first example, and I can't help but
assume it's a bug.

----------------------------------------------------------------------

>Comment By: akaihola (akaihola)
Date: 2006-12-14 02:30

Message:
Logged In: YES
user_id=1432932
Originator: YES

Ok so this does work:
>>> re.match(ur'ä\b', u'ä ', re.UNICODE)

If I understand correctly, I was comparing UTF-8 encoded strings in my
examples (my Ubuntu is UTF-8 by default) and regex special operators just
don't work in that domain.

----------------------------------------------------------------------

Comment By: Georg Brandl (gbrandl)
Date: 2006-12-08 22:51

Message:
Logged In: YES
user_id=849994
Originator: NO

FWIW, the first example works fine for me with and without Unicode strings.

----------------------------------------------------------------------

Comment By: Martin v. Löwis (loewis)
Date: 2006-12-08 19:18

Message:
Logged In: YES
user_id=21627
Originator: NO

Notice that the re.UNICODE flag is only meaningful if you are using Unicode
strings; in the examples you give, you are using byte strings.
Please re-test with Unicode strings both as the expression and as the
string to match.

----------------------------------------------------------------------

Comment By: akaihola (akaihola)
Date: 2006-12-08 00:18

Message:
Logged In: YES
user_id=1432932
Originator: YES

As a work-around I currently use a regex like r'ä(?=\W)'. Seems to work ok.

Also, the \b problem doesn't seem to exist in the \W\w case, i.e. at the
beginning of words.

----------------------------------------------------------------------
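
For readers on a current interpreter, the behaviour discussed in this thread can be approximated with the sketch below. It assumes Python 3, where the ur'' prefix no longer exists and re.UNICODE cannot be combined with a bytes pattern, so the Python 2 byte-string case is imitated by decoding the UTF-8 bytes as Latin-1 (one character per byte). That imitation is an assumption about how the flag classified bytes in 2.5, not something confirmed in the report, and all variable names are illustrative only.

```python
import re

# Assumption: Python 3, where str is always Unicode text.
text = "\u00e4 "             # 'ä' followed by a space, as in the report
raw = text.encode("utf-8")   # b'\xc3\xa4 ', what a UTF-8 terminal would pass in

# Imitate a byte string matched under re.UNICODE by treating every byte
# as its own character: 0xC3 becomes 'Ã' (a letter), 0xA4 becomes '¤' (a symbol).
as_chars = raw.decode("latin-1")

# r'ä\b' on the byte-like text: '¤' and ' ' are both non-word characters,
# so there is no word boundary between them and the match fails.
byte_like_literal = re.match(as_chars[:2] + r"\b", as_chars)

# r'.\b' on the byte-like text: '.' consumes 'Ã', and a boundary does exist
# between the letter 'Ã' and the symbol '¤', so this matches one character.
byte_like_dot = re.match(r".\b", as_chars)

# Real Unicode text on both sides: 'ä' is a word character, the space is not.
unicode_literal = re.match("\u00e4" + r"\b", text)

# The lookahead work-around from the thread, applied to Unicode text.
lookahead = re.match("\u00e4" + r"(?=\W)", text)

print(byte_like_literal)        # None
print(byte_like_dot.group())    # Ã
print(unicode_literal.group())  # ä
print(lookahead.group())        # ä
```

If this reading is right, the '.\b' match in the initial comment ended after the first byte of the encoded 'ä' rather than after the whole character, which would explain why it appeared to work.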