[issue30838] re \w does not match some valid Unicode characters

2017-07-05 Thread Matthew Barnett
Matthew Barnett added the comment: Python identifiers match the regex `[_\p{XID_Start}]\p{XID_Continue}*`. The standard re module doesn't support `\p{...}`, but the third-party "regex" module does.
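A minimal sketch of the pattern above, assuming the third-party "regex" module is installed. The helper name `looks_like_identifier` and the fallback to `str.isidentifier()` when "regex" is missing are illustrative choices, not something proposed in the thread.

```python
try:
    import regex  # third-party module; not part of the standard library
except ImportError:
    regex = None

# Compile the identifier pattern only when the "regex" module is available,
# because the standard re module does not understand \p{...}.
if regex is not None:
    _IDENT = regex.compile(r"[_\p{XID_Start}]\p{XID_Continue}*")
else:
    _IDENT = None


def looks_like_identifier(s):
    """Return True if s has the shape of a Python identifier.

    Hypothetical helper: uses the property-based pattern when "regex" is
    installed, otherwise falls back to the built-in str.isidentifier().
    Like the pattern itself, it does not reject keywords.
    """
    if _IDENT is not None:
        return _IDENT.fullmatch(s) is not None
    return s.isidentifier()


print(looks_like_identifier("_name1"))
print(looks_like_identifier("1name"))
```

The Unicode tables bundled with "regex" may be a different version from the ones in the running interpreter, so the two code paths can disagree on recently added or reclassified characters.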

[issue30838] re \w does not match some valid Unicode characters

2017-07-05 Thread David Lord
David Lord added the comment: After thinking about it more, I guess I misunderstood what \w was doing compared to isidentifier. Since Python just relies on the Unicode database, there's not much to be done anyway. Closing this. For anyone interested, we ended up with a hybrid approach for lexi

[issue30838] re \w does not match some valid Unicode characters

2017-07-03 Thread Matthew Barnett
Matthew Barnett added the comment: In Unicode 9.0.0, U+1885 and U+1886 changed from being General_Category=Other_Letter (Lo) to General_Category=Nonspacing_Mark (Mn). U+2118 is General_Category=Math_Symbol (Sm) and U+212E is General_Category=Other_Symbol (So). \w doesn't include Mn, Sm or So.
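The categories named above can be checked with the standard library alone. The following is a small sketch, not the script from the original report; the helper name `describe` is made up for illustration, and the categories printed depend on the Unicode version shipped with the running interpreter.

```python
import re
import unicodedata

# The four code points discussed in this thread.
CODE_POINTS = [0x1885, 0x1886, 0x2118, 0x212E]


def describe(cp):
    """Collect the general category and the \\w / isidentifier results."""
    ch = chr(cp)
    return {
        "char": f"U+{cp:04X}",
        "category": unicodedata.category(ch),
        "matches_w": re.fullmatch(r"\w", ch) is not None,
        "isidentifier": ch.isidentifier(),
    }


rows = [describe(cp) for cp in CODE_POINTS]
for row in rows:
    print(row)
```

Each row should show a character that `str.isidentifier()` accepts but `\w` rejects, because its general category (Mn, Sm or So) is outside the set that `\w` covers.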

[issue30838] re \w does not match some valid Unicode characters

2017-07-03 Thread David Lord
David Lord added the comment: Adding `or ('a' + s).isidentifier()` to the test in the previous script, to catch valid id_continue characters, reveals many more characters that seem like valid word characters but aren't matched by `\w`.
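The check described above can be sketched as a scan over every code point. This is a reconstruction of the idea, not the script from the report, and the function name `identifier_chars_missed_by_w` is hypothetical.

```python
import re
import sys

WORD = re.compile(r"\w")


def identifier_chars_missed_by_w():
    """Return characters valid inside an identifier that \\w does not match.

    Prefixing each character with 'a' means characters that are only valid
    in continue position (for example combining marks) are also accepted
    by str.isidentifier().
    """
    missed = []
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        if ("a" + ch).isidentifier() and WORD.fullmatch(ch) is None:
            missed.append(ch)
    return missed


missed = identifier_chars_missed_by_w()
print(len(missed))
print([f"U+{ord(ch):04X}" for ch in missed[:10]])
```

The count depends on the interpreter's Unicode database version, so it is printed rather than assumed. Most of the hits are combining marks and similar characters that can continue an identifier but are not letters, digits or the underscore.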

[issue30838] re \w does not match some valid Unicode characters

2017-07-03 Thread ThiefMaster
Changes by ThiefMaster: nosy: +ThiefMaster

[issue30838] re \w does not match some valid Unicode characters

2017-07-03 Thread STINNER Victor
Changes by STINNER Victor: nosy: +serhiy.storchaka

[issue30838] re \w does not match some valid Unicode characters

2017-07-03 Thread David Lord
New submission from David Lord: This came up while writing a regex to match characters that are valid in Python identifiers for Jinja. https://github.com/pallets/jinja/pull/731 `\w` matches all valid identifier characters except for 4 special cases: import unicodedata import re import sys cre